Hive Streaming



Published by eiffel_0311 · 2016-05-10 20:40:46

1. Hive streaming provides three constructs:

map(), reduce(), and transform(). Of these, transform() is the one most commonly used.

2. Identity transformation

select transform(name, salary) using "/bin/cat" as new_name, new_salary from employees where country = 'CHINA';
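As a rough illustration of what the query above does (a sketch, not Hive's own code): by default Hive writes each input row to the script's standard input as one line of tab-separated column values, then reads the output rows back in the same format. An identity script therefore only has to copy lines through, which is exactly what /bin/cat does. The function name identity and the sample rows are made up for this example.

```python
import io

def identity(stdin, stdout):
    """Copy every input row to the output unchanged, like /bin/cat."""
    for line in stdin:
        stdout.write(line)

# Hypothetical rows in the shape Hive streams them:
# tab-separated columns, one row per line.
rows = io.StringIO("John\t1000\nMary\t2000\n")
out = io.StringIO()
identity(rows, out)
print(out.getvalue(), end="")
```

In a real streaming script the two arguments would be sys.stdin and sys.stdout.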

3. Changing types

select transform(name, salary) using "/bin/cat" as (new_name string, new_salary int) from employees where country = 'CHINA';

4. Projection

select transform(name, salary) using "/bin/cut -f1,2" as (new_name string, new_salary int) from employees where country = 'CHINA';

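The cut command can select any subset of the fields. With -f1,2 it keeps the first and second tab-separated fields of each row.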
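To make the projection concrete, here is a minimal Python sketch of roughly what cut -f1,2 does to each streamed row, assuming tab-separated input. The function name project and the sample row are hypothetical, and the sketch does not reproduce every cut option.

```python
def project(line, fields=(1, 2), sep="\t"):
    """Keep the 1-based fields listed in `fields`, roughly like `cut -f1,2`."""
    columns = line.rstrip("\n").split(sep)
    # Skip field numbers that the row does not have.
    return sep.join(columns[i - 1] for i in fields if i <= len(columns))

# Hypothetical three-column row; only the first two columns survive.
print(project("John\t1000\tCHINA"))             # John	1000
print(project("John\t1000\tCHINA", fields=(1,)))  # John
```

Since the query only passes name and salary to the script, -f1,2 keeps both columns. A narrower list such as -f1 would return the name alone.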

5. Manipulating values

select transform(name, salary) using "/bin/sed s/John/Tom/" as (new_name string, new_salary int) from employees where country = 'CHINA';
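The sed expression s/John/Tom/ replaces the first match of John on each line. A small Python sketch of the same per-row behavior follows. The function name sed_substitute and the sample rows are hypothetical. sed and Python use different regular expression dialects, but for a literal word like John they agree.

```python
import re

def sed_substitute(line, pattern="John", replacement="Tom"):
    """Replace the first match of `pattern` in the line, like sed s/John/Tom/."""
    return re.sub(pattern, replacement, line, count=1)

print(sed_substitute("John\t1000"))         # Tom	1000
print(sed_substitute("John Johnson\t900"))  # Tom Johnson	900
```

Note that the substitution is applied to the whole streamed line, not only to the name column, so a match in any column would be rewritten.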

*********************************

6. Processing rows with a script

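1. Create the script test2.py, which upper-cases the name on each row and adds 500 to every salary:

import sys

for line in sys.stdin:
    try:
        # Hive streams each row as one line of tab-separated columns.
        content = line.strip().split("\t")
        name = content[0].upper()
        salary = float(content[1].strip()) + 500
        print("%s\t%f" % (name, salary))
    except Exception:
        # Skip rows that are malformed or have a non-numeric salary.
        continue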

2. Add the file:

add file /opt/hive/current/testscript/test2.py;

3. Call the script:

select new_salary from(select transform(name, salary) using "python test2.py" as new_name, new_salary from employees where country = 'CHINA') as tmp;
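Before running the query on a cluster, it can help to check the per-row logic locally. The sketch below is one possible way to do that. It wraps the transformation in a function, transform_line, which is a hypothetical name and not part of the original script. The raise is a parameter called bonus, also an assumption made for the example. Rows are fed in the tab-separated form Hive uses, including \N, the string Hive sends for NULL values.

```python
def transform_line(line, bonus=500.0):
    """Return the transformed row, or None if the row is malformed."""
    try:
        # Unpacking fails with ValueError when a column is missing.
        name, salary = line.strip().split("\t")[:2]
        return "%s\t%f" % (name.upper(), float(salary) + bonus)
    except ValueError:
        return None

# Hypothetical sample rows: one valid, one with a NULL salary, one too short.
samples = ["john\t1000\n", "mary\t\\N\n", "broken\n"]
results = [transform_line(s) for s in samples]
print(results)  # ['JOHN\t1500.000000', None, None]
```

Rows that return None correspond to the rows the streaming script silently skips, so they disappear from the query result.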

By chaining several transform calls, you can implement MapReduce-style operations.
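The following is a minimal local sketch of that idea, under stated assumptions: one script plays the map stage, a second plays the reduce stage, and the rows are grouped by key in between. In Hive the grouping is typically requested with CLUSTER BY (or DISTRIBUTE BY plus SORT BY) on the inner query. Here a plain sorted() call stands in for it. The function names mapper and reducer, the sample rows, and the choice of summing salaries per name are all made up for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Map stage: emit key/value rows, here upper-cased name and salary."""
    for line in lines:
        name, salary = line.rstrip("\n").split("\t")[:2]
        yield "%s\t%s" % (name.upper(), salary)

def reducer(sorted_lines):
    """Reduce stage: sum the values of consecutive rows sharing a key."""
    keyed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(v) for _, v in group))

# Hypothetical input rows with integer salaries.
rows = ["john\t100\n", "mary\t50\n", "John\t25\n"]
shuffled = sorted(mapper(rows))  # stands in for the grouping step between stages
result = list(reducer(shuffled))
print(result)  # ['JOHN\t125', 'MARY\t50']
```

The reducer relies on rows with the same key arriving next to each other, which is why the grouping step between the two stages matters.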
