Loading Hive Data into an Elasticsearch External Table


奔向大数据的凡小王 · Published 2022-03-17 20:49:18

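1. Overview

I recently received a requirement to export Hive data to Elasticsearch (ES). After some investigation, the approach is to create an external table in Hive and export the data to ES by writing SQL. One caveat: data already written to ES cannot be overwritten this way. The notes below summarize what I learned while using it.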

2. Environment

Hive: 2.1.1
ElasticSearch: 7.17.0
Hadoop: 3.0.0
Spark: 3.1.2

3. Setting up ES

3.1 Extract the archive

# Extract the archive
tar -zxvf elasticsearch-7.17.0-linux-x86_64.tar.gz

3.2 Change ownership

chown -R elasticsearch:elasticsearch elasticsearch-7.17.0

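3.3 Configuration

elasticsearch.yml

# Cluster name
cluster.name: elasticsearch-cluster
# Node name; must be different on every node
node.name: "node-1"
# Whether this node is eligible to be elected master
node.master: true
# Whether this node is a data node
node.data: true
# Custom node attribute, taken from the official documentation
node.attr.rack: r1
# Default number of shards and replicas per index.
# Elasticsearch 5.x and later reject index-level settings in elasticsearch.yml
# and refuse to start, so set these per index or in an index template instead.
# index.number_of_shards: 5
# index.number_of_replicas: 3
# Data path
path.data: /opt/data1/es_data
# Log path
path.logs: /opt/data1/es_logs
# Lock the process memory so it is not swapped out
bootstrap.memory_lock: true
# Address of this node
network.host: 172.16.0.20
# Address to bind to (0.0.0.0 accepts connections on all interfaces)
network.bind_host: 0.0.0.0
# Address published to the other nodes
# network.publish_host: 172.16.0.20
# HTTP port for client requests
http.port: 9500
# Whether to compress TCP transport traffic
transport.tcp.compress: true
# TCP port for node-to-node transport
transport.tcp.port: 9600
# Seed nodes used for discovery
discovery.seed_hosts: ["172.16.0.20:9600", "172.16.0.22:9600", "172.16.0.23:9600"]
# Master-eligible nodes used to bootstrap the cluster the first time it starts
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]
# Start data recovery once this many nodes have joined
gateway.recover_after_nodes: 2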

jvm.options

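3.4 System parameters

# Comment out the swap entries in /etc/fstab and turn swap off
sed -i '/swap/s/^/#/' /etc/fstab
swapoff -a

# Maximum number of open files per user; use the officially recommended 65536 or a larger value
echo "* - nofile 655360" >> /etc/security/limits.conf

# Raise the per-user thread limit
echo "* - nproc 131072" >> /etc/security/limits.conf

# Maximum number of memory map areas a single process may use
echo "vm.max_map_count = 655360" >> /etc/sysctl.conf
# Apply the change immediately
sysctl -p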
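
Before starting ES it can help to confirm that these limits are actually in effect. The following is a minimal sketch, not an official check: the helper names at_least and report are made up for illustration, and the thresholds are simply the values used in this article.

```shell
#!/usr/bin/env bash
# Sketch: compare the current limits against the values configured above.
# at_least and report are hypothetical helper names.

# at_least CURRENT REQUIRED -> succeeds when CURRENT >= REQUIRED.
# "unlimited" (as printed by ulimit) always passes; non-numeric input fails.
at_least() {
  case "$1" in
    unlimited) return 0 ;;
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge "$2" ]
}

# report NAME CURRENT REQUIRED -> prints one OK/LOW line for a setting.
report() {
  if at_least "$2" "$3"; then
    echo "OK   $1=$2 (>= $3)"
  else
    echo "LOW  $1=$2 (< $3)"
  fi
}

report "nofile" "$(ulimit -n 2>/dev/null || echo 0)" 655360
report "nproc" "$(ulimit -u 2>/dev/null || echo 0)" 131072
report "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)" 655360
```

Entries in /etc/security/limits.conf only apply to new login sessions, so run the check in a fresh session of the user that will start ES.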

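3.5 Start

# Run from the bin directory as the elasticsearch user (ES refuses to start as root); -d runs it as a daemon
./elasticsearch -d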

4. Hive configuration

hive> add jar /home/elasticsearch-hadoop-7.17.0.jar;

Or, in hive-site.xml:

<property>
 <name>hive.aux.jars.path</name>
 <value>/path/elasticsearch-hadoop.jar</value>
 <description>A comma separated list (with no spaces) of the jar files</description>
</property>

5. Prepare the data

cat > test.txt << EOF
111,aaa
222,bbb
333,ccc
EOF

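6. Create the external table backed by ES

-- es.nodes and es.port must point at the HTTP endpoint of the ES cluster.
-- The values here differ from the elasticsearch.yml in section 3.3 (172.16.0.20, port 9500); adjust them to your environment.
CREATE EXTERNAL TABLE test(key string, value string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test', 'es.nodes' = '192.168.200.100', 'es.port' = '9200', 'es.nodes.wan.only' = 'true');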
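
Regarding the caveat in section 1 that written data cannot be overwritten: one option that may be worth testing is to let ES-Hadoop use a Hive column as the document id through the es.mapping.id setting. With the default es.write.operation of index, a row whose id already exists replaces the stored document instead of creating a second one. The sketch below only generates a HiveQL file; the table and index name test_by_id and the file name are hypothetical, and I have not verified this behaviour on the exact versions listed in section 2.

```shell
#!/usr/bin/env bash
# Sketch: write a HiveQL file that defines an ES-backed table keyed by the "key" column.
# test_by_id and create_es_table_by_id.hql are hypothetical names used for illustration.
cat > create_es_table_by_id.hql <<'EOF'
-- es.mapping.id makes the Hive column "key" the Elasticsearch document _id,
-- so inserting a row with an existing key replaces that document.
CREATE EXTERNAL TABLE test_by_id(key string, value string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'test_by_id',
  'es.nodes' = '192.168.200.100',
  'es.port' = '9200',
  'es.nodes.wan.only' = 'true',
  'es.mapping.id' = 'key',
  'es.write.operation' = 'index'
);
EOF
echo "wrote create_es_table_by_id.hql"
```

The generated file can then be run with hive -f create_es_table_by_id.hql after the elasticsearch-hadoop jar has been added as described in section 4.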

7. Create the test table

-- Create the source table
CREATE TABLE test1(key string,value string)  row format delimited fields terminated by ',' stored as textfile;

-- Load the data (the path assumes test.txt was created in /mnt)
load data local inpath '/mnt/test.txt' into table test1;

8. Load the data into the ES-backed Hive table (runs as an MR job)

insert into table test select * from test1;

9. Query the data in ES
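
A minimal sketch of how the result could be checked over the ES REST API with curl. It assumes the host, port, and index from the table definition in section 6 and a cluster without security enabled; es_url, es_count, and es_search are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: helpers for checking the exported data through the ES REST API.
# Defaults come from the table definition in section 6; override them via the environment.
ES_HOST="${ES_HOST:-192.168.200.100}"
ES_PORT="${ES_PORT:-9200}"
ES_INDEX="${ES_INDEX:-test}"

# es_url HOST PORT PATH -> prints the full request URL.
es_url() {
  printf 'http://%s:%s/%s\n' "$1" "$2" "$3"
}

# es_count -> number of documents in the index (uses the _count API).
es_count() {
  curl -s "$(es_url "$ES_HOST" "$ES_PORT" "$ES_INDEX/_count")"
}

# es_search -> first 10 documents of the index (uses the _search API).
es_search() {
  curl -s "$(es_url "$ES_HOST" "$ES_PORT" "$ES_INDEX/_search?pretty&size=10")"
}

# Usage, once the cluster is reachable:
#   es_count
#   es_search
```

If the insert in section 8 ran once on the sample file, es_count should report three documents.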

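10. Optimization points

Tune the Linux system parameters
Elasticsearch JVM heap size
Configure multiple data disks for ES
Create ES indices per day
ES shard size
Set the number of replicas to 0 while loading data into ES, then raise it afterwards
Number of nodes in the ES cluster
Adjust the index refresh interval and size
The maximum number of shards per node is 1000 by default and can be raised (cluster.max_shards_per_node)
When exporting Hive data to the ES external table, if the columns of the two tables differ, load the data into an intermediate table first and then into the ES external table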
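
The replica and refresh-interval items above can be applied through the index _settings API. This is a sketch under assumptions rather than a tuned recipe: settings_body and put_settings are hypothetical helper names, the endpoint is the one from section 6, and the restore values (1 replica, 1s refresh) are placeholders to adapt.

```shell
#!/usr/bin/env bash
# Sketch: toggle replica count and refresh interval around a bulk load.
# settings_body and put_settings are hypothetical helper names.

# settings_body REPLICAS REFRESH_INTERVAL -> JSON body for PUT /<index>/_settings.
settings_body() {
  printf '{"index":{"number_of_replicas":%s,"refresh_interval":"%s"}}\n' "$1" "$2"
}

# put_settings BASE_URL INDEX REPLICAS REFRESH_INTERVAL -> sends the update with curl.
put_settings() {
  curl -s -X PUT "$1/$2/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(settings_body "$3" "$4")"
}

# Usage (the index must already exist):
#   put_settings http://192.168.200.100:9200 test 0 -1   # before the Hive insert: no replicas, refresh disabled
#   put_settings http://192.168.200.100:9200 test 1 1s   # after the insert: restore (placeholder values)
```

A refresh_interval of -1 disables periodic refresh, so newly written documents only become searchable after the interval is restored or a refresh is triggered manually.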
