Hadoop

1) What is Hadoop
Hadoop is a software platform for analyzing and processing massive data sets. It is developed in Java and provides a distributed infrastructure.
Hadoop features: high reliability, high scalability, high efficiency, high fault tolerance, and low cost.

2) Hadoop components

HDFS:		distributed file system (core component)
MapReduce:	distributed computing framework (core component)
YARN:		cluster resource management system (core component)
ZooKeeper:	distributed coordination service
Kafka:		distributed message queue
Hive:		data warehouse built on Hadoop
HBase:		distributed column-oriented database

Standalone Hadoop installation and deployment

[root@hdp1 ~]# tar xf hadoop-2.7.7.tar.gz
[root@hdp1 ~]# mv hadoop-2.7.7 /usr/local/hadoop
[root@hdp1 ~]# chown -R 0.0 /usr/local/hadoop
[root@hdp1 ~]# vim /etc/hosts
192.168.1.11  hdp1
[root@hdp1 ~]# cd /usr/local/hadoop
[root@hdp1 hadoop]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share

Configure the Java environment variables and check the version

[root@hdp1 ~]# yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
[root@hdp1 ~]# rpm -ql java-1.8.0-openjdk  | grep jre
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre/bin/policytool
[root@hdp1 ~]# vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
25:  export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre"		# the jre directory from the rpm -ql output above
33:  export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
[root@hdp1 ~]# /usr/local/hadoop/bin/hadoop version
Hadoop 2.7.7
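
If the JDK build ever changes, the JRE path can be derived instead of hard-coded. A minimal sketch, assuming the openjdk package above and that java is on the PATH:

[root@hdp1 ~]# JRE_DIR=$(dirname $(dirname $(readlink -f $(which java))))    # resolve /usr/bin/java to the real jre/bin/java, then strip bin/java
[root@hdp1 ~]# echo ${JRE_DIR}
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64/jre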

Word-frequency ("hot words") analysis example using the bundled example jar

[root@hdp1 ~]# cd /usr/local/hadoop
[root@hdp1 hadoop]# mkdir input
[root@hdp1 hadoop]# cp *.txt input/
[root@hdp1 hadoop]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount  ./input  ./output
# run the wordcount statistics; the output directory must not be created in advance (Hadoop creates it itself)
[root@hdp1 hadoop]# ls ./output/
part-r-00000       .part-r-00000.crc  _SUCCESS           ._SUCCESS.crc
[root@hdp1 hadoop]# cat ./output/*
""AS    2				#分析出现的频次
"AS     17
"COPYRIGHTS     1
"Contribution"  2
"Contributor"   2
"Derivative     1
# other official jars (with ready-made analysis programs) can also be used; they live in the following directory
[root@hdp1 hadoop]# ls /usr/local/hadoop/share/hadoop/mapreduce/
hadoop-mapreduce-client-app-2.7.7.jar         hadoop-mapreduce-client-jobclient-2.7.7-tests.jar
hadoop-mapreduce-client-common-2.7.7.jar      hadoop-mapreduce-client-shuffle-2.7.7.jar
hadoop-mapreduce-client-core-2.7.7.jar        hadoop-mapreduce-examples-2.7.7.jar
hadoop-mapreduce-client-hs-2.7.7.jar          lib
hadoop-mapreduce-client-hs-plugins-2.7.7.jar  lib-examples
hadoop-mapreduce-client-jobclient-2.7.7.jar   sources
# run the jar without arguments to list the available example programs and their usage
[root@hdp1 hadoop]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar

The HDFS distributed file system

HDFS is the foundation for data storage and management in the Hadoop ecosystem. It is a highly fault-tolerant system designed to run on low-cost commodity hardware.
HDFS roles and concepts
Client: splits files; accesses HDFS; interacts with the NameNode to obtain file location information; interacts with DataNodes to read and write data. Files are stored as blocks, 128 MB each by default, and each block can have multiple replicas.
NameNode: the master node; manages the HDFS namespace and block mapping information (fsimage), configures the replication policy, and handles all client requests.
DataNode: a data storage node; stores the actual data and reports storage information to the NameNode.
Secondary NameNode: periodically merges fsimage and fsedits and pushes the result to the NameNode; in an emergency it can help recover the NameNode.
fsimage: the namespace and block mapping information
fsedits: the log of data changes
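
The 128 MB default block size can be confirmed from the effective configuration; hdfs getconf reads the configuration files directly, so the daemons do not even need to be running (a quick check, not part of the original steps):

[root@hdp1 ~]# /usr/local/hadoop/bin/hdfs getconf -confKey dfs.blocksize
134217728			# bytes, i.e. 128 MB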

Hadoop cluster installation and deployment

a) Install Java and configure hostname resolution (run on all machines)

[root@hdp ~]# yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
[root@hdp ~]# vim /etc/hosts
192.168.1.11 node1
192.168.1.12 node2
192.168.1.13 node3
192.168.1.14 hdp

b) Configure passwordless SSH login (run only on hdp)

[root@hdp ~]# ssh-keygen
[root@hdp ~]# for i in hdp node{1..3};do ssh-copy-id  ${i};done
[root@hdp ~]# vim /etc/ssh/ssh_config
# add at around line 60
	StrictHostKeyChecking no
## passwordless login to every node, including hdp itself; skips the yes/no host-key confirmation during verification
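
A quick loop to confirm that every node is reachable with neither a password nor a host-key prompt (a simple sanity check, not part of the original steps):

[root@hdp ~]# for i in hdp node{1..3}; do ssh ${i} hostname; done
hdp
node1
node2
node3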

3) HDFS configuration files

Configuration file syntax – official manual: http://hadoop.apache.org/docs/r2.7.7/
XML syntax – XML stands for eXtensible Markup Language.
Every element name in XML comes as a pair; the closing tag carries a leading /. The Hadoop configuration syntax:

<property>
	<name>key</name>
	<value>value</value>
	<description>description</description>
</property>

a) Configure hadoop-env.sh

[root@hdp ~]# tar xf hadoop-2.7.7.tar.gz
[root@hdp ~]# mv hadoop-2.7.7 /usr/local/hadoop
[root@hdp ~]# chown -R 0.0 /usr/local/hadoop

[root@hdp ~]# vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
25:  export JAVA_HOME="path to the JRE directory, as in the standalone setup"
33:  export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"

b) Configure etc/hadoop/slaves (the default localhost entry must be removed)

[root@hdp ~]# vim /usr/local/hadoop/etc/hadoop/slaves
node1
node2
node3

c) Configure core-site.xml: the file system parameter fs.defaultFS and the data directory parameter hadoop.tmp.dir
Look up parameters in core-default.xml in the official manual.

[root@hdp ~]# vim /usr/local/hadoop/etc/hadoop/core-site.xml   
# configure the file system parameter and the data directory parameter
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdp:9000</value>     <!-- the Hadoop master (NameNode) host -->
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop</value>
    </property>
</configuration>

d) Configure hdfs-site.xml: look up the NameNode address, the Secondary NameNode address, and the replica count

[root@hdp ~]# vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hdp:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hdp:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>      <!-- two replicas -->
    </property>
</configuration>

e) Start the cluster (run only on hdp). The configuration files must be identical on every node of the HDFS cluster!

[root@hdp ~]# for i in node1 node2 node3;do rsync -aXSH --delete /usr/local/hadoop ${i}:/usr/local/;done				# sync the hadoop directory to all nodes
[root@hdp ~]# mkdir /var/hadoop
[root@hdp ~]# /usr/local/hadoop/bin/hdfs namenode -format
# format the NameNode; no error messages means success
[root@hdp ~]# /usr/local/hadoop/sbin/start-dfs.sh
Starting namenodes on [hdp]
hdp: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-hdp.out
node3: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-node3.out
node1: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-node1.out
node2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-node2.out
Starting secondary namenodes [hdp]
# log file locations
hdp: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-hdp.out
Log file naming
<service>-<user>-<role>-<hostname>.out	standard output
<service>-<user>-<role>-<hostname>.log	log messages
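
If a daemon fails to start, its .log file is the first place to look; the file name follows the scheme above. For example, for the NameNode on hdp:

[root@hdp ~]# tail -n 20 /usr/local/hadoop/logs/hadoop-root-namenode-hdp.log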

f) Verify the cluster configuration

# verify the roles with jps
[root@hdp ~]# for i in node1 node2 node3;do echo ${i}; ssh ${i} jps; echo -e "\n"; done
node1
1525 DataNode
1605 Jps

node2
1485 DataNode
1566 Jps

node3
1458 DataNode
1558 Jps
[root@hdp ~]# /usr/local/hadoop/bin/hdfs dfsadmin -report    # verify the cluster
Configured Capacity: 54716792832 (50.96 GB)
Present Capacity: 47724560384 (44.45 GB)
DFS Remaining: 47724535808 (44.45 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (3):							// number of healthy nodes
Name: 192.168.1.11:50010 (node1)
Hostname: node1
Decommission Status : Normal
Configured Capacity: 18238930944 (16.99 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 2156576768 (2.01 GB)
DFS Remaining: 16082345984 (14.98 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.18%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 08 12:09:49 CST 2022
...
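
To confirm that the file system actually accepts data, a few basic HDFS operations can be run (the /test path is an arbitrary example):

[root@hdp ~]# /usr/local/hadoop/bin/hdfs dfs -mkdir /test
[root@hdp ~]# /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/*.txt /test/
[root@hdp ~]# /usr/local/hadoop/bin/hdfs dfs -ls /test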

4) Deploy mapred-site.xml: selecting the MapReduce resource management framework

[root@hdp hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@hdp hadoop]# vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

MapReduce roles

JobTracker:
		the master node (only one); monitors all jobs/tasks and handles errors; decomposes jobs into tasks and dispatches them to TaskTrackers
TaskTracker:
		a slave node (usually many); runs Map Tasks and Reduce Tasks and interacts with the JobTracker to report task status
Map Task:
		parses each data record and passes it to the user-written map() function for execution; writes the output to local disk (or directly to HDFS for a map-only job)
Reduce Task:
		remotely reads its input from the Map Task results, sorts the data, and passes it, grouped, to the user-written reduce() function for execution

5) YARN roles and yarn-site.xml deployment

YARN roles:
ResourceManager:
	handles client requests; allocates and schedules resources; starts and monitors ApplicationMasters; monitors NodeManagers
NodeManager:
	manages the resources of a single node; handles commands from the ResourceManager and from ApplicationMasters
ApplicationMaster:
	splits the input data; requests resources for the application and assigns them to its internal tasks; handles task monitoring and fault tolerance
Container:
	an abstraction of the task execution environment; encapsulates CPU, memory, startup commands and other information needed to run a task; the unit of resource allocation and scheduling

a) Configure yarn-site.xml

[root@hdp ~]# vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hdp</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

b) Start the cluster (run only on hdp)

[root@hdp ~]# for i in node1 node2 node3;do  rsync -avXSH --delete /usr/local/hadoop/etc  ${i}:/usr/local/hadoop/;  done
[root@hdp ~]# /usr/local/hadoop/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-root-resourcemanager-hdp.out
node1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node1.out
node2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node2.out
node3: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-node3.out

c) Verify the cluster

[root@hdp ~]# for i in hdp node1 node2 node3;do echo ${i}; ssh ${i} jps; echo -e "\n"; done
hdp
13329 Jps
2770 NameNode			// three roles on hdp
13060 ResourceManager
4175 SecondaryNameNode


node1
1809 Jps
1684 NodeManager
1525 DataNode


node2
11712 Jps
11586 NodeManager
1485 DataNode


node3
1458 DataNode
11589 NodeManager
11717 Jps
[root@hdp ~]# /usr/local/hadoop/bin/yarn node -list
22/03/08 12:29:41 INFO client.RMProxy: Connecting to ResourceManager at hdp/192.168.1.14:8032
Total Nodes:3
 Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 node3:36274                RUNNING        node3:8042                   0
 node2:32906                RUNNING        node2:8042                   0
 node1:40513                RUNNING        node1:8042                   0
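
With HDFS and YARN both up, the standalone wordcount example from earlier can be rerun as a real distributed job; the input and output now live in HDFS rather than on the local disk (the /input and /output paths are illustrative):

[root@hdp ~]# cd /usr/local/hadoop
[root@hdp hadoop]# ./bin/hdfs dfs -mkdir /input
[root@hdp hadoop]# ./bin/hdfs dfs -put *.txt /input/
[root@hdp hadoop]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
[root@hdp hadoop]# ./bin/hdfs dfs -cat /output/*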

d) Web UI access ports

[screenshots of the web UIs]
NameNode:		http://hdp:50070
Secondary NameNode:	http://hdp:50090
YARN ResourceManager:	http://hdp:8088		(Hadoop 2.x default, not set explicitly above)
NodeManager:		http://node1:8042	(likewise node2 and node3)
