Building a Big Data Platform: A Detailed Guide

dingyanming · published 2018-05-12 11:41:37

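Building a Big Data Platform (Lab Edition)

1 Introduction to big data

1.1 Background

Since the start of the 21st century, computer science has advanced rapidly and living standards have risen, and the data generated by all kinds of transactions has grown just as fast. Take the 2016 Tmall Double 11 sale: according to the real-time figures published by Alibaba, by 24:00:00 on the 11th the total transaction volume of the 2016 Tmall Double 11 Global Shopping Festival had exceeded 120.7 billion yuan, mobile transactions accounted for 81.87% of it, and 235 countries and regions were covered. JD.com's 618 sale and similar events produce comparable volumes. Computing, aggregating, and analyzing data on this scale cannot be done on a single computer, and it is hard to say how powerful one machine would have to be to manage it. This is what gave rise to distributed computing: a huge job is split across many computers that work in parallel, and the partial results are merged at the end. That is the idea behind big data.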

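1.2 Why big data matters

Big data has developed quickly in recent years, and more and more people are learning it, many of them switching over from other fields. I am one of them; I previously worked with the STL language. Why are big data salaries high, and why are so many people learning it? Because big data matters a great deal: it is everywhere and touches a large number of industries. A few examples follow.

1.2.1 Finance

Finance covers a lot of ground, P2P lending among other things. How do financial companies make money? Like banks, by selling wealth-management products. When you buy such a product, the company uses your money as working capital for loans. Before lending, it has to assess the borrower's creditworthiness (known in the industry as credit investigation), which means analyzing as much data about the borrower as possible. Because the volume is large, a big data platform is needed to supply the underlying data.

1.2.2 Telecommunications

Telecom operators can build models that predict whether you will cancel your broadband next month.

1.2.3 E-commerce

E-commerce companies likewise use algorithms to build models that work out which products you like and send you their URLs (known in the industry as precision push).

1.2.4 Machine learning

Machine learning also needs large amounts of data to learn from. The large training sets needed for supervised learning, for example, come from big data platforms. Many industries are involved, so I will not go into more detail here.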

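2 Building the big data platform (lab edition)

2.1 About this setup

Because of limited resources, this is a lab setup rather than a real production environment. The approach is the same for both; production simply uses higher specifications and needs some adjustments for production issues. This guide uses VMware to create 5 servers; adjust the specifications to what your own computer can handle.

The 5 machines are named node1 through node5.

2.2 Building the node1 server

Open VMware, click New, choose Typical, and click Next.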

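Then select Linux as the guest operating system and CentOS 64-bit as the version, and click Next.

Choose the virtual machine's name and the location where it will be stored, then click Next.

Then make the selection shown in the screenshot.

Choose Customize Hardware.

Memory, processors, and so on can be set to suit your needs. Note that the network adapter must use NAT mode (a custom network also works). With the other modes the IP address can change, which causes problems once the cluster is running. My configuration is shown in the screenshot.

After configuring the hardware, click Finish and then power on the virtual machine. To keep things short, the following screenshots are covered only briefly.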

Boot menu options:

1. Install or upgrade an existing system
2. Install system with basic video driver: use a basic video driver during installation
3. Rescue installed system: enter rescue mode
4. Boot from local drive: leave the installer and boot from the hard disk
5. Memory test
6. Press Tab to edit options

Choose option 1 here.

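Then proceed as shown in the screenshots.

Either English or Chinese can be chosen as the language; English is recommended here.

Then choose the US English keyboard.

Choose Basic Storage Devices; there is no need to pick specialized storage.

Choose to discard any data on the disk.

My network configuration is shown in the screenshot; define yours according to your own network.

After editing, click Apply to leave the edit page, then click Next.

Choose Asia/Shanghai as the time zone.

Set the server's root password.

Then choose Use All Space. Partitioning is not covered further here; look into it yourself if you are interested.

Then choose the installation type. Because this is a lab environment and we want to develop on it conveniently, the desktop option is installed; in real production the basic server option is enough.

The system installation then runs.

Accept the defaults for everything after that.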

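After logging in, use the root user to create a data directory under the root directory (/) to mimic an enterprise data disk, and give ownership of the app directory to the hadoop user with the following command:

chown -R hadoop:hadoop /app

All software in this guide is installed under this directory. Under app we create two directories, opt and pro. At this point the system is more or less ready; anything else that is needed will be added later.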
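The directory setup described here can be sketched as a small script. This is only an illustration under stated assumptions: `APP_ROOT` is a hypothetical variable that defaults to a scratch directory so the script is safe to try, and the `data` and `logs` subdirectories are added because the configuration files later in this guide write to /app/data and /app/logs.

```shell
# Sketch: create the directory layout used in this guide.
# APP_ROOT (a name assumed for this example) defaults to a scratch directory;
# on a real node, run as root with APP_ROOT=/app.
APP_ROOT="${APP_ROOT:-$(mktemp -d)}"

# opt and pro are the two directories named in the text;
# data and logs are referenced by the configuration files later on.
for d in opt pro data logs; do
  mkdir -p "$APP_ROOT/$d"
done

# Hand the tree to the hadoop user only when we are root and that user exists.
if [ "$(id -u)" -eq 0 ] && id hadoop >/dev/null 2>&1; then
  chown -R hadoop:hadoop "$APP_ROOT"
fi

echo "layout created under $APP_ROOT"
```

Running it as root with `APP_ROOT=/app` should produce the layout the rest of the guide relies on.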

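2.3 Installing the base dependencies

2.3.1 Installing the JDK

The system ships with OpenJDK by default. I find the bundled version awkward to use, so I removed it.

As root, list the installed packages:

rpm -qa | grep jdk

Then remove the packages it lists, either with yum -y remove <package> or, as I did, with rpm:

rpm -e --nodeps java-1.7.0-openjdk-1.7.0.45-2.4.3.3.el6.x86_64

rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.66.1.13.0.el6.x86_64

To check that the removal worked, run java; if the command fails, OpenJDK has been removed cleanly.

We use JDK 1.8 here because Scala and other components depend on it (1.7 also works; 1.6 is not recommended, since very few people still use it).

JDK installation is well documented elsewhere, so only a brief outline is given here.

My JDK is unpacked under /app/opt/jdk.

Next, configure the system-wide environment variables:

vim /etc/profile

Add the following lines:

export JAVA_HOME=/app/opt/jdk/jdk1.8.0_141

export PATH=$PATH:$JAVA_HOME/bin

Then reload the profile:

source /etc/profile

The JDK is now installed. Verify it by running java -version; if it prints the version, the JDK is installed and the environment variables have taken effect.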
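Editing /etc/profile by hand on five machines is easy to get wrong, so the JAVA_HOME step could also be scripted. The sketch below is one possible approach, not part of the original setup: `add_java_home` is a hypothetical helper that appends the two export lines only if JAVA_HOME is not already exported in the file, and the demonstration runs against a scratch file instead of /etc/profile.

```shell
# add_java_home PROFILE JDK_DIR
# Appends JAVA_HOME/PATH exports to PROFILE unless JAVA_HOME is already exported there.
add_java_home() {
  profile="$1"
  jdk_dir="$2"
  if grep -q '^export JAVA_HOME=' "$profile" 2>/dev/null; then
    echo "JAVA_HOME already set in $profile"
    return 0
  fi
  {
    echo "export JAVA_HOME=$jdk_dir"
    echo 'export PATH=$PATH:$JAVA_HOME/bin'
  } >> "$profile"
}

# Try it on a scratch file first; on a real node PROFILE would be /etc/profile (as root).
PROFILE="$(mktemp)"
add_java_home "$PROFILE" /app/opt/jdk/jdk1.8.0_141
add_java_home "$PROFILE" /app/opt/jdk/jdk1.8.0_141   # second call changes nothing
cat "$PROFILE"
```

Because the helper checks before appending, it can be re-run on every node without duplicating the lines.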

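2.3.2 Installing Scala

We will be using Spark later, and Spark is written in Scala, so Scala needs to be installed.

Scala is not normally preinstalled, so there is nothing to remove first.

Download the Scala tar package in the same way as the JDK (the version and installation directory I used were shown in screenshots).

Then edit the profile and add the Scala environment variables:

vim /etc/profile

Then reload it:

source /etc/profile

Verification works the same way as for Java, so it is not repeated here.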

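2.3.3 Python

The Python bundled with the system is 2.6.6, which covers our basic needs, so I did not install another version. If you need one, installation is straightforward and well documented.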

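2.3.4 System preparation

2.3.4.1 Preparing five servers

Copy the existing server five times (cloning also works) into a location with more than 100 GB of free space, and name the copies node1, node2, node3, node4, and node5. My laptop was short on space, so I put them on an external hard drive.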

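2.3.4.2 Other settings

1 Change the IP addresses of the 5 servers to 192.168.21.161 through 192.168.21.165. Because the virtual machines were copied, each copy gets a new MAC address and with it a new network interface that is not eth0, so the interface names count up from eth0. To keep them uniform, rename the interface back to eth0 on every machine. Start each copied machine and do the following:

1 cd /etc/udev/rules.d/

2 Back up the rules file before editing it, in case something goes wrong: cp 70-persistent-net.rules 70-persistent-net.rules.bak

3 vim 70-persistent-net.rules, comment out the eth0 line, and change eth1 to eth0.

4 vim /etc/sysconfig/network-scripts/ifcfg-eth0, set HWADDR to the MAC address from the line you just renamed to eth0, configure the IP address, and set ONBOOT to yes (the default is no; yes brings the interface up automatically at boot).

Then restart networking for the changes to take effect:

service network restart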

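5 Because the virtual machines were copied, change the hostnames to node1 through node5 in turn.

6 The command hostname node2 changes the hostname to node2 only temporarily; after a reboot the old name comes back. To make the change permanent, edit /etc/sysconfig/network and set the HOSTNAME line (for example HOSTNAME=node2):

vim /etc/sysconfig/network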
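The configuration files later in this guide refer to the machines by name (node3:8485, node3:2888 and so on), so every node has to resolve node1 through node5 to the addresses assigned above. The guide does not say how name resolution was set up; the sketch below assumes /etc/hosts entries and simply prints them. `print_hosts` is a hypothetical helper name.

```shell
# Print one "IP hostname" line per node: node1..node5 -> 192.168.21.161..165.
print_hosts() {
  i=1
  while [ "$i" -le 5 ]; do
    echo "192.168.21.$((160 + i)) node$i"
    i=$((i + 1))
  done
}

print_hosts
# On a real node (as root) the output could be appended with:
#   print_hosts >> /etc/hosts
```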

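7 Disable SELinux. It is enabled by default and needs to be turned off. The usual commands are:

sestatus                   show the current status

setenforce 0               disable temporarily (permissive mode until the next reboot)

vim /etc/selinux/config    change SELINUX=enforcing to SELINUX=disabled

8 Disable the firewall

service iptables status    show the firewall status

service iptables stop      stop the firewall

chkconfig iptables off     keep the firewall off after reboots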
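The SELinux edit can be done without opening an editor. The following is a minimal sketch: `disable_selinux` is a hypothetical helper that keeps a .bak copy and rewrites only the SELINUX= line, and it is demonstrated on a scratch file instead of the real /etc/selinux/config.

```shell
# disable_selinux CONFIG_FILE
# Rewrites the SELINUX= line to "disabled", keeping a .bak copy of the original.
disable_selinux() {
  cfg="$1"
  cp "$cfg" "$cfg.bak"
  # The pattern requires "=" right after SELINUX, so SELINUXTYPE= is left alone.
  sed 's/^SELINUX=.*/SELINUX=disabled/' "$cfg.bak" > "$cfg"
}

# Demonstration on a scratch copy; the real file is /etc/selinux/config (edit as root).
CFG="$(mktemp)"
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$CFG"
disable_selinux "$CFG"
cat "$CFG"
```

The change in the config file takes effect after a reboot; `setenforce 0` covers the time until then.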

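9 Raise the maximum number of open files. The default is 1024; we change it to 65535, lifting the system's limits on the number of processes and open files for all users.

ulimit -a           show the current limits

ulimit -n 65535     raise the open-file limit for the current session only

To make the change permanent:

vim /etc/security/limits.conf

Adjust the entries to suit your situation and apply them to all users; /etc/profile can also be used. I set open files to 65535 and max user processes to 257612.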
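The text gives the target values but not the limits.conf lines themselves. As a sketch, assuming the limits are applied to every user (domain `*`) and that soft and hard limits are set to the same value, the entries would look like the output of this hypothetical `print_limits` helper:

```shell
# print_limits: emit limits.conf entries for every user ("*").
# nofile = max open files, nproc = max user processes (values from the text).
print_limits() {
  for kind in soft hard; do
    echo "* $kind nofile 65535"
    echo "* $kind nproc 257612"
  done
}

print_limits
# On a real node (as root):
#   print_limits >> /etc/security/limits.conf
```

The new limits apply to sessions opened after the change; `ulimit -a` in a fresh login shows whether they took effect.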

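10 Make the machine boot to the text console by default: vim /etc/inittab, change id:5: to id:3:, and reboot.

The system-level dependencies are now more or less in place. Other components, such as ntp, will be installed later as needed.

11 The 5 machines are now configured identically except for their IP addresses and hostnames (the hardware settings can of course differ between master and worker nodes).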

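12 Passwordless SSH login. Set this up as the hadoop user. Master and worker nodes in a Hadoop cluster need key-based access to each other. Strictly speaking only master-to-worker access is required, but for convenience I set it up between all nodes, which also makes it easier to sync files later.

Taking the hadoop user on node1 as an example:

Go to the /home/hadoop/.ssh directory.

Run ssh-keygen -t dsa to generate the public and private keys.

Append the public key to the authorized_keys file: cat id_dsa.pub >> authorized_keys

Copy authorized_keys to node2, node3, node4, and node5.

Finally, run chmod 600 authorized_keys on each node. node1 can now log in to the other nodes without a password. Repeat the same steps for the other nodes.

Note that hdfs-site.xml in section 2.4.3.4 points sshfence at /home/hadoop/.ssh/id_rsa, so an RSA key (ssh-keygen -t rsa) is needed for fencing to find its key file.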
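As a sketch of the key distribution from node1, the commands can be listed before running them. This example deliberately swaps two things relative to the steps above: it uses an RSA key instead of DSA (the HDFS fencing configuration later in the guide references id_rsa, and OpenSSH 7.0 and later disable DSA keys by default), and it uses ssh-copy-id, which appends the public key to the remote authorized_keys and sets its permissions, in place of the manual cat, copy, and chmod. `plan_passwordless_ssh` is a hypothetical helper that only prints the commands.

```shell
# Print (do not run) the commands that would set up key-based login from node1.
KEY=/home/hadoop/.ssh/id_rsa

plan_passwordless_ssh() {
  # Generate an RSA key pair with an empty passphrase.
  echo "ssh-keygen -t rsa -N '' -f $KEY"
  # Install the public key for the hadoop user on every other node.
  for n in node2 node3 node4 node5; do
    echo "ssh-copy-id -i $KEY.pub hadoop@$n"
  done
}

plan_passwordless_ssh
```

Each printed ssh-copy-id command asks for the hadoop password on the target node once; after that, `ssh hadoop@node2` from node1 should not prompt.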

2.4 Installing Hadoop

2.4.1 Downloading Hadoop

You can download the official prebuilt package or download the source and build it yourself. Here we use the official prebuilt release 2.7.3, to try a recent version.

2.4.1 Unpacking

Put hadoop-2.7.3.tar.gz under /app/opt/hadoop/ and unpack it.

2.4.2 Directory layout

Go into the unpacked directory and list it:

cd hadoop-2.7.3/
ll

2.4.2.1 bin

The bin directory mainly holds the Hadoop commands:

container-executor         runs containers
hadoop                     runs Hadoop commands such as hadoop fs (Linux)
hadoop.cmd                 same as hadoop (Windows)
hdfs                       runs HDFS commands such as hdfs dfsadmin (Linux)
hdfs.cmd                   same as hdfs (Windows)
mapred                     runs MapReduce commands (Linux)
mapred.cmd                 same as mapred (Windows)
rcc                        generates Java and C++ Hadoop Record class code
test-container-executor    tests container-executor
yarn                       runs YARN commands such as yarn node -list (Linux)
yarn.cmd                   same as yarn (Windows)

See the official documentation or the commands' built-in help for details.

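2.4.2.2 etc

The hadoop directory under etc holds Hadoop's configuration files. The defaults are:

capacity-scheduler.xml         Capacity Scheduler settings, used together with YARN
configuration.xsl              XSL stylesheet referenced by the XML configuration files (I am not entirely sure of its role)
container-executor.cfg         container-executor settings
core-site.xml                  Hadoop core configuration
hadoop-env.cmd                 Hadoop runtime environment (Windows)
hadoop-env.sh                  Hadoop runtime environment (Linux)
hadoop-metrics2.properties     controls Hadoop metrics reporting
hadoop-metrics.properties      controls Hadoop metrics reporting
hadoop-policy.xml              authorization settings
hdfs-site.xml                  HDFS configuration
httpfs-env.sh                  HttpFS runtime environment
httpfs-log4j.properties        HttpFS log4j settings
httpfs-signature.secret        appears to be a secret used for authentication (I am not sure)
httpfs-site.xml                HttpFS configuration
kms-acls.xml                   KMS access control
kms-env.sh                     KMS runtime environment
kms-log4j.properties           KMS log4j settings
kms-site.xml                   KMS configuration
log4j.properties               log4j settings
mapred-env.cmd                 MapReduce runtime environment (Windows)
mapred-env.sh                  MapReduce runtime environment (Linux)
mapred-queues.xml.template     MapReduce queue configuration template
mapred-site.xml.template       MapReduce configuration template
slaves                         list of worker nodes
ssl-client.xml.example         SSL client configuration example
ssl-server.xml.example         SSL server configuration example
yarn-env.cmd                   YARN runtime environment (Windows)
yarn-env.sh                    YARN runtime environment (Linux)
yarn-site.xml                  YARN configuration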

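2.4.2.4 include

Header files for the programming libraries Hadoop exposes, used together with lib. The headers are written in C++ and are typically used by C++ programs that access HDFS or implement MapReduce jobs.

2.4.2.5 lib

The dynamic and static libraries Hadoop exposes, used together with the headers in include.

2.4.2.6 libexec

Shell scripts that the Hadoop startup commands call for initialization.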

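2.4.2.7 sbin

distribute-exclude.sh      I have not used it and am not sure what it does
hadoop-daemon.sh           starts or stops a single daemon on the local node
hadoop-daemons.sh          starts or stops a daemon on multiple nodes
hdfs-config.cmd            HDFS configuration helper called by the other scripts (Windows)
hdfs-config.sh             HDFS configuration helper called by the other scripts (Linux)
httpfs.sh                  HttpFS control script
kms.sh                     KMS control script
mr-jobhistory-daemon.sh    starts or stops the MapReduce JobHistory server
refresh-namenodes.sh       refreshes the NameNode list on the DataNodes
slaves.sh                  runs a command on all hosts listed in the slaves file
start-all.cmd              starts all components (Windows)
start-all.sh               starts all components (Linux)
start-balancer.sh          starts the HDFS balancer manually
start-dfs.cmd              starts HDFS (Windows)
start-dfs.sh               starts HDFS (Linux)
start-secure-dns.sh        starts the secure DataNodes
start-yarn.cmd             starts YARN (Windows)
start-yarn.sh              starts YARN (Linux)
stop-all.cmd               stops all Hadoop components (Windows)
stop-all.sh                stops all Hadoop components (Linux)
stop-balancer.sh           stops the balancer
stop-dfs.cmd               stops HDFS (Windows)
stop-dfs.sh                stops HDFS (Linux)
stop-secure-dns.sh         stops the secure DataNodes
stop-yarn.cmd              stops YARN (Windows)
stop-yarn.sh               stops YARN (Linux)
yarn-daemon.sh             starts or stops a YARN daemon on a single node
yarn-daemons.sh            starts or stops YARN daemons on multiple nodes

2.4.2.8 share

Shared demos and the compiled jar packages.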

2.4.3 Hadoop configuration files

2.4.3.1 capacity-scheduler.xml

<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>offline,online,default</value>
    <description>The sub queues in the newly created root queue include offline queues, online queues, default queues</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.capacity</name>
    <description>Offline occupancy ratio</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.queues</name>
    <value>hive,sqoop,spark_thrift</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.hive.capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.hive.user-limit-factor</name>
    <description>The limit on the resources each user can use</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.hive.maximum-capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.sqoop.capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.sqoop.user-limit-factor</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.sqoop.maximum-capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.spark_thrift.capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.spark_thrift.user-limit-factor</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.spark_thrift.maximum-capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.online.maximum-capacity</name>
    <description>
      The maximum capacity of the default queue.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.online.capacity</name>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <description>
      Maximum number of applications that can be pending and running.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>
      The queues at the this level (root is the root queue).
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <description>Default queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <description>
      Default queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <description>
      The maximum capacity of the default queue.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>
      The state of the default queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <description>
      The ACL of who can submit jobs to the default queue.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <description>
      The ACL of who can administer jobs on the default queue.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <description>
      Number of missed scheduling opportunities after which the CapacityScheduler
      attempts to schedule rack-local containers.
      Typically this should be set to number of nodes in the cluster, By default is setting
      approximately number of nodes in one rack which is 40.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
  </property>
</configuration>

2.4.3.2 core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>fs.defaultFS</name>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/data/hadoop/hdfs/tmp</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.hosts</name>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.groups</name>
  </property>
  <property>
    <name>hadoop.security.instrumentation.requires.admin</name>
    <value>false</value>
  </property>
</configuration>

2.4.3.3 hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=/app/opt/jdk/jdk1.8.0_141

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-XX:NewSize=228m -XX:MaxNewSize=512m -Xms2048m -Xmx2048m -XX:PermSize=228m -XX:MaxPermSize=356m -Xloggc:/app/logs/hadoop/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/app/logs/hadoop

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=/app/data/hadoop/pids
export HADOOP_SECURE_DN_PID_DIR=/app/data/hadoop/pids

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

2.4.3.4 hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>dfs.nameservices</name>
  </property>
  <property>
    <name>dfs.ha.namenodes.ns</name>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns.nn1</name>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns.nn1</name>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns.nn2</name>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns.nn2</name>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://node3:8485;node4:8485;node5:8485/ns</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/app/data/hadoop/hdfs/journal</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.ns</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///app/data/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///app/data/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
  </property>
</configuration>

2.4.3.5 mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
  </property>
  <property>
    <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
    <value>false</value>
  </property>
</configuration>

2.4.3.6 slaves

node2

node3

node4

node5

2.4.3.7 yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
  </property>
</configuration>

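We will stop configuring here for now. If other settings need to be changed later, we can revise them then. My knowledge is limited, so please bear with any mistakes.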

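2.4.4 Installing ZooKeeper

2.4.4.1 A brief introduction to ZooKeeper

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

1 The basics

ZooKeeper works in terms of roles. The roles are:

Leader      Initiates and decides votes and updates the system state. A cluster has exactly one leader; put simply, it is the master node of the ZooKeeper cluster.

Follower    Accepts requests from clients (port 2181 by default), returns the results to them, and votes in leader elections.

Observer    Accepts client requests and forwards them to the leader; it does not vote.

Client      Sends requests to the ZooKeeper cluster.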

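2 Overall architecture

The servers that make up ZooKeeper must all be able to communicate with one another. They keep the service state in memory, along with a transaction log and persistent snapshots. ZooKeeper is available as long as a majority of the servers are available.

A client connects to one ZooKeeper server and maintains a TCP connection over which it sends requests, receives responses and events, and sends heartbeats. If the TCP connection breaks, the client can connect to another server.

(2) ZooKeeper is ordered

ZooKeeper stamps every update with a number, which guarantees that interactions are ordered. Later operations can rely on this order to build higher-level abstractions such as synchronization primitives.

(3) ZooKeeper is fast

ZooKeeper is especially fast in read-dominated systems. It performs well on distributed systems of thousands of servers where reads outnumber writes by about 10:1.

(4) Data model and hierarchical namespace

ZooKeeper's namespace is structured much like a file system. A name is a path with elements separated by /, just like a file path, and every node in ZooKeeper is uniquely identified by its path.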

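3 Storage architecture

ZooKeeper stores its data in a tree structure, and znodes come in 4 different types.

(1) znode

As described at the start of this section, a ZooKeeper cluster maintains its own data structure. It is a tree, and each node in it is called a znode.

· Each znode can hold up to 1 MB of data by default (enough for data that records state).

· You can log in to ZooKeeper with the zkCli command and operate on znodes with commands such as ls, create, delete, and sync.

· Besides its name and data, a znode has a set of zxid attributes. A zxid corresponds to a timestamp and records the state of the znode at different points (these are used later).

Each znode also carries access permissions. Broken down further, its attributes are:

czxid: zxid of the transaction that created the node
mzxid: zxid of the most recent modification to the znode
ctime: creation time of the znode, in milliseconds since the epoch
mtime: last modification time of the znode, in milliseconds since the epoch
version: number of changes to the znode's data
cversion: number of changes to the znode's children
aversion: number of changes to the znode's ACL
ephemeralOwner: session ID of the owner if the znode is ephemeral; zero otherwise
dataLength: length of the znode's data
numChildren: number of children of the znode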

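(2) znode types

ZooKeeper maintains a set of znodes arranged as a tree, and each znode is described by many attributes. Znodes also fall into four types according to how long they live.

A znode is created by a client, and its relationship to that client determines its lifetime:

· PERSISTENT (persistent node): the node is not deleted when the client that created it disconnects from the ZooKeeper service; it stays until it is deleted explicitly through the API.

· PERSISTENT_SEQUENTIAL (persistent sequential node): when a client asks to create node A, ZooKeeper uses the state of the parent znode to give A a number that is unique under that parent and only ever increases. The node is not deleted when the client disconnects.

· EPHEMERAL (ephemeral node): the node is deleted when the client that created it disconnects from the ZooKeeper service and its session ends.

· EPHEMERAL_SEQUENTIAL (ephemeral sequential node): the node is numbered in the same way as a PERSISTENT_SEQUENTIAL node, and it is deleted when the client that created it disconnects and its session ends.

· In addition, both EPHEMERAL and EPHEMERAL_SEQUENTIAL nodes are also deleted when the ZooKeeper client terminates abnormally.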

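4 ZooKeeper data structure

This data structure has the following characteristics:

Each directory entry, such as NameService, is called a znode. A znode is uniquely identified by its path; for example, the znode Server1 is identified as /NameService/Server1.
A znode can have child nodes and can store data. Note that EPHEMERAL nodes cannot have children.
A znode is versioned: its data carries a version number that increases with every change.
A znode can be ephemeral: once the client that created it loses contact with the server, the znode is deleted automatically. ZooKeeper clients and servers communicate over long-lived connections, and each client keeps its connection alive with heartbeats. This connection state is called a session; when the session of an ephemeral znode's creator expires, the znode is deleted.
A znode's name can be numbered automatically: if App1 already exists, the next one created is named App2.
A znode can be watched, including changes to the data stored in it and changes to its children. When something changes, the clients that set the watch are notified. This is ZooKeeper's core feature, and many of its capabilities are built on it.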

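(3) Leader election in ZooKeeper: FastLeaderElection

We already know that a ZooKeeper cluster has one node in the leader role and that all other nodes are followers. How is the leader chosen? That is decided by ZooKeeper's election rules. The default algorithm is called FastLeaderElection (other election algorithms are mentioned online, but their core idea is the same).

3.1 The core idea of the election algorithm

Some online material describes the algorithm purely in words and some uses flowcharts, but it is still easy to get lost without careful reading. Here the algorithm is described with a process diagram together with text. The core idea of FastLeaderElection comes down to these points:

· I am the best. As long as I have not seen a candidate better than mine, I keep nominating myself as leader. In the first round I always nominate myself.

· Whenever I receive someone else's nomination, I reply saying whether I am still nominating myself. If their candidate does not outrank mine, I keep nominating myself.

· I have a ballot box that holds the votes belonging to my current round. One vote per voter; I reject duplicate and stale votes.

· As soon as I stop nominating myself (because I found that someone else's candidate outranks mine), I empty my ballot box and start a new round of voting. (My ballot box then certainly holds two votes, both for the candidate I now consider best.)

· If I receive a nomination whose round number is higher than mine, I also empty my ballot box, and I keep voting for the candidate I considered best (unless the incoming candidate outranks it, in which case I update my nomination).

· I repeat this process, continually telling the others which round my vote belongs to and whom I nominate, until the candidate I consider best has at least N/2 + 1 votes in my ballot box.

· At that point I can decide whether I am a follower or the leader (if I was the best from start to finish, I am the leader; otherwise I am a follower). From then on, whoever sends me a vote gets my result back directly.

The diagram referred to here is an election process diagram from the web. I will not walk through its steps; I only hope it helps readers understand the election process.

This also explains why ZooKeeper stops working when fewer than N/2 + 1 nodes (integer division) are up: however the votes are cast, no node can collect N/2 + 1 of them.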
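The majority rule is simple arithmetic, and a short calculation makes the consequence for cluster sizing concrete. In this sketch `quorum` is a hypothetical helper name; it computes N/2 + 1 with integer division, and the loop shows how many failures a 3-node and a 5-node ensemble can tolerate.

```shell
# quorum N: smallest number of servers that forms a strict majority of N.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

for n in 3 5; do
  q=$(quorum "$n")
  echo "ensemble=$n quorum=$q tolerated_failures=$(( n - q ))"
done
# prints:
#   ensemble=3 quorum=2 tolerated_failures=1
#   ensemble=5 quorum=3 tolerated_failures=2
```

A 4-node ensemble needs 3 votes and so still tolerates only one failure, which is why odd ensemble sizes are the usual choice.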

The material above draws in part on blogs found online; please respect the original authors.

2.4.4.2 Downloading or uploading the tar package

Note: ZooKeeper only needs to be installed on node3, node4, and node5 (the number of servers should be odd).

Download the zookeeper-3.4.8 tar package from the official site directly into /app/opt/zookeeper/, or download it locally and upload it to /app/opt/zookeeper on the server, then unpack it.

2.4.4.3 Installation and configuration

After unpacking, go into ZooKeeper's conf directory.

Copy the sample configuration with cp zoo_sample.cfg zoo.cfg, then edit zoo.cfg. My configuration is:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/app/data/zookeeper/data
dataLogDir=/app/logs/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.3=node3:2888:3888
server.4=node4:2888:3888
server.5=node5:2888:3888

Then create a myid file in /app/data/zookeeper/data/ on node3, node4, and node5, containing 3, 4, and 5 respectively.
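Since the server ids in zoo.cfg match the digits in the hostnames, the myid file can be derived from the hostname instead of being typed by hand on each machine. The sketch below relies on that naming convention; `write_myid` is a hypothetical helper, and the demonstration writes into a scratch directory instead of /app/data/zookeeper/data.

```shell
# write_myid HOSTNAME DATA_DIR
# Derives the server id from the trailing digits of the hostname (node3 -> 3)
# and writes it to DATA_DIR/myid.
write_myid() {
  host="$1"
  data_dir="$2"
  id="$(printf '%s' "$host" | sed 's/^[^0-9]*//')"
  case "$id" in
    ''|*[!0-9]*) echo "cannot derive id from '$host'" >&2; return 1 ;;
  esac
  mkdir -p "$data_dir"
  echo "$id" > "$data_dir/myid"
}

# Scratch-directory demo; on a real node DATA_DIR would be /app/data/zookeeper/data
# and the first argument would be "$(hostname)".
DEMO_DIR="$(mktemp -d)"
write_myid node4 "$DEMO_DIR"
cat "$DEMO_DIR/myid"
```

The id in myid has to match the N in the server.N line for that host in zoo.cfg, otherwise the server will not join the ensemble.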

Next, add ZooKeeper's startup scripts to the PATH environment variable; the steps are the same as before and are not repeated here.

ZooKeeper is now installed. Start the cluster by starting ZooKeeper on node3, node4, and node5:

/app/opt/zookeeper/zookeeper-3.4.8/bin/zkServer.sh start     start

/app/opt/zookeeper/zookeeper-3.4.8/bin/zkServer.sh stop      stop

/app/opt/zookeeper/zookeeper-3.4.8/bin/zkServer.sh status    show status

The ZooKeeper cluster is now up.

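2.4.4.4 Formatting and starting Hadoop

On all five servers, go to the /app/opt/hadoop/hadoop-2.7.3/sbin directory.

1 On node1, register Hadoop with the ZooKeeper cluster:

hdfs zkfc -formatZK

Once this succeeds, go on to the next step.

2 On node3, node4, and node5, start the JournalNode service:

hadoop-daemon.sh start journalnode

3 Format and start the NameNode

On node1, the Hadoop master server (node1 is the active node and node2 the standby), run the following from bin:

hdfs namenode -format

Once the format succeeds, start the NameNode from sbin:

hadoop-daemon.sh start namenode

On node2, run:

bin/hdfs namenode -bootstrapStandby

hadoop-daemon.sh start namenode

The standby NameNode on node2 is now running.

4 Start the DataNodes

On node1, run:

hadoop-daemons.sh start datanode

The DataNodes start automatically on node2, node3, node4, and node5.

5 Start ZKFC

On node1 and node2, run:

hadoop-daemon.sh start zkfc

NameNode high availability is now fully started. Next we verify it.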
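Because the first start involves three groups of machines and the order matters, it can help to see the whole sequence in one place. The sketch below only prints the plan as "host: command" lines and runs nothing; `ha_bootstrap_plan` is a hypothetical helper, and it assumes the ZooKeeper ensemble is already running.

```shell
# Print the first-time HA bootstrap order as "host: command" lines (dry run).
ha_bootstrap_plan() {
  echo "node1: hdfs zkfc -formatZK"
  for n in node3 node4 node5; do
    echo "$n: hadoop-daemon.sh start journalnode"
  done
  echo "node1: hdfs namenode -format"
  echo "node1: hadoop-daemon.sh start namenode"
  echo "node2: hdfs namenode -bootstrapStandby"
  echo "node2: hadoop-daemon.sh start namenode"
  echo "node1: hadoop-daemons.sh start datanode"
  for n in node1 node2; do
    echo "$n: hadoop-daemon.sh start zkfc"
  done
}

ha_bootstrap_plan
```

The format and bootstrapStandby steps belong to the first start only; formatting again would wipe the NameNode metadata.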

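2.4.5 Verifying NameNode high availability

1 Open 192.168.21.161:50070 in a browser.

The page shows node1 as active. Now look at node2.

node2 is shown as standby. Now kill the NameNode process on node1 and reload the page: node1 can no longer be reached.

Look at node2 again.

node2 has become active, which shows that when node1 fails, node2 takes over automatically. Start the NameNode on node1 again and check whether it comes back as standby; if it does, everything is working.

NameNode high availability is now verified.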
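The same check can be made from the command line with `hdfs haadmin -getServiceState`, which prints active or standby for a NameNode ID. The sketch below assumes the IDs nn1 and nn2 that appear in the hdfs-site.xml property names, and that they map to node1 and node2 (the values are not shown in this guide). `nn_state` is a hypothetical wrapper that falls back to "unknown" when the hdfs command is missing or the query fails.

```shell
# nn_state ID: print the HA state of a NameNode, or "unknown" if it cannot be queried.
nn_state() {
  if command -v hdfs >/dev/null 2>&1; then
    hdfs haadmin -getServiceState "$1" 2>/dev/null || echo "unknown"
  else
    echo "unknown"
  fi
}

for nn in nn1 nn2; do
  echo "$nn: $(nn_state "$nn")"
done
```

Running the loop before and after killing the NameNode on node1 should show the two states swap.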

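2.4.6 ResourceManager high availability

On node1, run start-yarn.sh.

Then on node2 run yarn-daemon.sh start resourcemanager. However we try to open node2:8088, we are redirected to node1, which shows that high availability is in effect. Now stop the ResourceManager process on node1 and see what happens.

node2 has taken over from node1. If we then start the ResourceManager on node1 again and open node1:8088, we are redirected to node2. The Hadoop cluster is now complete; these components are quite easy to use.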

