Today I rebuilt a Hadoop 2.7.2 + ZooKeeper HA cluster following my earlier post 《Hadoop2.6.0 + zookeeper集群环境搭建》, so that when one NameNode goes down the other takes over automatically. The build itself went fairly smoothly, and once setup was done everything worked. But! When I restarted the cluster the next day, one of the two NameNodes refused to start. The details follow:


Problem Description

HA was configured according to plan, but after startup the NameNode would not stay up. Right after starting, jps showed a NameNode process, but a minute or two later it was gone. The log contained this error:

  org.apache.hadoop.ipc.Client: Retrying connect to server

 

Some testing, however, revealed two things:

  • Starting the JournalNodes first and the NameNode afterwards: the NameNode starts and runs normally.
  • Starting with start-dfs.sh: all the services come up, but the NameNode exits after a minute or two; running hadoop-daemon.sh start namenode again by itself then starts a NameNode that runs stably.
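The manual workaround from the two observations above can be sketched as a small script: bring the JournalNodes up first, then the NameNodes. This is only an illustration; the host names hadoop1..hadoop5 match this cluster, the /opt/hadoop/sbin path is an assumption, and the DRY_RUN switch is a made-up helper that prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the manual workaround: start the JournalNode quorum first,
# then the NameNodes. DRY_RUN=1 (the default here) only prints commands.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    eval "$*"
  fi
}

start_ha_services() {
  # 1. Start every JournalNode so port 8485 is listening cluster-wide.
  for host in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5; do
    run "ssh $host /opt/hadoop/sbin/hadoop-daemon.sh start journalnode"
  done
  # 2. Only then start the two NameNodes.
  for host in hadoop1 hadoop2; do
    run "ssh $host /opt/hadoop/sbin/hadoop-daemon.sh start namenode"
  done
}

start_ha_services
```

With DRY_RUN=0 the same script would actually execute the ssh commands; the point is simply the ordering, which is the same order the later fix buys time for.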

 

Now look at the NameNode log itself. Don't be put off by its length; all the clues to the failure are in it:

2016-09-03 00:58:24,249 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2016-09-03 00:58:24,249 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
2016-09-03 00:58:24,249 INFO org.apache.hadoop.util.GSet: capacity = 2^15 = 32768 entries
2016-09-03 00:58:24,249 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /opt/hadoop/tmp/dfs/name/in_use.lock acquired by nodename 6115@hadoop1
2016-09-03 00:58:24,268 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop4/192.168.1.134:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:24,269 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.131:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:24,269 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop2/192.168.1.132:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:24,269 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop3/192.168.1.133:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:25,270 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.131:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:25,270 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop4/192.168.1.134:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
(the same "Retrying connect" line repeats for hadoop1 through hadoop5 as the tried count climbs)

 

2016-09-03 00:58:33,294 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop3/192.168.1.133:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:33,298 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.1.131:8485, 192.168.1.132:8485, 192.168.1.133:8485, 192.168.1.134:8485, 192.168.1.135:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 4 exceptions thrown:
192.168.1.131:8485: Call From hadoop1/192.168.1.131 to hadoop1:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
(the same ConnectException follows for hadoop2, hadoop3 and hadoop4)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1508)

 

2016-09-03 00:58:55,289 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop3/192.168.1.133:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-09-03 00:58:55,291 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.131:8485, 192.168.1.132:8485, 192.168.1.133:8485, 192.168.1.134:8485, 192.168.1.135:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 4 exceptions thrown:
192.168.1.134:8485: Call From hadoop1/192.168.1.131 to hadoop4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
(the same ConnectException follows for hadoop1, hadoop3 and hadoop5)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:182)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)

 

 

Problem Analysis

The log looks long, but the analysis comes down to a few key lines.

We can be sure the NameNode fails not because of a configuration mistake, but because it cannot connect to the JournalNodes.

The JournalNode logs themselves show nothing wrong, so the problem must lie on the JournalNode's client side, i.e. the NameNode.

 

2016-09-03 00:58:46,256 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop2/192.168.1.132:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

 

Let's break down that line:

The NameNode, acting as a client of the JournalNodes, issued a connection request and failed; it then tried the other JournalNodes in turn, failing each time, until it hit the maximum retry count.

 

The experiments above tell us that starting the JournalNodes first, or simply starting the NameNode a second time, makes everything work. In other words, the JournalNodes were not ready yet, and the NameNode exhausted all of its retries before they came up.
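To see why the defaults are too tight, multiply the retry policy out. The figures come straight from the log (maxRetries=10, sleepTime=1000 ms) and from the larger values used in the fix (20 retries at a 5000 ms interval); the sketch below is just that arithmetic.

```shell
#!/bin/sh
# Retry policy from the log:
# RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
default_window_ms=$((10 * 1000))
echo "default retry window: ${default_window_ms} ms"   # only ~10 s of sleeps

# Tuned values: ipc.client.connect.max.retries=20,
# ipc.client.connect.retry.interval=5000
tuned_window_ms=$((20 * 5000))
echo "tuned retry window: ${tuned_window_ms} ms"       # ~100 s of sleeps
```

Note these totals count only the fixed sleeps; each attempt's own connect timeout adds time on top, which is why the roughly two-minute JournalNode warm-up observed on this cluster fits inside the tuned window but not the default one.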

 

 

Solution

 

Adjust the ipc client parameters in core-site.xml:

  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>20</value>
    <description>
      Indicates the number of retries a client will make to establish a server connection.
    </description>
  </property>

  <property>
    <name>ipc.client.connect.retry.interval</name>
    <value>5000</value>
    <description>
      Indicates the number of milliseconds a client will wait for before retrying to establish a server connection.
    </description>
  </property>

 

(Screenshot of core-site.xml: alongside hadoop.tmp.dir set to /opt/hadoop/tmp and ha.zookeeper.quorum set to hadoop1:2181,hadoop2:2181,hadoop3:2181,hadoop4:2181,hadoop5:2181, the two ipc properties are in place.)

 

These two parameters control the retry interval and retry count of the ipc connection requests the NameNode sends to the JournalNodes. On my virtual-machine cluster it takes about two minutes before the NameNode manages to connect to the JournalNodes; once connected, it stays stable.

 

Note: this kind of connection timeout, caused by a service that has not finished starting, can generally be solved by tuning the ipc parameters in core-site.xml. If the target service itself never starts successfully, adjusting the ipc parameters will not help.
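A complementary check, instead of relying purely on retry counts, is to verify that each JournalNode is actually listening on its RPC port before starting a NameNode. The wait_for_port helper below is a hypothetical sketch, not a Hadoop tool; port 8485 and the host names come from this cluster's logs, and the /dev/tcp trick requires bash rather than plain POSIX sh.

```shell
#!/bin/bash
# wait_for_port: poll until host:port accepts a TCP connection or the
# timeout (in seconds) expires. Uses bash's /dev/tcp pseudo-device.
wait_for_port() {
  local host=$1 port=$2 timeout_s=${3:-120} i=0
  while [ "$i" -lt "$timeout_s" ]; do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0    # the port is accepting connections
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1        # timed out; the service never came up
}

# Demo (disabled by default): check each JournalNode's RPC port (8485 in
# this cluster) before starting a NameNode. Set RUN_DEMO=1 to run it.
if [ "${RUN_DEMO:-0}" = "1" ]; then
  for host in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5; do
    wait_for_port "$host" 8485 120 && echo "$host:8485 is up" \
      || echo "$host:8485 still down after 120 s" >&2
  done
fi
```

A distinguishing point versus the ipc tuning: this check tells JournalNodes that are merely slow apart from JournalNodes that failed outright, which the retry parameters cannot do.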

 

Result

[root@hadoop1 ~]# /opt/hadoop/sbin/start-dfs.sh
Starting namenodes on [hadoop1 hadoop2]
hadoop1: starting namenode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-namenode-hadoop1.out
hadoop2: starting namenode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-namenode-hadoop2.out
hadoop1: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop1.out
hadoop2: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop2.out
hadoop5: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop5.out
hadoop3: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop3.out
hadoop4: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-datanode-hadoop4.out
Starting journal nodes [hadoop1 hadoop2 hadoop3 hadoop4 hadoop5]
hadoop1: starting journalnode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-journalnode-hadoop1.out
hadoop3: starting journalnode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-journalnode-hadoop3.out
hadoop2: starting journalnode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-journalnode-hadoop2.out
hadoop5: starting journalnode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-journalnode-hadoop5.out
hadoop4: starting journalnode, logging to /opt/hadoop-2.7.2/logs/hadoop-root-journalnode-hadoop4.out
Starting ZK Failover Controllers on NN hosts [hadoop1 hadoop2]
hadoop1: starting zkfc, logging to /opt/hadoop-2.7.2/logs/hadoop-root-zkfc-hadoop1.out
hadoop2: starting zkfc, logging to /opt/hadoop-2.7.2/logs/hadoop-root-zkfc-hadoop2.out

 

[root@hadoop1 ~]# /opt/hadoop/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-resourcemanager-hadoop1.out
hadoop5: starting nodemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop5.out
hadoop4: starting nodemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop4.out
hadoop2: starting nodemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop2.out
hadoop3: starting nodemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop3.out
hadoop1: starting nodemanager, logging to /opt/hadoop-2.7.2/logs/yarn-root-nodemanager-hadoop1.out
[root@hadoop1 ~]# jps
8929 NodeManager
8817 ResourceManager
8164 NameNode
8470 JournalNode
5926 QuorumPeerMain
8970 Jps
8604 DFSZKFailoverController
8287 DataNode

 

As you can see, the NameNode now starts normally and remains stable.

(NameNode web UI screenshot) Overview 'hadoop1:9000' (active): Namespace yhao, Namenode ID nn1, version 2.7.2, cluster ID CID-118ba1ac-2df3-430e-9fde-8c73753a6894, block pool ID BP-277984919-192.168.1.131-1472749072174. Security is off, safemode is off.

 

(NameNode web UI screenshot) Overview 'hadoop2:9000' (standby): Namespace yhao, Namenode ID nn2, version 2.7.2, cluster ID CID-118ba1ac-2df3-430e-9fde-8c73753a6894, block pool ID BP-277984919-192.168.1.131-1472749072174. Security is off, safemode is off.

 

The web UI also confirms that both NameNodes are up, one in the active state and one in standby.
