I. Hadoop

1. Running the wordcount example

Error message:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

14/10/11 17:19:49 INFO mapreduce.Job: Task Id : attempt_1413018622391_0001_r_000002_2, Status : FAILED
Container [pid=11406,containerID=container_1413018622391_0001_01_000023] is running beyond virtual memory limits. Current usage: 178.8 MB of 1 GB physical memory used; 5.7 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1413018622391_0001_01_000023 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 11406 10814 11406 11406 (bash) 0 0 108642304 302 /bin/bash -c /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Xmx5000m -Djava.io.tmpdir=/home/test/hadoop/data/tmp/nm-local-dir/usercache/test/appcache/application_1413018622391_0001/container_1413018622391_0001_01_000023/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/test/hadoop/app/hadoop-2.5.1/logs/userlogs/application_1413018622391_0001/container_1413018622391_0001_01_000023 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 172.16.2.193 60022 attempt_1413018622391_0001_r_000002_2 23 1>/home/test/hadoop/app/hadoop-2.5.1/logs/userlogs/application_1413018622391_0001/container_1413018622391_0001_01_000023/stdout 2>/home/test/hadoop/app/hadoop-2.5.1/logs/userlogs/application_1413018622391_0001/container_1413018622391_0001_01_000023/stderr  
	|- 11411 11406 11406 11406 (java) 367 12 6029000704 45458 /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5000m -Djava.io.tmpdir=/home/test/hadoop/data/tmp/nm-local-dir/usercache/test/appcache/application_1413018622391_0001/container_1413018622391_0001_01_000023/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/test/hadoop/app/hadoop-2.5.1/logs/userlogs/application_1413018622391_0001/container_1413018622391_0001_01_000023 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 172.16.2.193 60022 attempt_1413018622391_0001_r_000002_2 23

Analysis and solution:

This problem is memory-related. Check the settings in mapred-site.xml; in my case, lowering the following properties was enough. The exact values depend on how much memory the cluster has and on the number of map and reduce tasks configured (a sketch of the key entry follows the list).
<name>mapreduce.job.maps</name>
<name>mapreduce.job.reduces</name>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<name>mapred.child.java.opts</name>
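
In this particular run, the container dump above shows the JVM was launched with -Xmx5000m while the container was limited to 2.1 GB of virtual memory, so mapred.child.java.opts is the most likely culprit. A minimal sketch of the relevant mapred-site.xml entry (the value is only an example and must be sized for your cluster):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>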

2. In Hadoop HA mode, starting the second namenode reports no error, but the process does not start

Error message:

(None is printed; the namenode process simply does not appear.)

Analysis and solution:

The second namenode fails to start because the first namenode's metadata has not been synchronized to it. Run hdfs namenode -bootstrapStandby on the second namenode to pull the metadata over. If that still does not work, copy the contents of the name directory on the first namenode into the name directory of the second one, then start it again. One of these two approaches usually gets it running.
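
A sketch of both approaches, assuming the metadata lives in the directory configured as dfs.namenode.name.dir (the host name nn1 and the path below are placeholders):

# on the second (standby) namenode
hdfs namenode -bootstrapStandby

# if that still fails, copy the metadata over from the first namenode and start again
scp -r nn1:/home/hadoop/dfs/name/* /home/hadoop/dfs/name/
hadoop-daemon.sh start namenode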

3. DataNode fails to start

Error message:

The log reports the following error:
java.io.IOException: Incompatible clusterIDs in /home/hadoop/hadoop-2.5.1/data: namenode clusterID = CID-dedc085d-ec1c-4b07-89cb-5924880d2682; datanode clusterID = CID-ff0e415d-0734-4ae5-b828-6af6c6843ec4
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:477)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:226)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:254)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:975)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:946)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:278)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:812)
        at java.lang.Thread.run(Thread.java:745)

Analysis and solution:

This is almost always caused by formatting the namenode a second time after it has already been formatted once, which is why the clusterIDs no longer match.
For a test cluster of your own, simply delete the name and data directories and re-format the cluster (see the sketch below). For a real production environment I do not yet know a solution.
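
For a test cluster, the reset described above looks roughly like this (the directories are whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to; the paths below are only examples, the data path being the one from the error above):

stop-dfs.sh
# on every node, remove the old metadata and block directories
rm -rf /home/hadoop/hadoop-2.5.1/data /home/hadoop/hadoop-2.5.1/name
hdfs namenode -format
start-dfs.sh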

4. Hadoop commands hang for a long time with no response

Error message:

[test@x197 hadoop]$  hdfs dfs -chmod  755 /
14/10/25 14:53:43 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setPermission over /172.16.2.197:9000. Not retrying because retries (11) exceeded maximum allowed (10)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /. Name node is in safe mode.
The reported blocks 414 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 417.
The number of live datanodes 40 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1276)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermissionInt(FSNamesystem.java:1624)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:1607)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(NameNodeRpcServer.java:579)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setPermission(ClientNamenodeProtocolServerSideTranslatorPB.java:416)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /. Name node is in safe mode.
The reported blocks 414 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 417.
The number of live datanodes 40 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1272)
	... 13 more

	at org.apache.hadoop.ipc.Client.call(Client.java:1411)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy14.setPermission(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setPermission(ClientNamenodeProtocolTranslatorPB.java:314)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy15.setPermission(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:2163)
	at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1236)
	at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1232)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:1232)
	at org.apache.hadoop.fs.FsShellPermissions$Chmod.processPath(FsShellPermissions.java:103)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
chmod: changing permissions of '/': org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /. Name node is in safe mode.

Analysis and solution:

In my case this happened after I had changed a configuration file: on the next startup the namenode entered safe mode. It only happens occasionally; changing a configuration file normally does not trigger safe mode. Leaving safe mode resolves it:
hadoop dfsadmin -safemode leave
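
On Hadoop 2.x the same thing can be done through the hdfs command; checking the state first confirms the namenode really is in safe mode:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave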

II. Zookeeper

1. Checking Zookeeper status: zkServer.sh status

Error message:

JMX enabled by default
Using config: /home/test/hadoop/app/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

Analysis and solution:

The log shows this is a connection error, i.e. the machines cannot reach one another.
Connection problems usually come from one of three things: (1) the firewall, (2) ssh, or (3) the configuration, specifically the number in /data/myid. Checks for all three are sketched below.
When I hit a similar problem before, it was the last case. This time, checking all three did not help; in the end I replaced the faulty one of the three Zookeeper machines with another server and everything worked. The exact cause is unclear.
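
A quick checklist for the three causes (a sketch; the host name zk2 is a placeholder, the myid path is whatever dataDir in zoo.cfg points to, and the zoo.cfg path is the one from the output above):

# 1. firewall on each Zookeeper host
service iptables status
# 2. the hosts can reach one another
ssh zk2 date
# 3. myid matches this host's server.N entry in zoo.cfg
cat /data/myid
grep '^server\.' /home/test/hadoop/app/zookeeper-3.4.6/conf/zoo.cfg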

III. HBase

1. HRegionServer fails to start when HBase is started

Error message:

ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2489)

Analysis and solution:

This is caused by clock skew between the hosts. As root, set all hosts to the same time:
date -s "2014-10-13 09:56:33"
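
One way to push the same time to every host over ssh (a sketch; node1/node2/node3 are placeholders, and if an NTP server is available, running ntpdate on each host is the cleaner option):

for h in node1 node2 node3; do
  ssh root@$h 'date -s "2014-10-13 09:56:33"'
done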

2. HRegionServer starts but then exits automatically

Error message:

2014-10-11 19:51:37,875 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server x196,60020,1413028275179: Initialization of RS failed.  Hence aborting RS.
java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:411)
        at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:388)
        at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:269)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.<init>(CatalogTracker.java:151)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:752)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:715)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:848)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:409)
        ... 7 more
Caused by: java.lang.ExceptionInInitializerError
        at org.apache.hadoop.hbase.ClusterId.parseFrom(ClusterId.java:64)
        at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:69)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:837)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:640)
        ... 12 more
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: cluster
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
        at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:231)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
        at org.apache.hadoop.hbase.util.DynamicClassLoader.<init>(DynamicClassLoader.java:104)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.<clinit>(ProtobufUtil.java:201)
        ... 17 more
Caused by: java.net.UnknownHostException: cluster
        ... 31 more
2014-10-11 19:51:37,880 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2014-10-11 19:51:37,881 INFO  [regionserver60020] regionserver.HRegionServer: STOPPED: Initialization of RS failed.  Hence aborting RS.

Analysis and solution:

The exact cause is still unclear (the stack trace bottoms out in java.net.UnknownHostException: cluster). I solved this once before but have forgotten how, and I have not yet found a configuration change that reliably prevents the exit.
The current workaround is to start the regionserver manually on each host:
hbase-daemon.sh start regionserver
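
With many regionservers, a small loop over the hosts listed in conf/regionservers saves typing (a sketch; it assumes passwordless ssh and the same HBase install path, taken from the log above, on every host):

for h in $(cat /home/test/hadoop/app/hbase-0.98.6.1-hadoop2/conf/regionservers); do
  ssh $h '/home/test/hadoop/app/hbase-0.98.6.1-hadoop2/bin/hbase-daemon.sh start regionserver'
done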

3. Warning when running list in the hbase shell

Error message:

hbase(main):001:0> list
TABLE                                                                                           
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/test/hadoop/app/hbase-0.98.6.1-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/test/hadoop/app/hadoop-2.5.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

Analysis and solution:

Delete the duplicate SLF4J jar shipped under the HBase directory; the extra binding listed in the warning is slf4j-log4j12-1.6.4.jar in HBase's lib folder, and removing it makes the warning go away.
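
To see which copies are on the classpath and remove the extra binding (paths taken from the warning above):

ls /home/test/hadoop/app/hbase-0.98.6.1-hadoop2/lib/slf4j-*
rm /home/test/hadoop/app/hbase-0.98.6.1-hadoop2/lib/slf4j-log4j12-1.6.4.jar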