org.apache.flink.util.FlinkException: JobMaster for job failed,Flink集群无法启动
org.apache.flink.util.FlinkException: JobMaster for job failed,Flink集群无法启动某天集群Zookeeper因为OOM挂掉,引起基于此ZK作HA的Flink集群出现故障。在Zookeeper恢复后,重启Flink集群(版本1.7.2)时出现如下报错,导致集群无法正常启动。ERROR org.apache.flink.runtime.
·
org.apache.flink.util.FlinkException: JobMaster for job failed,Flink集群无法启动
某天集群Zookeeper因为OOM挂掉,引起基于此ZK作HA的Flink集群出现故障。
在Zookeeper恢复后,重启Flink集群(版本1.7.2)时出现如下报错,导致集群无法正常启动。
ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job 86f3b54aba572a495740abbcff328c81 failed.
at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:759)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$startJobManagerRunner$6(Dispatcher.java:339)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not start the job manager.
at org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$verifyJobSchedulingStatusAndStartJobManager$2(JobManagerRunner.java:340)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_3#-582787431]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.UnfencedMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
... 1 more
可以看到是Flink集群重启时,同时重启任务,但是这个任务启动对应的JobMaster时失败了。怀疑ZK挂掉后Flink停机,但是没法正常处理ZK上的元数据。Flink集群重启后,任务状态依然维持在停机之前,但是JobMaster的信息发生变化,任务无法启动引起整个集群故障。
最后手动修改问题任务的在ZK上的状态,将RUNNING状态改为DONE,让有问题的任务不重启。
[zk: zk01:2181(CONNECTED) 14] ls /flink/ha
[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter, running_job_registry]
# 修改有问题的任务状态
[zk: zk01:2181(CONNECTED) 15] set /flink/ha/running_job_registry/86f3b54aba572a495740abbcff328c81 DONE
再次重启,如果还有其他任务状态异常也需要进行相同操作。
zk上jobgraphs是任务的解析的jobgraph,leader下面是任务的JobMaster信息,leaderlatch是选举主JobManager的,checkpoint-counter是ck信息,running_job_registry是在运行任务的注册信息。
更多推荐
已为社区贡献1条内容
所有评论(0)