Troubleshooting Flink on Kubernetes: TaskManager cannot connect to JobManager
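I recently deployed Flink on Kubernetes. A single job deploys and runs without problems, but when several jobs run at the same time, the system reports errors that look at first like failed resource requests,
for example: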
java.io.IOException: Failed to fetch BLOB 04fe83f2b1ff5a167fe4e6c321226dc6/p-6521abaef7d048ff4aed8ce5d585b10ccdd308b-2a56a314e6d42d012caad6e3d38b0a85 from flink-jobmanager/xxx.xxx.xxx.xxx:6124 and store it under /tmp/blobStore-2cc077e6-6dee-4180-bc70-e110a5cf6c24/incoming/temp-00000350
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:167)
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:166)
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333)
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:983)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:632)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Could not connect to BlobServer at address flink-jobmanager/xxx.xxx.xxx.xxx:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102)
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
    ... 10 more
Caused by: java.net.UnknownHostException: flink-jobmanager
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:96)
    ... 11 more
At first I suspected the Kubernetes Service. Deleting the Service and publishing it again fixed the problem temporarily.
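For reference, the Service in question might look roughly like the sketch below. The name flink-jobmanager and port 6124 come from the stack trace; the other ports, the port names, and the selector labels are assumptions based on a typical standalone Flink deployment and may differ in your cluster.

```yaml
# Sketch of a JobManager Service (assumed layout, not the author's actual manifest)
apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager        # hostname the TaskManagers try to resolve
spec:
  type: ClusterIP
  ports:
  - name: rpc                   # assumed default JobManager RPC port
    port: 6123
  - name: blob-server           # port seen in the stack trace
    port: 6124
  - name: webui                 # assumed default web UI port
    port: 8081
  selector:                     # hypothetical labels; must match the JobManager pod
    app: flink
    component: jobmanager
```

Re-applying a manifest like this re-creates the Service, which matches the temporary fix described above.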
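Later, as the number of jobs grew, the problem came back. The real cause was the unstable network of the Kubernetes cluster itself: the UnknownHostException in the trace suggests that the TaskManager could not reliably resolve the service name flink-jobmanager. I tried adding the following to the Flink configuration file, using the Service's ClusterIP in place of its name: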
jobmanager.rpc.address: <clusterIP>
At the same time, I increased the TaskManager memory configuration.
After redeploying the JobManager and the TaskManagers, the problem was resolved.
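Put together, the two changes might look like the following flink-conf.yaml fragment. This is only a sketch: the IP address and the memory size are hypothetical placeholders, and taskmanager.memory.process.size is assumed to be the memory option that was raised, since the original does not say which one. The ClusterIP can be looked up with kubectl get svc flink-jobmanager.

```yaml
# flink-conf.yaml (sketch with placeholder values)

# Use the Service's ClusterIP instead of the name flink-jobmanager,
# so TaskManagers do not depend on DNS resolution.
# 10.96.123.45 is a hypothetical address; substitute your own ClusterIP.
jobmanager.rpc.address: 10.96.123.45

# BLOB server port as seen in the stack trace.
blob.server.port: 6124

# Assumed memory option and hypothetical value; size it for your jobs.
taskmanager.memory.process.size: 4096m
```

One trade-off of this approach is that the ClusterIP changes whenever the Service is deleted and re-created, so the configuration has to be updated after that.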