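I recently deployed Flink on Kubernetes (K8S). With a single job running, everything worked. As soon as several jobs ran at the same time, the system started reporting errors that looked like failed resource requests,

for example: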

java.io.IOException: Failed to fetch BLOB 04fe83f2b1ff5a167fe4e6c321226dc6/p-6521abaef7d048ff4aed8ce5d585b10ccdd308b-2a56a314e6d42d012caad6e3d38b0a85 from flink-jobmanager/xxx.xxx.xxx.xxx:6124 and store it under /tmp/blobStore-2cc077e6-6dee-4180-bc70-e110a5cf6c24/incoming/temp-00000350
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:167)
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:166)
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333)
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:983)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:632)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Could not connect to BlobServer at address flink-jobmanager/xxx.xxx.xxx.xxx:6124
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102)
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
    ... 10 more
Caused by: java.net.UnknownHostException: flink-jobmanager
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:96)
    ... 11 more
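
The innermost cause in this trace is an UnknownHostException: the TaskManager could not resolve the name flink-jobmanager at all, so the request never reached the BlobServer. Below is a minimal sketch for checking name resolution from inside a TaskManager pod, assuming a JDK is available there. The class name DnsCheck and its resolve helper are made up for illustration and are not part of Flink.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Flink: checks whether a host name resolves.
class DnsCheck {

    // Returns the resolved IP address as text, or null if the lookup fails.
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Default to the Service name that appears in the stack trace.
        String host = args.length > 0 ? args[0] : "flink-jobmanager";
        String ip = resolve(host);
        if (ip == null) {
            System.out.println("cannot resolve " + host);
        } else {
            System.out.println(host + " resolved to " + ip);
        }
    }
}
```

Compile it with javac DnsCheck.java and run java DnsCheck flink-jobmanager several times. Each run should be a separate process, because the JVM caches lookup results within one process. Intermittent "cannot resolve" output would point at name resolution inside the cluster rather than at Flink itself.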
 

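At first I suspected the K8S Service. Deleting the Service and publishing it again fixed the problem, but only temporarily.

Later, as the number of jobs grew, the problem came back. The real cause was the instability of K8S networking itself: as the UnknownHostException at the bottom of the trace shows, the TaskManager could not resolve the Service name flink-jobmanager. I tried adding the following to the config file, so that the JobManager is addressed by its cluster IP instead of by name: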

jobmanager.rpc.address: <clusterIP>

At the same time I increased the TaskManager memory settings.

After redeploying the JobManager and the TaskManager, the problem was solved.
