Origin of the problem:

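On the cluster I work with, k8s Deployments and Jobs frequently end up with pods stuck in Terminating. Tracing this back, I found the cause: on the node hosting the pod (Ubuntu 16.04.6), a process inside the container had not been killed. That process was in the D state (uninterruptible sleep), which makes it hard to deal with.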
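To confirm which processes on a node are in the D state, one option is to scan `/proc` directly. The following is a minimal sketch, not the procedure used in this post: the helper names `parse_stat_line` and `find_d_state` are made up for illustration, and it assumes a Linux node where `/proc/<pid>/stat` is readable.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) parsed from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every process whose state field is 'D'."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process that shows up here on repeated runs is stuck in uninterruptible sleep; it will not act on SIGKILL until the kernel call it is blocked in returns.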

The logs on the node hosting the pod show the following characteristics:

1. A large number of OOM records

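2. syslog (and likewise dmesg) frequently reports SLUB errors. (Reading around online later, I learned that although this message stems from a system bug, it is not the origin of the problem described in this post.)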

SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)

3. The docker log shows:

Aug 26 21:12:22 n002 dockerd[1632]: time="2020-08-26T21:12:22.358239959+08:00" level=info msg="Container b39ef98d452cd825cd6ab4e07767b5e8091d055e75e9a7b96ba83ba9c4ac2089 failed to exit within 30 seconds of signal 15 - using the force"
level=info msg="Container b39ef98d452c failed to exit within 10 seconds of kill - trying direct SIGKILL"

4. A large number of NFS retry entries in dmesg:

kernel: [1032289.079654] nfs: server 10.32.0.10 not responding, still trying
kernel: [1032289.079664] nfs: server 10.32.0.10 not responding, still trying
kernel: [1032289.151627] nfs: server 10.32.0.10 not responding, still trying
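To see how often each NFS server is reported as unresponsive, the kernel log can be tallied per server. This is a rough sketch only: the function name `count_nfs_timeouts` is hypothetical, the regular expression is based solely on the three lines quoted above, and the sample input is those same lines. Processes blocked on I/O to an unresponsive NFS server are a well-known source of D-state processes, so a high count may be worth following up, though this post does not establish it as the root cause.

```python
import re
from collections import Counter

# Pattern taken from the dmesg lines quoted above; other kernel
# versions may word the message differently.
NFS_RE = re.compile(r"nfs: server (\S+) not responding, still trying")


def count_nfs_timeouts(lines):
    """Count 'not responding' messages per NFS server address."""
    counts = Counter()
    for line in lines:
        match = NFS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "kernel: [1032289.079654] nfs: server 10.32.0.10 not responding, still trying",
    "kernel: [1032289.079664] nfs: server 10.32.0.10 not responding, still trying",
    "kernel: [1032289.151627] nfs: server 10.32.0.10 not responding, still trying",
]

if __name__ == "__main__":
    print(count_nfs_timeouts(sample))  # Counter({'10.32.0.10': 3})
```

On a real node, the same function could be fed the lines of a saved dmesg or syslog dump instead of `sample`.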

 

Update added 2020-12-13 01:00:50. Off to bed for now; more to follow later. Feel free to get in touch with questions:

https://k8s.imroc.io/avoid/handle-cgroup-oom-in-userspace-with-oom-guard/

https://k8s.imroc.io/troubleshooting/pod/slow-terminating/

https://www.cnblogs.com/jmliao/p/11322804.html

 

 
