Solution: upgrade the system kernel

Reproducing the OOM problem in testing

Test kernel: 3.10.0-862.3.2.el7.x86_64

Seven pods that deliberately trigger OOM were started on a single node. Testing showed that the 3.10 kernel created the seven tasks in parallel, so they all invoked the OOM killer at the same time and deadlocked on kernel locks. With OOM triggered continuously, the CPUs were exhausted and the server died within 2-3 minutes, then rebooted automatically.
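The failing setup can be sketched as a deployment pinned to one node, where each replica's memory limit is far below what it tries to allocate, so every container is OOM-killed in a loop. This is a hypothetical reproduction manifest, not the one used in the original test: the deployment name, node name, image, and sizes are all illustrative.

```shell
# Hypothetical reproduction manifest; names, image, and sizes are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-repro              # hypothetical name
spec:
  replicas: 7                  # seven OOM-looping pods, as in the test above
  selector:
    matchLabels: {app: oom-repro}
  template:
    metadata:
      labels: {app: oom-repro}
    spec:
      nodeName: k8snode6       # pin all replicas to the node under test
      containers:
      - name: hog
        image: polinux/stress              # assumed public stress image
        resources:
          limits: {memory: "64Mi"}         # limit well below the allocation
        command: ["stress", "--vm", "1", "--vm-bytes", "256M"]
EOF
```

Each container allocates 256M against a 64Mi limit, so the kernel OOM-kills it and kubelet restarts it, keeping the OOM pressure continuous.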

Messages like `kernel: BUG: soft lockup - CPU#4 stuck for 22s! [handler20:1542]` are another symptom of this 3.10 kernel bug.

Nov  6 10:42:55 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:42:55 GFS-6 kernel: runc:[1:CHILD] cpuset=c156bcb333882b0a8de413c6e7cbe73867d388dc63d99c7b72d926aa6e669b6a mems_allowed=0
Nov  6 10:43:02 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:02 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:03 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:03 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:07 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:07 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:08 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:08 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:09 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:09 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:11 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:11 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuset
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpu
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuacct
Nov  6 10:43:58 GFS-6 kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
Nov  6 10:43:58 GFS-6 kernel: PERCPU: Embedded 35 pages/cpu @ffff96fa7fc00000 s104856 r8192 d30312 u262144
Nov  6 10:43:58 GFS-6 kernel: #011RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=8.
Nov  6 10:43:58 GFS-6 kernel: core: CPUID marked event: 'cpu cycles' unavailable
Nov  6 10:43:58 GFS-6 kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
Nov  6 10:43:58 GFS-6 kernel: NMI watchdog: Shutting down hard lockup detector on all cpus    <<-- all CPUs hung; the server rebooted abnormally
Nov  6 10:46:02 GFS-6 systemd: Started Docker Application Container Engine.    <<-- after the reboot
Nov  6 10:46:02 GFS-6 systemd: Reached target Multi-User System.
Nov  6 10:46:02 GFS-6 systemd: Starting Multi-User System.
Nov  6 10:46:02 GFS-6 systemd: Starting Update UTMP about System Runlevel Changes...
Nov  6 10:46:02 GFS-6 systemd: Started Update UTMP about System Runlevel Changes.
Nov  6 10:46:02 GFS-6 systemd: Startup finished in 1.456s (kernel) + 4.661s (initrd) + 9.786s (userspace) = 15.904s.
Nov  6 10:46:05 GFS-6 systemd: kubelet.service holdoff time over, scheduling restart.
Nov  6 10:46:05 GFS-6 systemd: Starting kubelet: The Kubernetes Node Agent...
Nov  6 10:46:05 GFS-6 systemd: Started kubelet: The Kubernetes Node Agent.
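When watching a node for this failure, the repeated `invoked oom-killer` and `soft lockup` lines above can be counted with a small helper. A minimal sketch (the `oom_scan` function and the sample file are ours, demonstrated on lines copied from the log above):

```shell
#!/bin/sh
# Count oom-killer invocations and soft-lockup warnings in a log file.
# Usage: oom_scan /var/log/messages
oom_scan() {
    log="$1"
    ooms=$(grep -c 'invoked oom-killer' "$log")
    lockups=$(grep -c 'soft lockup' "$log")
    echo "oom-killer invocations: $ooms"
    echo "soft lockups: $lockups"
}

# Demo on sample lines taken from the incident log above:
cat > /tmp/messages.sample <<'EOF'
Nov  6 10:42:55 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:43:02 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
Nov  6 10:42:30 GFS-6 kernel: BUG: soft lockup - CPU#4 stuck for 22s! [handler20:1542]
EOF
oom_scan /tmp/messages.sample
```

A rapidly climbing oom-killer count together with soft-lockup warnings is the signature seen above.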

At this point Kubernetes could no longer manage the node, and every pod on it was down:

[root@k8s-m1 test]# kubectl get po -o wide --all-namespaces |grep k8snode6
default                            ngx-pod-6f977cf846-7k4vm                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-85mtx                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-hsf6x                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-lt68h                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-mqvcf                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-rmxzj                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-sgvrd                              0/1       ContainerCreating   0          2m        <none>           k8snode6
kube-system                        kube-proxy-9mtnw                                      0/1       Error               3          125d      10.80.136.179    k8snode6
monitoring                         kube-prometheus-node-exporter-xbf9k                   0/1       Error               1          63d       10.80.136.179    k8snode6

Upgrade the kernel to 4.19.1 and retest the OOM problem

The same seven OOM-triggering pods were started on a single node.

Test kernel: 4.19.1-1.el7.elrepo.x86_64

Testing showed that the 4.19 kernel creates the tasks serially rather than in parallel, and the kernel-lock bug could not be triggered.
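After rebooting into the upgraded kernel, the running version can be verified before retesting. A minimal sketch, assuming GNU `sort -V` is available; the `version_ge` helper is ours, not a standard tool:

```shell
#!/bin/sh
# True if $1 >= $2 in version order (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

required=4.19
current=$(uname -r | cut -d- -f1)   # e.g. "4.19.1" from "4.19.1-1.el7.elrepo.x86_64"
if version_ge "$current" "$required"; then
    echo "kernel $current OK (>= $required)"
else
    echo "kernel $current too old, upgrade needed"
fi
```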

[root@k8snode7-180v136-taiji ~]# tail -f /var/log/messages|grep oom_kill
Nov  6 11:32:58 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:32:59 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:00 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:01 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:04 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:05 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:06 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:07 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:08 GFS-7 kernel: oom_kill_process+0x262/0x290
......................
[root@k8s-m1 test]# kubectl get po --all-namespaces -o wide |grep k8snode7
default                            ngx-pod-74c88d6495-79krh                              0/1       ContainerCreating   0          33m       <none>           k8snode7
kube-system                        kube-proxy-xt4c7                                      1/1       Running             1          55d       10.80.136.180    k8snode7
monitoring                         kube-prometheus-node-exporter-bbsjn                   1/1       Running             1          60d       10.80.136.180    k8snode7

Summary: for now, canary-upgrade a subset of servers to kernel 4.19.1; more details to follow.

Kernel upgrade procedure

1. Add the ELRepo repository:
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
2. List the available kernel packages:
yum --disablerepo="*" --enablerepo="elrepo-kernel" list available
3. Install the latest mainline stable kernel:
yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel -y
Check the boot menu entries:
awk -F\' '$1=="menuentry " {print $2}' /etc/grub2.cfg
GRUB numbers its menu entries from 0, and a newly installed kernel is inserted at the top of the list (index 0), pushing the old kernel down to index 1. To boot the new kernel, set the default entry to 0 and regenerate the config:
 
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
cat /boot/grub2/grub.cfg
After verifying the new kernel boots, remove the old one (adjust the version to match your system):
yum remove kernel-3.10.0-327.el7.x86_64 kernel-devel-3.10.0-327.el7.x86_64 -y
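The awk one-liner above splits each line on single quotes, so for `menuentry 'Title' ...` lines, field 2 is the entry title. A small demo on a made-up grub.cfg shows the index-0 ordering the procedure relies on (the file contents are illustrative):

```shell
#!/bin/sh
# Minimal sample grub.cfg with two menu entries (illustrative titles).
cat > /tmp/grub.cfg.sample <<'EOF'
menuentry 'CentOS Linux (4.19.1-1.el7.elrepo.x86_64) 7 (Core)' --class centos {
}
menuentry 'CentOS Linux (3.10.0-862.3.2.el7.x86_64) 7 (Core)' --class centos {
}
EOF
# Same parse as above, but also print each entry's index:
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /tmp/grub.cfg.sample
```

The new kernel appears at index 0, so `grub2-set-default 0` selects it.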

Installing a specific kernel version

Archived builds of other kernel versions can be downloaded from the mirror used below.

The following installs a chosen version instead of the latest mainline (ml) kernel from the steps above:

# Pick the version you want, e.g. 4.18.9-1 or 4.20.13-1
export Kernel_Version=4.18.9-1
wget http://mirror.rc.usf.edu/compute_lock/elrepo/kernel/el7/x86_64/RPMS/kernel-ml{,-devel}-${Kernel_Version}.el7.elrepo.x86_64.rpm
yum localinstall -y kernel-ml*
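The `kernel-ml{,-devel}` part relies on bash brace expansion, which turns the one URL pattern into two: the kernel package and its -devel counterpart. A quick illustration:

```shell
#!/bin/bash
# Brace expansion happens before variable expansion, producing two file names.
Kernel_Version=4.20.13-1
echo kernel-ml{,-devel}-${Kernel_Version}.el7.elrepo.x86_64.rpm
```

Note this is a bash feature; under a plain POSIX sh the braces are passed through literally.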
