Detecting Node Problems in a Kubernetes Cluster
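Problems that can occur in a production k8s cluster
- Infrastructure daemon problems: for example, the NTP service is down
- Hardware problems: bad CPU, memory, or disk
- Kernel problems: kernel deadlock, corrupted file system
- Container runtime problems: the runtime daemon is unresponsive
- …
When a node runs into one of these problems, the Kubernetes control-plane components are not aware of it, so the scheduler keeps placing Pods on the faulty node, which can lead to service interruptions.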
node-problem-detector
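node-problem-detector (NPD) is a probe that watches the health of a node.
To address the problems above, the community introduced the node-problem-detector daemon, which collects node problems from various daemons and makes them visible to the upstream layers.
As a node diagnostics tool for Kubernetes, it can detect anomalies such as:
• Unresponsive container runtime
• Unresponsive Linux kernel
• Network anomalies
• File descriptor anomalies
• Hardware problems such as CPU, memory, or disk failures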
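Fault categories
Problem reporting strategy
node-problem-detector reports problems either by setting a NodeCondition (updating the node status) or by creating an Event object:
- NodeCondition: permanent problems are reported by setting a NodeCondition, which changes the node's status
- Event: temporary problems are reported as an Event to alert the affected objects, for example to notify all Pods running on the node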
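The reporting strategy can be sketched in a few lines of Python. This is a simplified illustration, not NPD's actual implementation: the `Rule` class, the `report_for` function, and both regular expressions are assumptions written for this example, loosely modeled on the "temporary" and "permanent" rule types used in NPD's log monitor configuration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    kind: str                        # "temporary" or "permanent"
    reason: str                      # short machine-readable reason
    pattern: str                     # regular expression matched against a log line
    condition: Optional[str] = None  # condition type, only used by permanent rules


# Illustrative rules; the patterns are assumptions made for this sketch.
RULES: List[Rule] = [
    Rule("temporary", "KernelOops",
         r"BUG: unable to handle kernel NULL pointer dereference at .*"),
    Rule("permanent", "DockerHung",
         r"task docker:\w+ blocked for more than \w+ seconds\.",
         condition="KernelDeadlock"),
]


def report_for(line: str, rules: List[Rule] = RULES) -> Optional[Dict[str, str]]:
    """Return the report produced by the first matching rule, or None."""
    for rule in rules:
        if re.search(rule.pattern, line) is None:
            continue
        if rule.kind == "permanent":
            # Permanent problem: set the NodeCondition to "True".
            return {"report": "NodeCondition", "type": rule.condition or "",
                    "status": "True", "reason": rule.reason, "message": line}
        # Temporary problem: emit an Event and leave the conditions untouched.
        return {"report": "Event", "reason": rule.reason, "message": line}
    return None


print(report_for("kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING"))
print(report_for("kernel: INFO: task docker:20744 blocked for more than 120 seconds."))
```

The first line matches the temporary rule and yields an Event; the second matches the permanent rule and yields a NodeCondition update. A line that matches no rule produces no report at all.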
Hands-on practice
Community project: https://github.com/kubernetes/node-problem-detector
Install with Helm
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector
YAML manifest (original)
[root@VM-2-29-centos node-problem-detector]# cat npd-ds.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: cncamp/node-problem-detector:v0.8.10
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
[root@VM-2-29-centos node-problem-detector]# kubectl get node 10.0.2.29 -oyaml
...
  conditions:
  - lastHeartbeatTime: "2022-02-23T07:36:39Z"
    lastTransitionTime: "2022-02-12T08:32:04Z"
    message: Containerd service is up
    reason: ContainerdIsUp
    status: "False"
    type: ContainerdProblem
  - lastHeartbeatTime: "2022-02-23T07:36:39Z"
    lastTransitionTime: "2022-02-12T08:32:04Z"
    message: FD is Under Pressure
    reason: FDUnderPressure
    status: "False"
    type: FDPressure
...
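Once NPD detects a node problem, it records it in the node's conditions. That is only a status update: nothing acts on it automatically, so the loop is not closed. To close it, you can start NPD as an addon Pod, hook it up to a monitoring and alerting system, or develop a custom controller.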
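As a starting point for an alerting hook or a custom controller, the sketch below picks the active problems out of a node object. It is an illustration under stated assumptions: the node is a plain dictionary shaped like the output of `kubectl get node -o json`, the function name `active_problems` is made up for this example, and every condition type outside the small set maintained by the kubelet is assumed to come from NPD.

```python
from typing import Dict, List

# Condition types maintained by the kubelet itself. Anything else is assumed,
# for this sketch, to have been reported by NPD or a similar add-on.
KUBELET_CONDITIONS = {"Ready", "MemoryPressure", "DiskPressure",
                      "PIDPressure", "NetworkUnavailable"}


def active_problems(node: Dict) -> List[Dict]:
    """Return the non-kubelet conditions whose status is "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return [c for c in conditions
            if c.get("type") not in KUBELET_CONDITIONS
            and c.get("status") == "True"]


# Sample data in the shape of a node object (values are illustrative).
node = {"status": {"conditions": [
    {"type": "ContainerdProblem", "status": "False", "reason": "ContainerdIsUp"},
    {"type": "FDPressure", "status": "False", "reason": "FDUnderPressure"},
    {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung"},
    {"type": "Ready", "status": "True", "reason": "KubeletReady"},
]}}

for cond in active_problems(node):
    print(f'{cond["type"]}: {cond["reason"]}')
```

Note that `Ready` is excluded on purpose: for that condition, a status of "True" means the node is healthy, so it must not be treated like an NPD problem condition.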
Fault injection drill
To see node-problem-detector at work in a running cluster, inject a message into a log it is watching. For example, assuming node-problem-detector is running the KernelMonitor, run kubectl get events -w in one terminal, then write a fake kernel message to /dev/kmsg on a cluster node (as root):
# sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
A KernelOops event should then show up in the event stream.
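Starting NPD with an addon Pod
If you are using a custom cluster bootstrap solution and do not need to override the default configuration, you can use the addon Pod to further automate the deployment.
Create node-problem-detector.yaml and, on a control plane node, save the configuration in the addon Pod's directory /etc/kubernetes/addons/node-problem-detector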
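How NPD handles problems
- NPD only collects the problem events and updates the node condition; by itself it does not make the node unschedulable or otherwise affect scheduling
    lastHeartbeatTime: "2021-11-06T15:44:46Z"
    lastTransitionTime: "2021-11-06T15:29:43Z"
    message: 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.'
    reason: DockerHung
    status: "True"
    type: KernelDeadlock
- A custom controller is needed to watch the conditions reported by NPD and taint the node, which keeps Pods from being scheduled onto the faulty node
- After the problem has been fixed, restart the NPD Pod to clear the error condition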
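The decision logic of such a custom controller can be sketched as a pure function. This is a minimal illustration, not a complete controller: the mapping `CONDITION_TAINTS`, the taint key `example.com/kernel-deadlock`, and the function `reconcile_taints` are hypothetical names chosen for this example, and actually applying the result to the cluster (through the Kubernetes API) is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from an NPD condition type to the taint key that
# should be present while that condition is "True".
CONDITION_TAINTS = {"KernelDeadlock": "example.com/kernel-deadlock"}


def reconcile_taints(conditions: List[Dict],
                     taints: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Return (taints to add, taint keys to remove) for a single node."""
    present = {t["key"] for t in taints}
    to_add: List[Dict] = []
    to_remove: List[str] = []
    for cond_type, key in CONDITION_TAINTS.items():
        unhealthy = any(c.get("type") == cond_type and c.get("status") == "True"
                        for c in conditions)
        if unhealthy and key not in present:
            # NoSchedule keeps new Pods away without evicting running ones.
            to_add.append({"key": key, "effect": "NoSchedule"})
        elif not unhealthy and key in present:
            # The condition has cleared, so the taint can go.
            to_remove.append(key)
    return to_add, to_remove


conditions = [{"type": "KernelDeadlock", "status": "True", "reason": "DockerHung"}]
print(reconcile_taints(conditions, taints=[]))
```

Computing both the additions and the removals in one pass keeps the function idempotent: calling it again after the taint has been applied returns two empty lists, and calling it after the condition clears asks for the taint to be removed.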
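Troubleshooting common problems
SSH into nodes on the internal network
- Create a Pod that supports SSH
- Forward SSH requests to it through a load balancer
Viewing logs
For services started by systemd
journalctl -afu kubelet -S "2019-08-26 15:00:00"
-u unit: the systemd unit of the component, e.g. kubelet
-f follow: keep printing new log entries
-a show all: show all fields in full
-S since: start from a given time, e.g. -S "2019-08-26 15:00:00"
For standard container logs
kubectl logs -f -c <containername> <podname>
kubectl logs -f --all-containers <podname>
kubectl logs -c <containername> <podname> --previous
If the container's logs are redirected to a file by the shell, read them through exec
kubectl exec -it xxx -- tail -f /path/to/log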
Appendix: production configuration
[root@VM-2-29-centos node-problem-detector]# kubectl get ds -nti-inf node-problem-detector -oyaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    cpaas.io/creator: admin
    cpaas.io/updated-at: "2022-02-12T07:58:59Z"
    deprecated.daemonset.template.generation: "2"
    meta.helm.sh/release-name: node-problem-detector
    meta.helm.sh/release-namespace: ti-inf
  creationTimestamp: "2022-02-11T19:52:30Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  managedFields:
  - apiVersion: apps/v1
    manager: Go-http-client
    operation: Update
    time: "2022-02-11T19:52:30Z"
  - apiVersion: apps/v1
    manager: kube-controller-manager
    operation: Update
    time: "2022-02-12T09:20:03Z"
  name: node-problem-detector
  namespace: ti-inf
  resourceVersion: "6040793"
  selfLink: /apis/apps/v1/namespaces/ti-inf/daemonsets/node-problem-detector
  uid: 07b214b9-9279-465b-a900-eae233928a3e
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      annotations:
        cpaas.io/creator: admin
      creationTimestamp: null
      labels:
        app: node-problem-detector
    spec:
      containers:
      - command:
        - /node-problem-detector
        - --logtostderr
        - --v=3
        - --config.zombie-process-monitor=/config/zombie-process-monitor.json # zombie process monitor
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json # system log monitors
        - --config.custom-plugin-monitor=/config/custom-plugin-fd-pressure.json,/config/network-problem-monitor.json,/config/costom-plugin-thread-pressure.json,/config/systemd-monitor-counter.json,/config/custom-plugin-dockerd-monitor.json,/config/custom-plugin-kubelet-monitor.json,/config/custom-plugin-containerd-monitor.json,/config/custom-plugin-pid-pressure.json # custom plugin monitor config files
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: xxxxxxxxx/ti-inf/node-problem-detector:v3.6
        imagePullPolicy: Always
        name: node-problem-detector
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run
          mountPropagation: HostToContainer
          name: rundir
        - mountPath: /var/log
          name: log
          readOnly: true
        - mountPath: /dev/kmsg
          name: kmsg
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /var/run/dbus
          name: systemddbus
      dnsPolicy: ClusterFirst
      hostPID: true
      imagePullSecrets:
      - name: qcloudregistrykey
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: node-problem-detector
      serviceAccountName: node-problem-detector
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /run
          type: ""
        name: rundir
      - hostPath:
          path: /var/log/
          type: ""
        name: log
      - hostPath:
          path: /dev/kmsg
          type: ""
        name: kmsg
      - hostPath:
          path: /etc/localtime
          type: ""
        name: localtime
      - hostPath:
          path: /var/run/dbus
          type: ""
        name: systemddbus
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
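The container command in the configuration above packs many monitor config files into a few long flags. When auditing such a DaemonSet, it can help to break the flags apart. The helper below is a small sketch for that purpose; the function name `monitor_configs` is an assumption, and the sample command is an abbreviated copy of the one in the manifest.

```python
from typing import Dict, List


def monitor_configs(command: List[str]) -> Dict[str, List[str]]:
    """Map each --config.<monitor> flag to the list of config files it names."""
    prefix = "--config."
    result: Dict[str, List[str]] = {}
    for arg in command:
        if not arg.startswith(prefix) or "=" not in arg:
            continue  # not a monitor config flag
        flag, value = arg.split("=", 1)
        name = flag[len(prefix):]
        # Each flag takes a comma-separated list of file paths.
        result.setdefault(name, []).extend(p for p in value.split(",") if p)
    return result


# Abbreviated copy of the container command from the DaemonSet.
command = [
    "/node-problem-detector",
    "--logtostderr",
    "--v=3",
    "--config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json",
    "--config.custom-plugin-monitor=/config/custom-plugin-fd-pressure.json,/config/network-problem-monitor.json",
]

for name, files in monitor_configs(command).items():
    print(name)
    for path in files:
        print("  " + path)
```

Each file listed this way should exist under /config inside the image or be mounted there; a path that is missing or misspelled is a likely reason for a monitor not reporting anything.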
Taking the zombie process monitor as an example: