Problems that can occur in a production Kubernetes cluster

  • Infrastructure daemon problems: e.g. the NTP service is down
  • Hardware problems: CPU, memory, or disk failures
  • Kernel problems: kernel deadlocks, corrupted file systems
  • Container runtime problems: the runtime daemon is unresponsive

When a node hits one of these problems, the Kubernetes control-plane components across the cluster are unaware of it, so Pods continue to be scheduled onto the faulty node, leading to incidents such as service interruptions.

node-problem-detector

node-problem-detector (NPD) can be thought of as a probe that inspects node health.
To address the problems above, the community introduced the node-problem-detector daemon, which collects node problems from the various daemons and makes them visible to the layers above.
As a node-diagnosis tool for Kubernetes, it can detect anomalies such as:
• Unresponsive runtime
• Unresponsive Linux kernel
• Network anomalies
• File descriptor anomalies
• Hardware problems such as CPU, memory, or disk failures

Fault classification


Problem-reporting strategy

node-problem-detector reports problems by setting a NodeCondition (updating the node status) or by creating an Event object:

  • NodeCondition: permanent faults are reported by setting a NodeCondition, which changes the node's status
  • Event: temporary faults are reported via Events to notify the relevant objects, e.g. all Pods currently running on the node
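Both reporting channels can be inspected from the command line (the node name below is a placeholder):

```shell
# List each condition type and its status on a node
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Watch Events as NPD creates them
kubectl get events --watch
```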

Hands-on practice

Community project: https://github.com/kubernetes/node-problem-detector

Installing with Helm

helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector

YAML manifest (original)

[root@VM-2-29-centos node-problem-detector]# cat npd-ds.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
        - name: node-problem-detector
          image: cncamp/node-problem-detector:v0.8.10
          securityContext:
            privileged: true
          resources:
            limits:
              cpu: "200m"
              memory: "100Mi"
            requests:
              cpu: "20m"
              memory: "20Mi"
          volumeMounts:
            - name: log
              mountPath: /log
              readOnly: true
      volumes:
        - name: log
          hostPath:
            path: /var/log/

[root@VM-2-29-centos node-problem-detector]# kubectl get node 10.0.2.29 -oyaml
...
  conditions:
  - lastHeartbeatTime: "2022-02-23T07:36:39Z"
    lastTransitionTime: "2022-02-12T08:32:04Z"
    message: Containerd service is up
    reason: ContainerdIsUp
    status: "False"
    type: ContainerdProblem
  - lastHeartbeatTime: "2022-02-23T07:36:39Z"
    lastTransitionTime: "2022-02-12T08:32:04Z"
    message: FD is Under Pressure
    reason: FDUnderPressure
    status: "False"
    type: FDPressure
...

After detecting a node anomaly, NPD records the details in the node's conditions. This only updates node status; there is no automatic remediation, so the loop is not closed on its own. To close it, you can launch NPD via an addon Pod, integrate it with a monitoring/alerting system, or build a custom controller.
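As a minimal sketch of the custom-controller idea, the loop below polls node conditions and taints any node whose KernelDeadlock condition is True. The taint key and the polling approach are illustrative; a real controller would watch the API server rather than poll.

```shell
#!/usr/bin/env bash
# Poll node conditions and taint faulty nodes so the scheduler avoids them.
# "npd/kernel-deadlock" is an illustrative taint key, not a standard one.
while true; do
  for node in $(kubectl get nodes -o name); do
    status=$(kubectl get "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="KernelDeadlock")].status}')
    if [ "$status" = "True" ]; then
      kubectl taint "$node" npd/kernel-deadlock=true:NoSchedule --overwrite
    fi
  done
  sleep 30
done
```

Pods without a matching toleration will then no longer be scheduled onto the tainted node.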

Fault-injection drill

You can exercise node-problem-detector by injecting a message into one of the logs it is watching. For example, suppose NPD is running the KernelMonitor. On a cluster node, run kubectl get events -w, then run sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg".

# sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"

You should see a KernelOops event appear in the event stream.
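Instead of watching interactively, events can also be filtered by reason:

```shell
# Show only the injected kernel problem
# (Event objects support the "reason" field selector)
kubectl get events --field-selector reason=KernelOops
```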

Launching NPD with an addon Pod

If you are using a custom cluster-bootstrap solution and do not need to override the default configuration, you can use an addon Pod to automate the deployment further.
Create node-problem-detector.yaml and save the configuration in the addon Pod directory /etc/kubernetes/addons/node-problem-detector on a control-plane node.

NPD's anomaly-handling behavior

  • NPD only captures anomaly events and updates node conditions; by itself it does not affect node status or scheduling:
lastHeartbeatTime: "2021-11-06T15:44:46Z"
lastTransitionTime: "2021-11-06T15:29:43Z"
message: 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.'
reason: DockerHung
status: "True"
type: KernelDeadlock
  • A custom controller is needed to watch the conditions NPD reports and taint the node, preventing Pods from being scheduled onto the faulty node
  • After the problem is fixed, restart the NPD Pod to clear the stale events
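A quick way to do that restart across all nodes, assuming NPD runs as a DaemonSet in kube-system (adjust the namespace to wherever you installed it):

```shell
# Restart every NPD Pod so conditions are re-evaluated from a clean state
kubectl -n kube-system rollout restart daemonset/node-problem-detector
```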

Troubleshooting common issues

SSH into a node on a private network
  • Create a Pod that runs sshd
  • Forward SSH requests to it through a load balancer
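A rough sketch of that jump-Pod approach (the image name is an assumption; any image that runs sshd on a known port works):

```shell
# Run an sshd Pod and expose it through a LoadBalancer Service
# (linuxserver/openssh-server listens on 2222 by default)
kubectl run ssh-jump --image=linuxserver/openssh-server --port=2222
kubectl expose pod ssh-jump --type=LoadBalancer --port=22 --target-port=2222
```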
Viewing logs

For services managed by systemd

journalctl -afu kubelet -S "2019-08-26 15:00:00"
-u unit: the systemd unit to read, e.g. kubelet
-f follow: tail the latest log entries
-a all: show all fields in full, including long or unprintable ones
-S since: start from a given time, e.g. -S "2019-08-26 15:00:00"

For standard container logs

kubectl logs -f -c <containername> <podname>
kubectl logs -f --all-containers <podname>
kubectl logs -f <podname> --previous

If the container's logs are dumped to a file by a shell, read them via exec:

kubectl exec -it xxx -- tail -f /path/to/log

Appendix: production configuration


[root@VM-2-29-centos node-problem-detector]# kubectl get ds -nti-inf node-problem-detector -oyaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    cpaas.io/creator: admin
    cpaas.io/updated-at: "2022-02-12T07:58:59Z"
    deprecated.daemonset.template.generation: "2"
    meta.helm.sh/release-name: node-problem-detector
    meta.helm.sh/release-namespace: ti-inf
  creationTimestamp: "2022-02-11T19:52:30Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  managedFields:
  - apiVersion: apps/v1
    manager: Go-http-client
    operation: Update
    time: "2022-02-11T19:52:30Z"
  - apiVersion: apps/v1
    manager: kube-controller-manager
    operation: Update
    time: "2022-02-12T09:20:03Z"
  name: node-problem-detector
  namespace: ti-inf
  resourceVersion: "6040793"
  selfLink: /apis/apps/v1/namespaces/ti-inf/daemonsets/node-problem-detector
  uid: 07b214b9-9279-465b-a900-eae233928a3e
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      annotations:
        cpaas.io/creator: admin
      creationTimestamp: null
      labels:
        app: node-problem-detector
    spec:
      containers:
      - command:
        - /node-problem-detector
        - --logtostderr
        - --v=3
        - --config.zombie-process-monitor=/config/zombie-process-monitor.json  # zombie-process monitor
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json  # system-log monitors
        - --config.custom-plugin-monitor=/config/custom-plugin-fd-pressure.json,/config/network-problem-monitor.json,/config/costom-plugin-thread-pressure.json,/config/systemd-monitor-counter.json,/config/custom-plugin-dockerd-monitor.json,/config/custom-plugin-kubelet-monitor.json,/config/custom-plugin-containerd-monitor.json,/config/custom-plugin-pid-pressure.json  # custom plugin monitor config files
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: xxxxxxxxx/ti-inf/node-problem-detector:v3.6
        imagePullPolicy: Always
        name: node-problem-detector
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run
          mountPropagation: HostToContainer
          name: rundir
        - mountPath: /var/log
          name: log
          readOnly: true
        - mountPath: /dev/kmsg
          name: kmsg
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /var/run/dbus
          name: systemddbus
      dnsPolicy: ClusterFirst
      hostPID: true
      imagePullSecrets:
      - name: qcloudregistrykey
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: node-problem-detector
      serviceAccountName: node-problem-detector
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /run
          type: ""
        name: rundir
      - hostPath:
          path: /var/log/
          type: ""
        name: log
      - hostPath:
          path: /dev/kmsg
          type: ""
        name: kmsg
      - hostPath:
          path: /etc/localtime
          type: ""
        name: localtime
      - hostPath:
          path: /var/run/dbus
          type: ""
        name: systemddbus
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

The zombie-process monitor is one example of such a custom plugin: it periodically counts zombie processes on the node and reports a condition when the count is abnormal.
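A sketch of how such a monitor could be wired up. The field names follow NPD's custom-plugin-monitor format (exit 0 = OK, exit 1 = problem), but the condition names, threshold, and script path below are illustrative, not the production values:

```shell
# Illustrative /config/zombie-process-monitor.json for NPD's custom plugin
# monitor; condition/reason names and the threshold are assumptions.
cat > /config/zombie-process-monitor.json <<'EOF'
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "10s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "zombie-process-custom-plugin-monitor",
  "conditions": [
    {
      "type": "ZombieProcessPressure",
      "reason": "NoZombieProcessPressure",
      "message": "zombie process count is under threshold"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ZombieProcessPressure",
      "reason": "TooManyZombieProcesses",
      "path": "/config/plugin/check_zombie.sh"
    }
  ]
}
EOF

# Plugin script invoked by NPD: exit 0 = OK, exit 1 = problem detected
cat > /config/plugin/check_zombie.sh <<'EOF'
#!/bin/bash
count=$(ps -eo stat= | grep -c '^Z')
if [ "$count" -gt 10 ]; then
  echo "$count zombie processes found"
  exit 1
fi
echo "zombie process count $count is under threshold"
exit 0
EOF
chmod +x /config/plugin/check_zombie.sh
```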
