Limiting Evicted Pods in k8s to Prevent Cluster Problems

Background

  • In production we occasionally see large numbers of evicted pods, like the listing below. If they are not cleaned up, abnormal pods accumulate in the cluster and make it sluggish.
[root@k8s-master-1 rollout]# kubectl get pods -A
NAMESPACE     NAME                                          READY   STATUS    RESTARTS   AGE
default       busybox-deployment-rollout-55cdb64f8b-2c4dz   0/1     Evicted   0          1s
default       busybox-deployment-rollout-55cdb64f8b-7sx2r   0/1     Pending   0          1s
default       busybox-deployment-rollout-55cdb64f8b-9xnw4   0/1     Evicted   0          2s
default       busybox-deployment-rollout-55cdb64f8b-bmk28   0/1     Evicted   0          3s
default       busybox-deployment-rollout-55cdb64f8b-gz6h2   0/1     Evicted   0          4s
default       busybox-deployment-rollout-55cdb64f8b-k5wf9   0/1     Evicted   0          2s
default       busybox-deployment-rollout-55cdb64f8b-llk4d   0/1     Evicted   0          4s
default       busybox-deployment-rollout-55cdb64f8b-mlxc9   0/1     Evicted   0          4s
default       busybox-deployment-rollout-55cdb64f8b-nwv8r   0/1     Evicted   0          4s
default       busybox-deployment-rollout-55cdb64f8b-xzbch   0/1     Evicted   0          4s
kube-system   coredns-7d9b46dfb8-kgwhr                      1/1     Running   0          98m
kube-system   kube-flannel-ds-gxfjc                         1/1     Running   0          98m
kube-system   kube-flannel-ds-lb9sl                         0/1     Pending   0          35d
monitor       monitor-prometheus-node-exporter-hxrkc        0/1     Evicted   0          41s
monitor       monitor-prometheus-node-exporter-p95fh        0/1     Pending   0          35d
Eviction-related parameters

Eviction means just that: when a node becomes abnormal, Kubernetes has mechanisms to evict the pods on that node in order to keep workloads available.

Kubernetes currently has two eviction mechanisms, implemented by kube-controller-manager and kubelet respectively.

  • Eviction in kube-controller-manager. kube-controller-manager is made up of multiple controllers, and eviction is implemented by the node controller. It periodically checks the state of every node, and when a node has been NotReady beyond a time limit, it evicts all pods on that node.

  • kube-controller-manager exposes the following startup flags to control eviction:

    • pod-eviction-timeout: how long a node must be down before the eviction mechanism kicks in and pods on the dead node are evicted; defaults to 5min.
    • node-eviction-rate: the eviction rate, i.e. the rate at which nodes are drained, implemented with a token-bucket rate limiter. Defaults to 0.1, i.e. 0.1 nodes per second. Note this is a rate of nodes, not pods: it amounts to draining one node every 10s.
    • secondary-node-eviction-rate: the secondary eviction rate. When too many nodes in the cluster are down, eviction slows to this rate; defaults to 0.01.
    • unhealthy-zone-threshold: the unhealthy-zone threshold, which decides when the secondary rate kicks in; defaults to 0.55, i.e. a zone is considered unhealthy once more than 55% of its nodes are down.
    • large-cluster-size-threshold: the large-cluster threshold; a zone with more nodes than this is treated as a large cluster. In a large cluster, once more than 55% of nodes are down the eviction rate drops to 0.01; in a small cluster it drops straight to 0.
    • terminated-pod-gc-threshold: the maximum number of terminated pods that may be kept before the terminated-pod garbage collector starts deleting them. A value of 0 or less disables garbage collection of terminated pods.
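As a quick sanity check of the drain rates just described: node-eviction-rate is a rate in nodes per second, so its reciprocal is the interval between node drains. This is a standalone calculation using the default values quoted above; nothing here talks to a cluster.

```shell
# Interval between node drains = 1 / node-eviction-rate.
awk 'BEGIN {
  printf "node-eviction-rate=0.1 -> one node every %.0fs\n", 1/0.1
  printf "secondary rate=0.01    -> one node every %.0fs\n", 1/0.01
}'
# -> one node every 10s at the normal rate, every 100s at the secondary rate
```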
  • kubelet's eviction mechanism

    • When the node is under resource pressure, kubelet runs its eviction policy. Eviction takes pod priority, resource usage, and resource requests into account; among pods of equal priority, the pod with the highest usage relative to its requests is evicted first.

    • kube-controller-manager's eviction is coarse-grained: it drains every pod on a node. kubelet's is fine-grained: it evicts only some of the pods on a node, chosen according to the pods' QoS classes. kubelet periodically checks local resources such as memory and disk, and when they run short it evicts some pods in priority order.

    Eviction thresholds come in two kinds, soft (Soft Eviction Thresholds) and hard (Hard Eviction Thresholds):

    • Soft eviction threshold: when the node's memory/disk usage crosses the threshold, kubelet does not reclaim resources right away. If the metric drops back below the threshold within the grace period, no eviction happens; if it stays above for the whole period, eviction is triggered.
    • Hard eviction: much simpler; as soon as the threshold is crossed, pods are evicted from the node immediately.

    kubelet exposes the following flags to control eviction:

    • eviction-soft: the soft eviction thresholds, a set of conditions such as memory.available<1.5Gi. When one is met, kubelet does not evict immediately but waits for eviction-soft-grace-period; if the condition still holds after that, a pod eviction is triggered.
    • eviction-soft-grace-period: the soft eviction grace period, i.e. the delay between a soft eviction signal and the eviction itself; 90s by default.
    • eviction-max-pod-grace-period: the maximum pod grace period used during eviction, i.e. the time between the stop signal and the kill.
    • eviction-pressure-transition-period: 5 minutes by default; the pressure transition period. While a threshold is exceeded, the node is marked with memory pressure or disk pressure and pod eviction starts.
    • eviction-minimum-reclaim: the minimum amount of resources every eviction must reclaim.
    • eviction-hard: the hard eviction thresholds, also a set of conditions such as memory.available<1Gi: when the node's available memory drops below 1Gi, a pod eviction is triggered immediately.
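Putting the soft and hard thresholds together, the decision kubelet makes can be sketched as a tiny function. This is a toy model, not kubelet's actual code; the thresholds (hard at 1Gi, soft at 1.5Gi with a 90s grace period) simply reuse the example values from the flags above.

```shell
# decide <memory.available in Mi> <seconds the soft threshold has been breached>
# Hard threshold (1Gi): evict immediately. Soft threshold (1.5Gi): evict only
# once the breach has outlasted the eviction-soft-grace-period (90s here).
decide() {
  avail_mi=$1; breached_s=$2
  if [ "$avail_mi" -lt 1024 ]; then
    echo hard-evict
  elif [ "$avail_mi" -lt 1536 ] && [ "$breached_s" -ge 90 ]; then
    echo soft-evict
  else
    echo no-evict
  fi
}
decide 900  0    # below the hard threshold -> hard-evict
decide 1200 30   # below the soft threshold, still inside the grace period -> no-evict
decide 1200 120  # soft threshold breached longer than the grace period -> soft-evict
```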

Simulating evicted pods

  • Modify the kubelet parameters, setting an eviction threshold so high it is almost always met
[root@k8s-master-1 rollout]# cat /etc/kubernetes/kubelet.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
....................................... # omitted
cgroupDriver: systemd
cgroupsPerQOS: true
eventBurst: 10
eventRecordQPS: 5
evictionHard:
  imagefs.available: 15%
  memory.available: 100Mi
  nodefs.available: 95%                    # evict pods unless at least 95% of the node's disk is free, i.e. almost always
  nodefs.inodesFree: 5%
evictionPressureTransitionPeriod: 5m0s
....................................... # omitted
  • The test yaml. Be sure to pin it to a specific node; otherwise evicted pods will not be produced in bulk
[root@k8s-master-1 rollout]# cat busybox-deployment-rollout.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment-rollout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox-2
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c","sleep 10000"]
      nodeName: k8s-master-1          
      
# Check how many evicted pods were produced
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
476
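To see which namespaces the evicted pods are concentrated in, the same `kubectl get pods -A` output can be aggregated with awk. The sample rows below are taken from the listing earlier so the pipeline can be shown without a live cluster; in practice the variable would be replaced by the kubectl command itself.

```shell
# Count Evicted pods per namespace: in `kubectl get pods -A` output,
# column 1 is the namespace and column 4 is the status.
sample='default       busybox-deployment-rollout-55cdb64f8b-2c4dz   0/1   Evicted   0   1s
default       busybox-deployment-rollout-55cdb64f8b-9xnw4   0/1   Evicted   0   2s
kube-system   coredns-7d9b46dfb8-kgwhr                      1/1   Running   0   98m
monitor       monitor-prometheus-node-exporter-hxrkc        0/1   Evicted   0   41s'
echo "$sample" | awk '$4 == "Evicted" { n[$1]++ } END { for (ns in n) print ns, n[ns] }' | sort
# -> default 2
#    monitor 1
```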

Adding the terminated-pod-gc-threshold parameter to kube-controller-manager

[root@k8s-master-1 rollout]# cat /usr/lib/systemd/system/kube-controller-manager.service 
[Unit]                                                                     
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
[Service]      
ExecStart=/usr/bin/kube-controller-manager \
    --port=10252 \
    --secure-port=10257 \
    --bind-address=127.0.0.1 \
    --kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
    --service-cluster-ip-range=10.0.0.0/16 \
    --cluster-name=kubernetes \
    --cluster-signing-cert-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
    --cluster-signing-key-file=/etc/kubernetes/ssl/kube-apiserver-ca-key.pem \
    --cluster-signing-duration=87600h \
    --allocate-node-cidrs=true \
    --cluster-cidr=10.70.0.0/16 \
    --node-cidr-mask-size=24 \
    --root-ca-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
    --service-account-private-key-file=/etc/kubernetes/ssl/service.key \
    --use-service-account-credentials=true \
    --leader-elect=true \
    --feature-gates=RotateKubeletServerCertificate=true,RotateKubeletClientCertificate=true,EphemeralContainers=true \
    --controllers=*,bootstrapsigner,tokencleaner \
    --tls-cert-file=/etc/kubernetes/ssl/kube-controller-manager.pem \
    --tls-private-key-file=/etc/kubernetes/ssl/kube-controller-manager-key.pem \
    --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.pem \
    --requestheader-allowed-names=front-proxy-client   \
    --requestheader-extra-headers-prefix=X-Remote-Extra-  \
    --requestheader-group-headers=X-Remote-Group     \
    --requestheader-username-headers=X-Remote-User   \
    --horizontal-pod-autoscaler-use-rest-clients=true \
    --alsologtostderr=true \
    --logtostderr=false \
    --log-dir=/var/log/kubernetes \
    --v=2  \
    --terminated-pod-gc-threshold=100 \            # cap on how many terminated pods may be kept before the terminated-pod GC starts deleting them
Restart=on-failure
RestartSec=5   
[Install]      
WantedBy=multi-user.target

# Restart kube-controller-manager
[root@k8s-master-1 rollout]# systemctl daemon-reload && systemctl restart kube-controller-manager.service
  • After setting terminated-pod-gc-threshold=100, the number of Evicted pods no longer grows, but k8s still does not automatically reclaim the existing abnormal pods
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
  • Redeploy and watch whether the number of Evicted pods exceeds 5. It still does, but kube-controller-manager can now be seen reclaiming Evicted pods
# Delete the deployment
[root@k8s-master-1 rollout]# kubectl delete -f busybox-deployment-rollout.yaml 
deployment.apps "busybox-deployment-rollout" deleted

# Clean up the evicted pods (kubectl get pods -A puts the namespace in column 1, which must be passed along with the pod name)
[root@k8s-master-1 rollout]# kubectl get pods -A | grep "Evicted" | awk '{print $2, "-n", $1}' | xargs -L1 kubectl delete pod

# Check whether any evicted pods remain
[root@k8s-master-1 rollout]# kubectl get pods -A  | grep "Evicted" | wc -l
0
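Because `kubectl get pods -A` puts the namespace in column 1, a cleanup pipeline has to hand both namespace and pod name to `kubectl delete`, or pods outside the default namespace are silently missed. The namespace-aware pipeline can be dry-run on sample rows, printing the delete commands instead of executing them:

```shell
# Dry run: emit one namespace-aware delete command per Evicted pod.
# Feeding the awk output to `sh`, with `kubectl get pods -A` as the
# real input, would perform the actual deletions.
sample='default  busybox-deployment-rollout-55cdb64f8b-2c4dz  0/1  Evicted  0  1s
monitor  monitor-prometheus-node-exporter-hxrkc       0/1  Evicted  0  41s
default  busybox-deployment-rollout-55cdb64f8b-7sx2r  0/1  Pending  0  1s'
echo "$sample" | awk '$4 == "Evicted" { print "kubectl delete pod -n", $1, $2 }'
# -> kubectl delete pod -n default busybox-deployment-rollout-55cdb64f8b-2c4dz
#    kubectl delete pod -n monitor monitor-prometheus-node-exporter-hxrkc
```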


# Redeploy the test yaml
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
9
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
14
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
18
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
7
  • Given the results above, adjust the kube-controller-manager parameters again
[root@k8s-master-1 rollout]# cat  /usr/lib/systemd/system/kube-controller-manager.service 
[Unit]                                                                     
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
[Service]      
ExecStart=/usr/bin/kube-controller-manager \
    --port=10252 \
    --secure-port=10257 \
    --bind-address=127.0.0.1 \
    --kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
    --service-cluster-ip-range=10.0.0.0/16 \
    --cluster-name=kubernetes \
    --cluster-signing-cert-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
    --cluster-signing-key-file=/etc/kubernetes/ssl/kube-apiserver-ca-key.pem \
    --cluster-signing-duration=87600h \
    --allocate-node-cidrs=true \
    --cluster-cidr=10.70.0.0/16 \
    --node-cidr-mask-size=24 \
    --root-ca-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
    --service-account-private-key-file=/etc/kubernetes/ssl/service.key \
    --use-service-account-credentials=true \
    --leader-elect=true \
    --feature-gates=RotateKubeletServerCertificate=true,RotateKubeletClientCertificate=true,EphemeralContainers=true \
    --controllers=*,bootstrapsigner,tokencleaner \
    --tls-cert-file=/etc/kubernetes/ssl/kube-controller-manager.pem \
    --tls-private-key-file=/etc/kubernetes/ssl/kube-controller-manager-key.pem \
    --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.pem \
    --requestheader-allowed-names=front-proxy-client   \
    --requestheader-extra-headers-prefix=X-Remote-Extra-  \
    --requestheader-group-headers=X-Remote-Group     \
    --requestheader-username-headers=X-Remote-User   \
    --horizontal-pod-autoscaler-use-rest-clients=true \
    --alsologtostderr=true \
    --logtostderr=false \
    --log-dir=/var/log/kubernetes \
    --v=2  \
    --terminated-pod-gc-threshold=5 \
    --concurrent-gc-syncs=50 \                    # GC worker concurrency; raising it speeds up reclamation
    --node-eviction-rate=0.05                     # lowered from the default 0.1: when nodes are unhealthy, drain one node every 20s instead of every 10s
Restart=on-failure
RestartSec=5   
[Install]      
WantedBy=multi-user.target

# After adding these two parameters, evicted pods accumulate more slowly and are reclaimed faster; this is awkward to demonstrate, so no output is shown here
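The terminated-pod GC itself can also be modeled without a cluster. When the number of terminated pods exceeds terminated-pod-gc-threshold, the controller deletes the oldest ones until only the threshold remains. This is a toy model of that behavior; the pod names and creation timestamps are made up.

```shell
# Keep at most $threshold terminated pods, deleting the oldest first.
# Column 2 is a fake creation timestamp (smaller = older).
threshold=5
terminated='pod-a 100
pod-b 230
pod-c 120
pod-d 300
pod-e 90
pod-f 250
pod-g 180'
total=$(echo "$terminated" | wc -l)
excess=$((total - threshold))
[ "$excess" -gt 0 ] && echo "$terminated" | sort -k2 -n | head -n "$excess" | awk '{print "delete", $1}'
# -> delete pod-e
#    delete pod-a
```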

References:

https://support.huaweicloud.com/cce_faq/cce_faq_00209.html
https://kubernetes.io/zh-cn/docs/reference/command-line-tools-reference/kube-controller-manager/
