Troubleshooting: Cloud Disks Unexpectedly Unmounted During Cluster Stress Testing

1. Problem Description

In a Kubernetes 1.20 cluster using cloud-disk PVs/PVCs, stress testing drove up load on the tenant-side Ceph storage, which made the etcd cluster intermittently unavailable; as a result, kubelet could no longer reliably refresh Pod, PV, and PVC state from the apiserver.
The observed symptom: the Pod stayed Running and its containers never restarted, yet kubelet unmounted and detached the cloud disk from the node and then attached and mounted it again. After the remount the disk came back under a different device name, so the original mount point inside the container became unusable.

The workaround was to manually restart the affected Pod or container.

2. Troubleshooting

Start with the cloud-disk (FlexVolume) plugin logs:

Around 05:11 in the morning, after apiserver requests timed out, kubelet unmounted the Prometheus PV and then detached it from the node:

vim disk.e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2.root.log.INFO.20211012
# unmount
I1012 05:10:53.370019 1754833 main.go:100] handler op: --- unmount ---
I1012 05:10:53.370042 1754833 flex_volume_plugin.go:193] -Unmount- dir: /var/lib/kubelet/pods/3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e/volumes/incloud~disk/59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:10:53.388438 1754833 flex_volume_plugin.go:202] -Unmount- end.
I1012 05:10:53.388522 1754833 main.go:43] ResponseSuccess: "{\"status\":\"Success\"}"

I1012 05:11:06.534030 1754988 main.go:100] handler op: --- unmountdevice ---
I1012 05:11:06.534068 1754988 flex_volume_plugin.go:110] -UnmountDevice- dir: /var/lib/kubelet/plugins/kubernetes.io/flexvolume/incloud/disk/mounts/59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:06.549307 1754988 flex_volume_plugin.go:121] -UnmountDevice- volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
...
I1012 05:11:10.197807 1754988 flex_volume_plugin.go:154] -UnmountDevice- end.
I1012 05:11:10.197965 1754988 main.go:43] ResponseSuccess: "{\"status\":\"Success\"}"

I1012 05:11:10.405268 1755036 main.go:100] handler op: --- detach ---
I1012 05:11:10.405282 1755036 flex_volume_plugin.go:66] -Detach- volume 59acf53e-fc5e-433e-839e-cd53ee0e7566 from node a351195d-0cfe-4308-9e7c-17464866056f
I1012 05:11:10.405284 1755036 flex_volume_plugin.go:68] -Detach- volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
...
I1012 05:11:20.838617 1755036 ebs_api_volumes.go:131] Successfully detached volume: 59acf53e-fc5e-433e-839e-cd53ee0e7566 from compute: a351195d-0cfe-4308-9e7c-17464866056f
I1012 05:11:20.838715 1755036 main.go:43] ResponseSuccess: "{\"status\":\"Success\"}"


I1012 05:11:20.968472 1755141 main.go:100] handler op: --- attach ---
I1012 05:11:20.968554 1755141 flex_volume_plugin.go:46] -Attach- volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:20.968560 1755141 flex_volume_plugin.go:50] -Attach- AttachVolume begin.
...
I1012 05:11:32.257872 1755141 ebs_api_volumes.go:95] AttachDisk volumeattach.Create end.
I1012 05:11:32.257878 1755141 ebs_api_volumes.go:100] Successfully attached 59acf53e-fc5e-433e-839e-cd53ee0e7566 volume to a351195d-0cfe-4308-9e7c-17464866056f compute
...
I1012 05:11:43.765625 1755141 flex_volume_plugin.go:61] -Attach- deviceOnCloud: /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8
I1012 05:11:43.765730 1755141 main.go:43] ResponseSuccess: "{\"status\":\"Success\",\"device\":\"/dev/disk/by-id/virtio-59acf53e-fc5e-433e-8\"}"

I1012 05:11:44.083264 1755611 main.go:100] handler op: --- waitforattach ---
I1012 05:11:44.083298 1755611 flex_volume_plugin.go:207] -WaitForAttach- device: /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8, node: a351195d-0cfe-4308-9e7c-17464866056f
I1012 05:11:44.083312 1755611 flex_volume_plugin.go:215] -WaitForAttach- Checking volume 59acf53e-fc5e-433e-839e-cd53ee0e7566 is attached.
I1012 05:11:44.083314 1755611 volume_manager.go:170] process debug:GetDeviceByVolumeID(59acf53e-fc5e-433e-839e-cd53ee0e7566) called
I1012 05:11:44.083317 1755611 ebs_api_volumes.go:235] GetDevicePath volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:44.132226 1755611 ebs_api_volumes.go:246] GetDevicePath ExponentialBackoff begin.
I1012 05:11:44.132236 1755611 ebs_api_volumes.go:268] GetDevicePathBySerialID volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:44.132320 1755611 ebs_api_volumes.go:283] Found disk attached as virtio-59acf53e-fc5e-433e-8; full devicepath: /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8

I1012 05:11:48.307920 1755696 main.go:100] handler op: --- mountdevice ---
I1012 05:11:48.307979 1755696 flex_volume_plugin.go:76] -MountDevice- deviceMountPath: /var/lib/kubelet/plugins/kubernetes.io/flexvolume/incloud/disk/mounts/59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:48.307987 1755696 flex_volume_plugin.go:78] -MountDevice- volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:48.307992 1755696 volume_manager.go:170] process debug:GetDeviceByVolumeID(59acf53e-fc5e-433e-839e-cd53ee0e7566) called
I1012 05:11:48.307998 1755696 ebs_api_volumes.go:235] GetDevicePath volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566

I1012 05:11:53.904309 1755908 main.go:100] handler op: --- mount ---
I1012 05:11:53.904342 1755908 flex_volume_plugin.go:159] -Mount- deviceMountPath: /var/lib/kubelet/pods/3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e/volumes/incloud~disk/59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:53.904346 1755908 flex_volume_plugin.go:161] -Mount- volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:53.904349 1755908 volume_manager.go:170] process debug:GetDeviceByVolumeID(59acf53e-fc5e-433e-839e-cd53ee0e7566) called
I1012 05:11:53.904352 1755908 ebs_api_volumes.go:235] GetDevicePath volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:53.963697 1755908 ebs_api_volumes.go:246] GetDevicePath ExponentialBackoff begin.
I1012 05:11:53.963724 1755908 ebs_api_volumes.go:268] GetDevicePathBySerialID volumeID: 59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:53.963802 1755908 ebs_api_volumes.go:283] Found disk attached as virtio-59acf53e-fc5e-433e-8; full devicepath: /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8
I1012 05:11:53.963808 1755908 flex_volume_plugin.go:167] -Mount- FormatAndMount deviceOnCloud: /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8
I1012 05:11:53.975556 1755908 flex_volume_plugin.go:182] -Mount- FormatAndMount deviceMountPath: /var/lib/kubelet/pods/3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e/volumes/incloud~disk/59acf53e-fc5e-433e-839e-cd53ee0e7566
I1012 05:11:53.983676 1755908 mount_linux.go:507] `fsck` error fsck from util-linux 2.31.1
/dev/vdd is mounted.
e2fsck: Cannot continue, aborting.


I1012 05:11:54.000934 1755908 flex_volume_plugin.go:188] -Mount- end.
I1012 05:11:54.001055 1755908 main.go:43] ResponseSuccess: "{\"status\":\"Success\"}"


root@e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2:/var/log/incloud-volume# fsck -n /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
Warning!  /dev/vdd is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/vdd: clean, 57/2621440 files, 392546/10485760 blocks
root@e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2:/var/log/incloud-volume# fsck -a /dev/disk/by-id/virtio-59acf53e-fc5e-433e-8
fsck from util-linux 2.31.1
/dev/vdd is mounted.
e2fsck: Cannot continue, aborting.


The logs show a complete unmount → detach → attach → mount cycle.

Two questions arise:

1. Why does the directory inside the container become read-only / unusable after the remount?
2. What triggered the unmount in the first place, and where did it come from?

1. For the first question, the volume was detached manually on the node hosting the pod:

/var/lib/kubelet/volume-plugins/incloud~disk/disk detach "[volumeID]" 1

This does not trigger the unmountdevice or unmount operations. After the detach, df -h still shows the mount entry even though the disk is gone, and an ls on the mount path fails. The directory inside the pod still exists, but its contents are lost and it cannot be written to.
Restarting kubelet triggers an automatic attach, but not mountdevice or mount, because the pod was never deleted; after the detach the device name has changed, yet the stale mount information is never refreshed.
So the directory inside the pod remains, its contents are gone and cannot be written, and the mount entry inside the pod still references the old device name, which makes the directory inaccessible. Restarting the pod resolves it.

2. For the second question, the unmount itself could not be reproduced. The initial suspicion: the slow apiserver meant that when kubelet's volume manager rebuilt DesiredStateOfWorld it could not read the PVC information, while the volume already on the node was still recorded in actualStateOfWorld, so the unmount flow was triggered automatically; afterwards kubelet restarted and went through the mount flow again.

Digging into the kubelet code and logs

First, a quick overview of how kubelet handles volumes (see https://zhuanlan.zhihu.com/p/35257745).

kubelet runs several storage-related goroutines:

1. VolumeManager
kubelet uses the VolumeManager to prepare storage for pods: once a volume is ready it is mounted on the node hosting the pod and then mounted into the directory specified by the container at startup; the VolumeManager also unmounts and removes volumes that are no longer in use.
When kubelet starts, it creates the VolumeManager via volumemanager.NewVolumeManager.
Its main fields include kubeClient, volumePluginMgr, desiredStateOfWorld, actualStateOfWorld, operationExecutor, desiredStateOfWorldPopulator, and reconciler.
The manager is started in pkg/kubelet/volumemanager/volume_manager.go (Run), which launches two goroutines: desiredStateOfWorldPopulator and vm.reconciler, as sketched below.
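For orientation, an abridged paraphrase of volumeManager.Run (based on pkg/kubelet/volumemanager/volume_manager.go around v1.20; CSI plugin startup, metrics registration, and most logging are omitted, so treat it as a sketch rather than the verbatim source):

// Abridged paraphrase of volumeManager.Run; details omitted for brevity.
func (vm *volumeManager) Run(sourcesReady config.SourcesReady, stopCh <-chan struct{}) {
	defer runtime.HandleCrash()

	// Goroutine 1: keep desiredStateOfWorld in sync with the pods kubelet knows about.
	go vm.desiredStateOfWorldPopulator.Run(sourcesReady, stopCh)

	// Goroutine 2: compare desiredStateOfWorld with actualStateOfWorld and issue
	// attach/mount/unmount/detach operations until the two converge.
	go vm.reconciler.Run(stopCh)

	<-stopCh
	klog.Infof("Shutting down Kubelet Volume Manager")
}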

2. desiredStateOfWorld and actualStateOfWorld
desiredStateOfWorld is built from the Pod information obtained via the podManager, from which the volume information is extracted;
actualStateOfWorld reflects the volumes actually attached/mounted on the node;
desiredStateOfWorldPopulator builds desiredStateOfWorld from the podManager;
the reconciler compares actualStateOfWorld with desiredStateOfWorld and then creates, deletes, or modifies volumes until the two match. The relevant cache interfaces are sketched below.
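For reference, an abridged paraphrase of the two caches from pkg/kubelet/volumemanager/cache, listing only the methods that matter for this incident (the exact method set and signatures may differ slightly between versions):

// desiredStateOfWorld: which volumes *should* be attached/mounted, and for which pods.
type DesiredStateOfWorld interface {
	AddPodToVolume(podName types.UniquePodName, pod *v1.Pod, volumeSpec *volume.Spec,
		outerVolumeSpecName string, volumeGidValue string) (v1.UniqueVolumeName, error)
	DeletePodFromVolume(podName types.UniquePodName, volumeName v1.UniqueVolumeName)
	VolumeExists(volumeName v1.UniqueVolumeName) bool
	PodExistsInVolume(podName types.UniquePodName, volumeName v1.UniqueVolumeName) bool
	AddErrorToPod(podName types.UniquePodName, err string)
	// ...
}

// actualStateOfWorld: which volumes *are* attached/mounted on this node right now.
type ActualStateOfWorld interface {
	GetAllMountedVolumes() []MountedVolume
	GetUnmountedVolumes() []AttachedVolume
	MarkVolumeAsDetached(volumeName v1.UniqueVolumeName, nodeName types.NodeName)
	// ...
}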

3. desiredStateOfWorldPopulator.Run
(1) findAndAddNewPods
Gets all pods from the podManager and calls processPodVolumes to update desiredStateOfWorld.
(2) findAndRemoveDeletedPods
Iterates over all volumesToMount and removes volumes whose pod no longer exists in the podManager.
It calls dswp.kubeContainerRuntime.GetPods(false) to fetch the list of running pods.
A volume can belong to multiple pods, a pod can contain multiple containers, and each container may use the volume; so the populator has to check the containers of the pods that use the volume and only removes the volume once no container is still using it.


desiredStateOfWorldPopulator.Run only builds the desiredStateOfWorld object; it performs no actual volume mount/unmount operations itself. The actual operations are carried out by reconciler.Run. A sketch of the populator loop is shown below.
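An abridged paraphrase of the populator loop (pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go, around v1.20; the rate limiting and timing details are omitted):

func (dswp *desiredStateOfWorldPopulator) populatorLoop() {
	// Add the volumes of all pods known to the podManager to desiredStateOfWorld.
	// This is where processPodVolumes / createVolumeSpec / getPVCExtractPV run,
	// and where the "failed to fetch PVC from API server" error surfaced.
	dswp.findAndAddNewPods()

	// Remove volumes that belong to pods which no longer exist
	// (rate-limited in the real code, because it queries the container runtime).
	dswp.findAndRemoveDeletedPods()
}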

4. reconciler.Run
(1) It first walks the volumes already mounted according to actualStateOfWorld and checks whether each one still exists in desiredStateOfWorld; if not, the volume is unmounted.
(2) It then takes the volumes that should be mounted from desiredStateOfWorld, compares them with actualStateOfWorld, and mounts whatever is not mounted yet.


The kubelet log from that time window:

Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 etcd[1538534]: 2021-10-11 21:10:23.137619 W | etcdserver: read-only range request "key:\"/registry/apiextensions.k8s.io/customresourcedefinitions/\" range_end:\"/registry/apiextensions.k8s.io/customresourcedefinitions0\" count_only:true " with result "error:etcdserver: leader changed" took too long (15.222924688s > [ETCD_WARN_APPLY_DURATION=1s]) to execute
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 etcd[1538534]: 2021-10-11 21:10:23.137638 W | etcdserver: read-only range request "key:\"/registry/csinodes/master1\" " with result "error:etcdserver: leader changed" took too long (10.95798983s > [ETCD_WARN_APPLY_DURATION=1s]) to execute
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:23.166912 1753810 desired_state_of_world_populator.go:338] Error processing volume "data" for pod "prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)": error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: leader changed
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: W1012 05:10:23.166912 1753810 status_manager.go:550] Failed to get status for pod "component-controller-manager-master2_kube-system(f960556c062bbb8bf848ea1b6852b8cf)": etcdserver: leader changed
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:23.214420 1753810 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "config" (UniqueName: "kubernetes.io/configmap/3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e-config") pod "prometheus-0" (UID: "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e")
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:23.214509 1753810 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "prometheus-token-slrb5" (UniqueName: "kubernetes.io/secret/3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e-prometheus-token-slrb5") pod "prometheus-0" (UID: "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e")
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:23.214576 1753810 reconciler.go:157] Reconciler: start to sync state
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:23.338937 1753810 kubelet.go:1670] Unable to attach or mount volumes for pod "prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)": unmounted volumes=[data prometheus-token-slrb5 config], unattached volumes=[data prometheus-token-slrb5 config]: error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: leader changed; skipping pod
Oct 12 05:10:27 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:23.339090 1753810 pod_workers.go:191] Error syncing pod 3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e ("prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)"), skipping: unmounted volumes=[data prometheus-token-slrb5 config], unattached volumes=[data prometheus-token-slrb5 config]: error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: leader changed

Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 etcd[1538534]: 2021-10-11 21:10:53.140175 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims/monitoring/data-prometheus-0\" " with result "error:etcdserver: request timed out" took too long (28.861299889s > [ETCD_WARN_APPLY_DURATION=1s]) to execute
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: W1012 05:10:53.198627 1753810 status_manager.go:550] Failed to get status for pod "kube-controller-manager-master2_kube-system(fcf1cd54ab510385f2c48b5e80313a1e)": etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.198627 1753810 desired_state_of_world_populator.go:338] Error processing volume "data" for pod "prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)": error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.245889 1753810 reconciler.go:603] updateStates in reconciler: could not get node status with error etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:53.346910 1753810 reconciler.go:196] operationExecutor.UnmountVolume started for volume "59acf53e-fc5e-433e-839e-cd53ee0e7566" (UniqueName: "incloud/disk/59acf53e-fc5e-433e-839e-cd53ee0e7566") pod "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e" (UID: "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e")
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.373091 1753810 kubelet.go:1670] Unable to attach or mount volumes for pod "prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)": unmounted volumes=[data], unattached volumes=[data prometheus-token-slrb5 config]: error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: request timed out; skipping pod
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.373159 1753810 pod_workers.go:191] Error syncing pod 3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e ("prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)"), skipping: unmounted volumes=[data], unattached volumes=[data prometheus-token-slrb5 config]: error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 systemd[1]: Started Kubernetes systemd probe.
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.664603 1753810 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"master2.16ad1629c8aad315", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"master2", UID:"master2", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientMemory", Message:"Node master2 status is now: NodeHasSufficientMemory", Source:v1.EventSource{Component:"kubelet", Host:"master2"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc05147ec1f14f315, ext:6995386723, loc:(*time.Location)(0x70ce0a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc05147ec1f14f315, ext:6995386723, loc:(*time.Location)(0x70ce0a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'etcdserver: request timed out' (will not retry!)
Oct 12 05:10:54 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:54.312096 1753810 kubelet_node_status.go:94] Unable to register node "master2" with API server: etcdserver: request timed out
Oct 12 05:10:55 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:55.154791 1753810 kubelet_node_status.go:72] Attempting to register node master2

The key errors are these:

Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.198627 1753810 desired_state_of_world_populator.go:338] Error processing volume "data" for pod "prometheus-0_monitoring(3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e)": error processing PVC monitoring/data-prometheus-0: failed to fetch PVC from API server: etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: E1012 05:10:53.245889 1753810 reconciler.go:603] updateStates in reconciler: could not get node status with error etcdserver: request timed out
Oct 12 05:10:53 e9c959da-2470-4c3d-967e-b99bed1f69ef-master-2 kubelet[1753810]: I1012 05:10:53.346910 1753810 reconciler.go:196] operationExecutor.UnmountVolume started for volume "59acf53e-fc5e-433e-839e-cd53ee0e7566" (UniqueName: "incloud/disk/59acf53e-fc5e-433e-839e-cd53ee0e7566") pod "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e" (UID: "3b1eb2d2-6a9c-4bc4-96ad-c46cc714490e")

They map to the following kubelet code (the numbers 335-345 are the source line numbers in desired_state_of_world_populator.go; line 338 is the klog.Errorf that produced the log message above):

// processPodVolumes processes the volumes in the given pod and adds them to the
// desired state of the world.
func (dswp *desiredStateOfWorldPopulator) processPodVolumes(
	...
	for _, podVolume := range pod.Spec.Volumes {
		if !mounts.Has(podVolume.Name) && !devices.Has(podVolume.Name) {
			// Volume is not used in the pod, ignore it.
			klog.V(4).Infof("Skipping unused volume %q for pod %q", podVolume.Name, format.Pod(pod))
			continue
		}

335		pvc, volumeSpec, volumeGidValue, err :=
336			dswp.createVolumeSpec(podVolume, pod, mounts, devices)
337		if err != nil {
338			klog.Errorf(
339				"Error processing volume %q for pod %q: %v",
340				podVolume.Name,
341				format.Pod(pod),
342				err)
343			dswp.desiredStateOfWorld.AddErrorToPod(uniquePodName, err.Error())
344			allVolumesAdded = false
345			continue
		}

		// Add volume to desired state of world
		_, err = dswp.desiredStateOfWorld.AddPodToVolume(
			uniquePodName, pod, volumeSpec, podVolume.Name, volumeGidValue)
		if err != nil {
			klog.Errorf(
				"Failed to add volume %s (specName: %s) for pod %q to desiredStateOfWorld: %v",
				podVolume.Name,
				volumeSpec.Name(),
				uniquePodName,
				err)
			dswp.desiredStateOfWorld.AddErrorToPod(uniquePodName, err.Error())
			allVolumesAdded = false
		} else {
			klog.V(4).Infof(
				"Added volume %q (volSpec=%q) for pod %q to desired state.",
				podVolume.Name,
				volumeSpec.Name(),
				uniquePodName)
		}
	...
}

// createVolumeSpec creates and returns a mutable volume.Spec object for the
// specified volume. It dereference any PVC to get PV objects, if needed.
// Returns an error if unable to obtain the volume at this time.
func (dswp *desiredStateOfWorldPopulator) createVolumeSpec(
	...
	pvc, err := dswp.getPVCExtractPV(
		pod.Namespace, pvcSource.ClaimName)
	if err != nil {
		return nil, nil, "", fmt.Errorf(
			"error processing PVC %s/%s: %v",
			pod.Namespace,
			pvcSource.ClaimName,
			err)
	}
	...

// getPVCExtractPV fetches the PVC object with the given namespace and name from
// the API server, checks whether PVC is being deleted, extracts the name of the PV
// it is pointing to and returns it.
// An error is returned if the PVC object's phase is not "Bound".
func (dswp *desiredStateOfWorldPopulator) getPVCExtractPV(
	namespace string, claimName string) (*v1.PersistentVolumeClaim, error) {
	pvc, err :=
		dswp.kubeClient.CoreV1().PersistentVolumeClaims(namespace).Get(context.TODO(), claimName, metav1.GetOptions{})
	if err != nil || pvc == nil {
		return nil, fmt.Errorf("failed to fetch PVC from API server: %v", err)
	}

	if utilfeature.DefaultFeatureGate.Enabled(features.StorageObjectInUseProtection) {
		// Pods that uses a PVC that is being deleted must not be started.
		//
		// In case an old kubelet is running without this check or some kubelets
		// have this feature disabled, the worst that can happen is that such
		// pod is scheduled. This was the default behavior in 1.8 and earlier
		// and users should not be that surprised.
		// It should happen only in very rare case when scheduler schedules
		// a pod and user deletes a PVC that's used by it at the same time.
		if pvc.ObjectMeta.DeletionTimestamp != nil {
			return nil, errors.New("PVC is being deleted")
		}
	}

	if pvc.Status.Phase != v1.ClaimBound {
		return nil, errors.New("PVC is not bound")
	}
	if pvc.Spec.VolumeName == "" {
		return nil, errors.New("PVC has empty pvc.Spec.VolumeName")
	}

	return pvc, nil
}

So the trigger on the populator side is clear: the PVC lookup against the apiserver failed, and the volume was therefore never written into desiredStateOfWorld. Now look at the reconciler logic:

func (rc *reconciler) reconcile() {
	// Unmounts are triggered before mounts so that a volume that was
	// referenced by a pod that was deleted and is now referenced by another
	// pod is unmounted from the first pod before being mounted to the new
	// pod.
	rc.unmountVolumes()

	// Next we mount required volumes. This function could also trigger
	// attach if kubelet is responsible for attaching volumes.
	// If underlying PVC was resized while in-use then this function also handles volume
	// resizing.
	rc.mountAttachVolumes()

	// Ensure devices that should be detached/unmounted are detached/unmounted.
	rc.unmountDetachDevices()
}

func (rc *reconciler) unmountVolumes() {
	// Ensure volumes that should be unmounted are unmounted.
	for _, mountedVolume := range rc.actualStateOfWorld.GetAllMountedVolumes() {
		if !rc.desiredStateOfWorld.PodExistsInVolume(mountedVolume.PodName, mountedVolume.VolumeName) {
			// Volume is mounted, unmount it
			klog.V(5).Infof(mountedVolume.GenerateMsgDetailed("Starting operationExecutor.UnmountVolume", ""))
			err := rc.operationExecutor.UnmountVolume(
				mountedVolume.MountedVolume, rc.actualStateOfWorld, rc.kubeletPodsDir)
			if err != nil &&
				!nestedpendingoperations.IsAlreadyExists(err) &&
				!exponentialbackoff.IsExponentialBackoff(err) {
				// Ignore nestedpendingoperations.IsAlreadyExists and exponentialbackoff.IsExponentialBackoff errors, they are expected.
				// Log all other errors.
				klog.Errorf(mountedVolume.GenerateErrorDetailed(fmt.Sprintf("operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled %v)", rc.controllerAttachDetachEnabled), err).Error())
			}
			if err == nil {
				klog.Infof(mountedVolume.GenerateMsgDetailed("operationExecutor.UnmountVolume started", ""))
			}
		}
	}
}

func (rc *reconciler) unmountDetachDevices() {
	for _, attachedVolume := range rc.actualStateOfWorld.GetUnmountedVolumes() {
		// Check IsOperationPending to avoid marking a volume as detached if it's in the process of mounting.
		if !rc.desiredStateOfWorld.VolumeExists(attachedVolume.VolumeName) &&
			!rc.operationExecutor.IsOperationPending(attachedVolume.VolumeName, nestedpendingoperations.EmptyUniquePodName, nestedpendingoperations.EmptyNodeName) {
			if attachedVolume.DeviceMayBeMounted() {
				// Volume is globally mounted to device, unmount it
				klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Starting operationExecutor.UnmountDevice", ""))
				err := rc.operationExecutor.UnmountDevice(
					attachedVolume.AttachedVolume, rc.actualStateOfWorld, rc.hostutil)
				if err != nil &&
					!nestedpendingoperations.IsAlreadyExists(err) &&
					!exponentialbackoff.IsExponentialBackoff(err) {
					// Ignore nestedpendingoperations.IsAlreadyExists and exponentialbackoff.IsExponentialBackoff errors, they are expected.
					// Log all other errors.
					klog.Errorf(attachedVolume.GenerateErrorDetailed(fmt.Sprintf("operationExecutor.UnmountDevice failed (controllerAttachDetachEnabled %v)", rc.controllerAttachDetachEnabled), err).Error())
				}
				if err == nil {
					klog.Infof(attachedVolume.GenerateMsgDetailed("operationExecutor.UnmountDevice started", ""))
				}
			} else {
				// Volume is attached to node, detach it
				// Kubelet not responsible for detaching or this volume has a non-attachable volume plugin.
				if rc.controllerAttachDetachEnabled || !attachedVolume.PluginIsAttachable {
					rc.actualStateOfWorld.MarkVolumeAsDetached(attachedVolume.VolumeName, attachedVolume.NodeName)
					klog.Infof(attachedVolume.GenerateMsgDetailed("Volume detached", fmt.Sprintf("DevicePath %q", attachedVolume.DevicePath)))
				} else {
					// Only detach if kubelet detach is enabled
					klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Starting operationExecutor.DetachVolume", ""))
					err := rc.operationExecutor.DetachVolume(
						attachedVolume.AttachedVolume, false /* verifySafeToDetach */, rc.actualStateOfWorld)
					if err != nil &&
						!nestedpendingoperations.IsAlreadyExists(err) &&
						!exponentialbackoff.IsExponentialBackoff(err) {
						// Ignore nestedpendingoperations.IsAlreadyExists && exponentialbackoff.IsExponentialBackoff errors, they are expected.
						// Log all other errors.
						klog.Errorf(attachedVolume.GenerateErrorDetailed(fmt.Sprintf("operationExecutor.DetachVolume failed (controllerAttachDetachEnabled %v)", rc.controllerAttachDetachEnabled), err).Error())
					}
					if err == nil {
						klog.Infof(attachedVolume.GenerateMsgDetailed("operationExecutor.DetachVolume started", ""))
					}
				}
			}
		}
	}
}

The reconciler iterates over actualStateOfWorld; any volume found there whose pod is missing from desiredStateOfWorld triggers UnmountVolume (and subsequently UnmountDevice/DetachVolume), which is exactly what unmounted the disk.
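To make the failure mode concrete, here is a minimal, self-contained toy model (illustrative code only, not kubelet source) using the pod and volume names from the logs above:

// Toy model of the reconciler decision that caused the unmount: a volume that is
// present in actualStateOfWorld but whose pod entry is missing from
// desiredStateOfWorld (here because the PVC could not be fetched from the
// apiserver) is selected for unmounting.
package main

import "fmt"

type podVolume struct{ pod, volume string }

func main() {
	// actualStateOfWorld: the prometheus data volume is currently mounted for the pod.
	actual := map[podVolume]bool{
		{pod: "prometheus-0", volume: "incloud/disk/59acf53e-fc5e-433e-839e-cd53ee0e7566"}: true,
	}

	// desiredStateOfWorld: stays empty for this pod, because the populator hit
	// "failed to fetch PVC from API server: etcdserver: request timed out".
	desired := map[podVolume]bool{}

	for pv := range actual {
		if !desired[pv] { // analogous to PodExistsInVolume(...) == false in unmountVolumes()
			fmt.Printf("would start UnmountVolume for %s (pod %s)\n", pv.volume, pv.pod)
		}
	}
}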

3. Root-Cause Recap

Putting the whole sequence together:

  1. Under the stress test Ceph was overloaded, the cluster's etcd became slow to respond, and reads through the apiserver succeeded only intermittently.
  2. kubelet restarted and re-ran its volume setup; during reconciliation it first scanned the disks already mounted on the node and recorded them in actualStateOfWorld.
  3. It then tried to fetch from the apiserver the disks that should be mounted in order to populate desiredStateOfWorld; at that moment the apiserver was unreachable, the PVC lookup failed, and the volume was never added to desiredStateOfWorld.
  4. The reconciler carried on, found a volume present in actualStateOfWorld but absent from desiredStateOfWorld, and triggered unmount/UnmountVolume, which ultimately detached the disk.
  5. Once the apiserver recovered, the PVC information was written into desiredStateOfWorld again, so attach, MountVolume, and mount ran and the disk was remounted.
  6. Throughout all of this the pod using the PVC never restarted, and after the remount the device name had changed, so the device seen inside the pod no longer matched the one on the node and the mounted directory could not be accessed.

4. Fix

Only the general idea: patch the kubelet source to special-case the situation where a PVC cannot be fetched because of an etcd error:

  1. In the desired_state_of_world_populator, when a pod's PVC cannot be read because of such an etcd error, mark that pod.
  2. In the reconciler, if a pod carries this etcd-error mark, simply skip it for now; the unmount flow is then never triggered. (A toy sketch of this idea follows below.)
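Extending the toy model above, a minimal sketch of this idea (the pvcFetchFailed marking is hypothetical and exists only for illustration; an actual patch would add such state to the populator and reconciler structs and handle locking and error classification properly):

// Toy sketch of the proposed fix (not kubelet code): pods whose PVC lookup failed
// with an etcd/apiserver error are marked, and the reconcile loop skips them, so a
// transient apiserver failure never looks like "this pod no longer wants the volume".
package main

import "fmt"

type podVolume struct{ pod, volume string }

func main() {
	actual := map[podVolume]bool{
		{pod: "prometheus-0", volume: "incloud/disk/59acf53e-fc5e-433e-839e-cd53ee0e7566"}: true,
	}
	desired := map[podVolume]bool{} // populate failed: the PVC fetch returned an etcd error

	// Step 1 (populator side): remember which pods failed because of an apiserver/etcd error.
	pvcFetchFailed := map[string]bool{"prometheus-0": true} // hypothetical marking

	// Step 2 (reconciler side): skip marked pods instead of unmounting their volumes.
	for pv := range actual {
		if pvcFetchFailed[pv.pod] {
			fmt.Printf("skipping pod %s: PVC state unknown (transient apiserver error)\n", pv.pod)
			continue
		}
		if !desired[pv] {
			fmt.Printf("would start UnmountVolume for %s (pod %s)\n", pv.volume, pv.pod)
		}
	}
}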
All of the above is just my own understanding; corrections and discussion are very welcome!