Contents

Abstract:

Handling in the k8s source:

Graceful deletion of sts pods

Terminating duration

Time components of graceful delete


Abstract:

For a k8s StatefulSet, when the node hosting a pod goes NotReady, the pod is not rescheduled automatically and sits in Terminating indefinitely. This article analyzes why.

References:

Kubernetes StatefulSet 与 Deployment 在 Node NotReady代码逻辑区别 | 码农网

Kubernetes API 机制: 对象删除 - 知乎

The interesting parts of that article are these:

statefulset: a statefulset pod moves from Unknown to Terminating, goes through graceful deletion, detaches its PV, and is then rescheduled and recreated.
daemonset: a daemonset pod jumps from NodeLost straight back to Running; no recreation is involved.
Why is the Statefulset pod not recreated?
Two questions naturally follow: why is the statefulset pod not recreated, and how do we keep a single-replica statefulset highly available?
To explain the lack of recreation, first a brief look at the statefulset controller's logic.
The Statefulset controller coordinates two modules, StatefulSetControl and StatefulPodControl, to handle state management (StatefulSetStatusUpdater) and scale control (StatefulPodControl) for statefulSet workloads. In practice, StatefulsetControl calls into StatefulPodControl to create, delete, and update Pods.
With the default podManagementPolicy of OrderedReady, a StatefulSet creates Pods one by one in monotonically increasing ordinal order; with Parallel, the ordinals are still integers, but the Pods are scheduled and created simultaneously.
The concrete logic lives in the core method UpdateStatefulSet, shown in the figure:

The reason the Stateful Pod we observe stays in Unknown is that this controller blocks further operations on it. As covered in the first section, the NodeController's Pod Eviction mechanism has already marked the Pod for deletion: DeletionTimestamp is set on the PodObj, the StatefulSet Controller's IsTerminating check matches, and the code simply returns.
 

Analysis of the handling in the k8s source:

Function signature:

// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
	ctx context.Context,
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error);

The function is too long to show in full; here is the key code:


	// Examine each replica with respect to its ordinal
	for i := range replicas {
		// delete and recreate failed pods
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
		// If we find a Pod that has not been created we create the Pod
		if !isCreated(replicas[i]) {
			if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
				if isStale, err := ssc.podControl.PodClaimIsStale(set, replicas[i]); err != nil {
					return &status, err
				} else if isStale {
					// If a pod has a stale PVC, no more work can be done this round.
					return &status, err
				}
			}
			if err := ssc.podControl.CreateStatefulPod(ctx, set, replicas[i]); err != nil {
				return &status, err
			}
			status.Replicas++
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}
			// if the set does not allow bursting, return immediately
			if monotonic {
				return &status, nil
			}
			// pod created, no more work possible for this round
			continue
		}
		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not running and ready we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Running and Ready.
		if !isRunningAndReady(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not available we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Available.
		// TODO: Since available is superset of Ready, once we have this featuregate enabled by default, we can remove the
		// isRunningAndReady block as only Available pods should be brought down.
		if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetMinReadySeconds) && !isRunningAndAvailable(replicas[i], set.Spec.MinReadySeconds) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Available",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// Enforce the StatefulSet invariants
		retentionMatch := true
		if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
			var err error
			retentionMatch, err = ssc.podControl.ClaimsMatchRetentionPolicy(updateSet, replicas[i])
			// An error is expected if the pod is not yet fully updated, and so return is treated as matching.
			if err != nil {
				retentionMatch = true
			}
		}
		if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) && retentionMatch {
			continue
		}
		// Make a deep copy so we don't mutate the shared cache
		replica := replicas[i].DeepCopy()
		if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
			return &status, err
		}
	}

Still too long; narrowing further to the key handling:

		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}

// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

If a pod is Terminating, the controller just keeps waiting for its graceful deletion to complete.

Graceful deletion of sts pods

Reference:

pod graceful deletion流程中的 terminating · 大专栏

The content is reproduced below so it is not lost:

Terminating is a stage commonly seen when deleting a pod with kubectl, but it is not a state of the k8s Pod object itself. This article answers two questions:

  • What does the Terminating stage mean?
  • Why do some pods delete with hardly any visible Terminating stage, while others sit in Terminating for a long time?

First, from this piece of kubectl code (pkg/printers/internalversion/printers.go) we can see that Terminating appears because the pod's DeletionTimestamp field is non-nil.

// AddHandlers adds print handlers for default Kubernetes types dealing with internal versions.
func AddHandlers(h *printers.HumanReadablePrinter) {
    h.Handler(podColumns, podWideColumns, printPodList)
    h.Handler(podColumns, podWideColumns, printPod)
    h.Handler(podTemplateColumns, nil, printPodTemplate)
    h.Handler(podTemplateColumns, nil, printPodTemplateList)


func printPod(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    if err := printPodBase(pod, w, options); err != nil {
        return err
    }

    return nil
}

func printPodBase(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    .......
    } else if pod.DeletionTimestamp != nil {
        reason = "Terminating"
    }

The DeletionTimestamp field (pkg/api/types.go) is documented as the timestamp at which the Pod will be deleted; it exists mainly to support the pod's graceful termination (if no grace period is specified, k8s defaults to 30s).

// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This
// field is set by the server when a graceful deletion is requested by the user, and is not
// directly settable by a client. The resource is expected to be deleted (no longer visible
// from resource lists, and not reachable by name) after the time in this field. Once set,
// this value may not be unset or be set further into the future, although it may be shortened
// or the resource may be deleted prior to this time. For example, a user may request that
// a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination
// signal to the containers in the pod. After that 30 seconds, the Kubelet will send a hard
// termination signal (SIGKILL) to the container and after cleanup, remove the pod from the
// API. In the presence of network partitions, this object may still exist after this
// timestamp, until an administrator or automated process can determine the resource is
// fully terminated.
// If not set, graceful deletion of the object has not been requested.
//
// Populated by the system when a graceful deletion is requested.
// Read-only.
// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
// +optional
DeletionTimestamp *metav1.Time

The DeletionTimestamp is set when the pod is first deleted (apiserver/pkg/registry/rest/delete.go):

// BeforeDelete tests whether the object can be gracefully deleted. If graceful is set the object
// should be gracefully deleted, if gracefulPending is set the object has already been gracefully deleted
// (and the provided grace period is longer than the time to deletion), and an error is returned if the
// condition cannot be checked or the gracePeriodSeconds is invalid. The options argument may be updated with
// default values if graceful is true. Second place where we set deletionTimestamp is pkg/registry/generic/registry/store.go
// this function is responsible for setting deletionTimestamp during gracefulDeletion, other one for cascading deletions.
func BeforeDelete(strategy RESTDeleteStrategy, ctx genericapirequest.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
        ......
        now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
        objectMeta.SetDeletionTimestamp(&now)
        objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)

apiserver/pkg/registry/generic/registry/store.go

// Delete removes the item from storage.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ......
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)

Terminating duration

The other question: why are some pods deleted almost instantly, while others wait in Terminating for a while?

Deleting a pod in k8s involves running the preStop handler, removing the pod's containers, and so on. Docker removes a container in two steps, driven by two signals, SIGTERM and SIGKILL. In the graceful termination flow, k8s first sends SIGTERM to the containers in the pod, then sends SIGKILL after gracePeriodSeconds has elapsed.

Setting the preStop handler aside, the difference comes down to how the container handles SIGTERM: if PID 1 in the container ignores SIGTERM, or does not exit on receiving it, the container waits out the full gracePeriod until SIGKILL ends it, leaving the Pod in Terminating for that entire time.

Time components of graceful delete

A pod's graceful deletion involves the following pieces of time:

  • preStop: the handler a pod runs before it terminates. If preStop runs up against the gracePeriod (30s by default), its execution is cut off. preStop execution time counts against the gracePeriod.
  • minimumGracePeriodInSeconds: if preStop exhausts the gracePeriod so that less than minimumGracePeriodInSeconds remains, the remaining grace is extended to minimumGracePeriodInSeconds, which defaults to 2s.
  • docker StopContainer: k8s passes the remaining time as docker's graceful period when invoking docker's stop-container flow.

From the flow above, a k8s pod delete can by default take up to 32 seconds (a fully consumed 30s gracePeriod plus the 2s minimum).
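The arithmetic above can be sketched as follows; the function and parameter names are illustrative assumptions, not the kubelet's actual identifiers:

```go
package main

import "fmt"

// effectiveStopTimeout models the time handed to docker StopContainer:
// whatever remains of gracePeriod after preStop ran, floored at
// minimumGracePeriod. All values are in seconds.
func effectiveStopTimeout(gracePeriod, preStopElapsed, minimumGracePeriod int64) int64 {
	remaining := gracePeriod - preStopElapsed
	if remaining < minimumGracePeriod {
		remaining = minimumGracePeriod
	}
	return remaining
}

func main() {
	// preStop consumed the full default 30s: the 2s minimum remains,
	// giving the 30 + 2 = 32 second worst case.
	fmt.Println(effectiveStopTimeout(30, 30, 2)) // 2
	// preStop finished in 10s: docker gets the remaining 20s.
	fmt.Println(effectiveStopTimeout(30, 10, 2)) // 20
}
```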
