Contents

Abstract:

Handling in the k8s source:

Graceful deletion of sts pods

Terminating duration

Time components of graceful delete


Abstract:

For a k8s StatefulSet, when the node hosting a pod goes NotReady, the pod is not rescheduled automatically and sits in Terminating indefinitely. This article analyzes why.

References:

Kubernetes StatefulSet 与 Deployment 在 Node NotReady代码逻辑区别 | 码农网

Kubernetes API 机制: 对象删除 - 知乎

The interesting parts of that article are these:

statefulset: a statefulset pod moves from Unknown to Terminating, goes through graceful deletion, detaches its PV, and is then rescheduled and recreated.
daemonset: a daemonset pod jumps from NodeLost straight back to Running; no recreation is involved.
Why is the Statefulset pod not recreated?
Two questions naturally follow: why is the statefulset pod not recreated, and how do we keep a single-replica statefulset highly available?
To explain the lack of recreation, first a brief look at the statefulset controller's logic.
The Statefulset controller coordinates two modules, StatefulSetControl and StatefulPodControl, to handle state management (StatefulSetStatusUpdater) and scale control (StatefulPodControl) for statefulSet workloads. In practice, StatefulsetControl calls into StatefulPodControl to create, delete, and update Pods.
With the default podManagementPolicy of OrderedReady, a StatefulSet creates Pods one by one in monotonically increasing ordinal order; with Parallel, the ordinals are still integers, but the Pods are scheduled and created simultaneously.
The concrete logic lives in the core method UpdateStatefulSet, shown in the figure:

The reason the Stateful Pod we observe stays in Unknown is that this controller blocks further operations on it. As covered in the first section, the NodeController's Pod Eviction mechanism has already marked the Pod for deletion: DeletionTimestamp is set on the PodObj, the StatefulSet Controller's IsTerminating check matches, and the code simply returns.
 

Analysis of the handling in the k8s source:

Function signature:

// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
	ctx context.Context,
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error);

The function is too long to show in full; here is the key code:


	// Examine each replica with respect to its ordinal
	for i := range replicas {
		// delete and recreate failed pods
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
		// If we find a Pod that has not been created we create the Pod
		if !isCreated(replicas[i]) {
			if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
				if isStale, err := ssc.podControl.PodClaimIsStale(set, replicas[i]); err != nil {
					return &status, err
				} else if isStale {
					// If a pod has a stale PVC, no more work can be done this round.
					return &status, err
				}
			}
			if err := ssc.podControl.CreateStatefulPod(ctx, set, replicas[i]); err != nil {
				return &status, err
			}
			status.Replicas++
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}
			// if the set does not allow bursting, return immediately
			if monotonic {
				return &status, nil
			}
			// pod created, no more work possible for this round
			continue
		}
		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not running and ready we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Running and Ready.
		if !isRunningAndReady(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not available we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Available.
		// TODO: Since available is superset of Ready, once we have this featuregate enabled by default, we can remove the
		// isRunningAndReady block as only Available pods should be brought down.
		if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetMinReadySeconds) && !isRunningAndAvailable(replicas[i], set.Spec.MinReadySeconds) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Available",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// Enforce the StatefulSet invariants
		retentionMatch := true
		if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
			var err error
			retentionMatch, err = ssc.podControl.ClaimsMatchRetentionPolicy(updateSet, replicas[i])
			// An error is expected if the pod is not yet fully updated, and so return is treated as matching.
			if err != nil {
				retentionMatch = true
			}
		}
		if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) && retentionMatch {
			continue
		}
		// Make a deep copy so we don't mutate the shared cache
		replica := replicas[i].DeepCopy()
		if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
			return &status, err
		}
	}

Still too long; narrowing further to the key handling:

		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}

// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

If a pod is Terminating, the controller just keeps waiting for its graceful deletion to complete.

Graceful deletion of sts pods

Reference:

pod graceful deletion流程中的 terminating · 大专栏

The content is reproduced below so it is not lost:

Terminating is a stage commonly seen when deleting a pod with kubectl, but it is not a state of the k8s Pod object itself. This article answers two questions:

  • What does the Terminating stage mean?
  • Why do some pods delete with hardly any visible Terminating stage, while others sit in Terminating for a long time?

First, from this piece of kubectl code (pkg/printers/internalversion/printers.go) we can see that Terminating appears because the pod's DeletionTimestamp field is non-nil.

// AddHandlers adds print handlers for default Kubernetes types dealing with internal versions.
func AddHandlers(h *printers.HumanReadablePrinter) {
    h.Handler(podColumns, podWideColumns, printPodList)
    h.Handler(podColumns, podWideColumns, printPod)
    h.Handler(podTemplateColumns, nil, printPodTemplate)
    h.Handler(podTemplateColumns, nil, printPodTemplateList)


func printPod(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    if err := printPodBase(pod, w, options); err != nil {
        return err
    }

    return nil
}

func printPodBase(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    .......
    } else if pod.DeletionTimestamp != nil {
        reason = "Terminating"
    }

The DeletionTimestamp field (pkg/api/types.go) is documented as the timestamp at which the Pod will be deleted; it exists mainly to support the pod's graceful termination (if no grace period is specified, k8s defaults to 30s).

// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This
// field is set by the server when a graceful deletion is requested by the user, and is not
// directly settable by a client. The resource is expected to be deleted (no longer visible
// from resource lists, and not reachable by name) after the time in this field. Once set,
// this value may not be unset or be set further into the future, although it may be shortened
// or the resource may be deleted prior to this time. For example, a user may request that
// a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination
// signal to the containers in the pod. After that 30 seconds, the Kubelet will send a hard
// termination signal (SIGKILL) to the container and after cleanup, remove the pod from the
// API. In the presence of network partitions, this object may still exist after this
// timestamp, until an administrator or automated process can determine the resource is
// fully terminated.
// If not set, graceful deletion of the object has not been requested.
//
// Populated by the system when a graceful deletion is requested.
// Read-only.
// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
// +optional
DeletionTimestamp *metav1.Time

The DeletionTimestamp is set when the pod is first deleted (apiserver/pkg/registry/rest/delete.go):

// BeforeDelete tests whether the object can be gracefully deleted. If graceful is set the object
// should be gracefully deleted, if gracefulPending is set the object has already been gracefully deleted
// (and the provided grace period is longer than the time to deletion), and an error is returned if the
// condition cannot be checked or the gracePeriodSeconds is invalid. The options argument may be updated with
// default values if graceful is true. Second place where we set deletionTimestamp is pkg/registry/generic/registry/store.go
// this function is responsible for setting deletionTimestamp during gracefulDeletion, other one for cascading deletions.
func BeforeDelete(strategy RESTDeleteStrategy, ctx genericapirequest.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
        ......
        now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
        objectMeta.SetDeletionTimestamp(&now)
        objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)

apiserver/pkg/registry/generic/registry/store.go

// Delete removes the item from storage.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ......
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)

Terminating duration

The other question: why are some pods deleted almost instantly, while others wait in Terminating for a while?

Deleting a pod in k8s involves running the preStop handler, removing the pod's containers, and so on. Docker removes a container in two steps, driven by two signals, SIGTERM and SIGKILL. In the graceful termination flow, k8s first sends SIGTERM to the containers in the pod, then sends SIGKILL after gracePeriodSeconds has elapsed.

Setting the preStop handler aside, the difference comes down to how the container handles SIGTERM: if PID 1 in the container ignores SIGTERM, or does not exit on receiving it, the container waits out the full gracePeriod until SIGKILL ends it, leaving the Pod in Terminating for that entire time.

Time components of graceful delete

A pod's graceful deletion involves the following pieces of time:

  • preStop: the handler a pod runs before it terminates. If preStop runs up against the gracePeriod (30s by default), its execution is cut off. preStop execution time counts against the gracePeriod.
  • minimumGracePeriodInSeconds: if preStop exhausts the gracePeriod so that less than minimumGracePeriodInSeconds remains, the remaining grace is extended to minimumGracePeriodInSeconds, which defaults to 2s.
  • docker StopContainer: k8s passes the remaining time as docker's graceful period when invoking docker's stop-container flow.

From the flow above, a k8s pod delete can by default take up to 32 seconds (a fully consumed 30s gracePeriod plus the 2s minimum).
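The arithmetic above can be sketched as follows; the function and parameter names are illustrative assumptions, not the kubelet's actual identifiers:

```go
package main

import "fmt"

// effectiveStopTimeout models the time handed to docker StopContainer:
// whatever remains of gracePeriod after preStop ran, floored at
// minimumGracePeriod. All values are in seconds.
func effectiveStopTimeout(gracePeriod, preStopElapsed, minimumGracePeriod int64) int64 {
	remaining := gracePeriod - preStopElapsed
	if remaining < minimumGracePeriod {
		remaining = minimumGracePeriod
	}
	return remaining
}

func main() {
	// preStop consumed the full default 30s: the 2s minimum remains,
	// giving the 30 + 2 = 32 second worst case.
	fmt.Println(effectiveStopTimeout(30, 30, 2)) // 2
	// preStop finished in 10s: docker gets the remaining 20s.
	fmt.Println(effectiveStopTimeout(30, 10, 2)) // 20
}
```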
