2022-01-19 Analysis: a pod of a k8s StatefulSet whose node is NotReady is not rescheduled automatically and stays in Terminating
Abstract:
A pod of a k8s StatefulSet whose node is NotReady is not rescheduled automatically and stays in Terminating; this post analyzes why.
Reference:
Kubernetes StatefulSet vs Deployment: code-logic differences when a Node is NotReady | 码农网
The interesting points in that article are these:
statefulset: the statefulset pod goes from the Unknown state to the Terminating state, performs graceful deletion, detaches its PV, and is then rescheduled and recreated.
daemonset: the daemonset pod goes from NodeLost straight back to Running; no recreation is involved.
Why is the StatefulSet pod not recreated?
Two questions naturally come up: why is the statefulset pod not recreated, and how do you keep a single-replica statefulset highly available?
On why it is not recreated, first a brief overview of the statefulset controller logic.
The StatefulSet controller coordinates two modules, StatefulSetControl and StatefulPodControl, to manage the status (StatefulSetStatusUpdater) and the scaling (StatefulPodControl) of StatefulSet workloads. In practice, StatefulSetControl calls into StatefulPodControl to create, update, and delete Pods.
With the default podManagementPolicy of OrderedReady, a StatefulSet creates Pods one at a time in monotonically increasing ordinal order; with Parallel, the ordinals are still integers but the Pods are scheduled and created concurrently.
The concrete logic lives in the core method UpdateStatefulSet (the referenced article shows it as a flow diagram, omitted here). The reason the StatefulSet pod we observe stays in Unknown is that this controller deliberately skips any operation on it: as the first section of that article explains, the NodeController's pod-eviction mechanism has already marked the pod for deletion, so its DeletionTimestamp is set; the StatefulSet controller's isTerminating check matches, and the reconcile loop simply returns.
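To illustrate the two policies, here is a minimal manifest sketch (names and images are placeholders, not from the article); only the podManagementPolicy field changes the behavior described above:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web              # example name, not from the article
spec:
  serviceName: web
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
```

With OrderedReady, web-1 is not created until web-0 is Running and Ready; with Parallel, all three pods are created at once.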
How the k8s source handles this:
Function signature:
// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
    ctx context.Context,
    set *apps.StatefulSet,
    currentRevision *apps.ControllerRevision,
    updateRevision *apps.ControllerRevision,
    collisionCount int32,
    pods []*v1.Pod) (*apps.StatefulSetStatus, error)
The function is long, so only the key code is excerpted:
// Examine each replica with respect to its ordinal
for i := range replicas {
    // delete and recreate failed pods
    if isFailed(replicas[i]) {
        ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
            "StatefulSet %s/%s is recreating failed Pod %s",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
            return &status, err
        }
        if getPodRevision(replicas[i]) == currentRevision.Name {
            status.CurrentReplicas--
        }
        if getPodRevision(replicas[i]) == updateRevision.Name {
            status.UpdatedReplicas--
        }
        status.Replicas--
        replicas[i] = newVersionedStatefulSetPod(
            currentSet,
            updateSet,
            currentRevision.Name,
            updateRevision.Name,
            i)
    }
    // If we find a Pod that has not been created we create the Pod
    if !isCreated(replicas[i]) {
        if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
            if isStale, err := ssc.podControl.PodClaimIsStale(set, replicas[i]); err != nil {
                return &status, err
            } else if isStale {
                // If a pod has a stale PVC, no more work can be done this round.
                return &status, err
            }
        }
        if err := ssc.podControl.CreateStatefulPod(ctx, set, replicas[i]); err != nil {
            return &status, err
        }
        status.Replicas++
        if getPodRevision(replicas[i]) == currentRevision.Name {
            status.CurrentReplicas++
        }
        if getPodRevision(replicas[i]) == updateRevision.Name {
            status.UpdatedReplicas++
        }
        // if the set does not allow bursting, return immediately
        if monotonic {
            return &status, nil
        }
        // pod created, no more work possible for this round
        continue
    }
    // If we find a Pod that is currently terminating, we must wait until graceful deletion
    // completes before we continue to make progress.
    if isTerminating(replicas[i]) && monotonic {
        klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to Terminate",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        return &status, nil
    }
    // If we have a Pod that has been created but is not running and ready we can not make progress.
    // We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
    // ordinal, are Running and Ready.
    if !isRunningAndReady(replicas[i]) && monotonic {
        klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        return &status, nil
    }
    // If we have a Pod that has been created but is not available we can not make progress.
    // We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
    // ordinal, are Available.
    // TODO: Since available is superset of Ready, once we have this featuregate enabled by default, we can remove the
    // isRunningAndReady block as only Available pods should be brought down.
    if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetMinReadySeconds) && !isRunningAndAvailable(replicas[i], set.Spec.MinReadySeconds) && monotonic {
        klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to be Available",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        return &status, nil
    }
    // Enforce the StatefulSet invariants
    retentionMatch := true
    if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
        var err error
        retentionMatch, err = ssc.podControl.ClaimsMatchRetentionPolicy(updateSet, replicas[i])
        // An error is expected if the pod is not yet fully updated, and so return is treated as matching.
        if err != nil {
            retentionMatch = true
        }
    }
    if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) && retentionMatch {
        continue
    }
    // Make a deep copy so we don't mutate the shared cache
    replica := replicas[i].DeepCopy()
    if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
        return &status, err
    }
}
Still too long; the key handling, excerpted again:
// If we find a Pod that is currently terminating, we must wait until graceful deletion
// completes before we continue to make progress.
if isTerminating(replicas[i]) && monotonic {
    klog.V(4).Infof(
        "StatefulSet %s/%s is waiting for Pod %s to Terminate",
        set.Namespace,
        set.Name,
        replicas[i].Name)
    return &status, nil
}

// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
    return pod.DeletionTimestamp != nil
}
If a pod is in Terminating, the controller simply keeps waiting for its graceful deletion to complete.
Graceful deletion of a StatefulSet pod
Reference:
pod graceful deletion流程中的 terminating · 大专栏
Recording the content here so it is not lost:
Terminating is a step commonly seen when deleting a pod with kubectl, but it is not a state of the k8s Pod object itself. That article answers two questions:
- What does the Terminating step mean?
- Why do some pods delete almost instantly while others sit in Terminating for a long time?
First, from this kubectl printing code (printers/internalversion/printers.go) we can see that Terminating is displayed because the pod's DeletionTimestamp field is non-nil.
// AddHandlers adds print handlers for default Kubernetes types dealing with internal versions.
func AddHandlers(h *printers.HumanReadablePrinter) {
    h.Handler(podColumns, podWideColumns, printPodList)
    h.Handler(podColumns, podWideColumns, printPod)
    h.Handler(podTemplateColumns, nil, printPodTemplate)
    h.Handler(podTemplateColumns, nil, printPodTemplateList)
    ......
}

func printPod(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    if err := printPodBase(pod, w, options); err != nil {
        return err
    }
    return nil
}

func printPodBase(pod *api.Pod, w io.Writer, options printers.PrintOptions) error {
    .......
    } else if pod.DeletionTimestamp != nil {
        reason = "Terminating"
    }
    ......
}
The DeletionTimestamp field (api/types.go) is documented as the timestamp at which the resource will be deleted; it exists mainly to support the pod's graceful-termination behavior (if no grace period is set, k8s defaults to 30s).
// DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This
// field is set by the server when a graceful deletion is requested by the user, and is not
// directly settable by a client. The resource is expected to be deleted (no longer visible
// from resource lists, and not reachable by name) after the time in this field. Once set,
// this value may not be unset or be set further into the future, although it may be shortened
// or the resource may be deleted prior to this time. For example, a user may request that
// a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination
// signal to the containers in the pod. After that 30 seconds, the Kubelet will send a hard
// termination signal (SIGKILL) to the container and after cleanup, remove the pod from the
// API. In the presence of network partitions, this object may still exist after this
// timestamp, until an administrator or automated process can determine the resource is
// fully terminated.
// If not set, graceful deletion of the object has not been requested.
//
// Populated by the system when a graceful deletion is requested.
// Read-only.
// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
// +optional
DeletionTimestamp *metav1.Time
The DeletionTimestamp is set when the pod is first deleted (apiserver/pkg/registry/rest/delete.go):
// BeforeDelete tests whether the object can be gracefully deleted. If graceful is set the object
// should be gracefully deleted, if gracefulPending is set the object has already been gracefully deleted
// (and the provided grace period is longer than the time to deletion), and an error is returned if the
// condition cannot be checked or the gracePeriodSeconds is invalid. The options argument may be updated with
// default values if graceful is true.
// Second place where we set deletionTimestamp is pkg/registry/generic/registry/store.go;
// this function is responsible for setting deletionTimestamp during gracefulDeletion, other one for cascading deletions.
func BeforeDelete(strategy RESTDeleteStrategy, ctx genericapirequest.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
    ......
    now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
    objectMeta.SetDeletionTimestamp(&now)
    objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)
    ......
}
apiserver/pkg/registry/generic/registry/store.go:
// Delete removes the item from storage.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ......
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ......
}
Time spent in Terminating
The other question is why some pods are deleted immediately while others wait in Terminating for a while.
Deleting a pod in k8s involves running the preStop handler, deleting the pod's containers, and so on. Docker deletes a container in two steps, driven by two signals, SIGTERM and SIGKILL. In the graceful-termination flow, k8s first sends SIGTERM to the containers in the pod, waits gracePeriodSeconds, and then sends SIGKILL.
Leaving the preStop handler aside, the difference comes down to how the container handles SIGTERM: if PID 1 in the container ignores SIGTERM, or does not exit after receiving it, the container keeps running until the gracePeriod expires and SIGKILL ends it, which leaves the pod in Terminating for that whole time.
Components of the graceful-delete time
The graceful deletion of a pod involves the following pieces of time:
- preStop: the handler the pod runs before shutting down. If its execution reaches the gracePeriod (30s by default), the preStop program is killed. preStop execution time counts against the gracePeriod.
- minimumGracePeriodInSeconds: if preStop exhausts the gracePeriod so that the remainder falls below minimumGracePeriodInSeconds, the gracePeriod is extended to minimumGracePeriodInSeconds. It defaults to 2s.
- docker StopContainer: k8s passes the remaining time as docker's graceful period when calling docker's stop-container flow.
From the flow above, the default worst-case pod deletion time in k8s can reach 32 seconds.
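The timing rules above can be sketched as follows (the function and parameter names are mine for illustration, not the kubelet's):

```go
package main

import "fmt"

// containerStopTimeout sketches the budget arithmetic described above:
// preStop consumes the grace period, and the container-stop call still
// gets at least minimumGracePeriod seconds.
func containerStopTimeout(gracePeriod, preStopElapsed, minimumGracePeriod int64) int64 {
	remaining := gracePeriod - preStopElapsed
	if remaining < minimumGracePeriod {
		remaining = minimumGracePeriod
	}
	return remaining
}

func main() {
	// Worst case: preStop eats the whole 30s budget, and StopContainer
	// still gets the 2s floor, so deletion can take 30 + 2 = 32 seconds.
	fmt.Println(30 + containerStopTimeout(30, 30, 2)) // prints "32"
}
```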