statefulset controller analysis

statefulset overview

A statefulset is the Kubernetes object for managing stateful applications, whereas a deployment is used for stateless applications.

Stateful pods differ from stateless pods in that they sometimes need to be addressed by their hostnames. Stateless pods are all interchangeable, so picking any one at random is fine; stateful pods, on the other hand, are each distinct, and you usually want to operate on one specific pod.

A statefulset suits applications that need the following properties (a minimal example object is sketched right after this list):
(1) Stable network identity (hostname): after a pod is rescheduled, its PodName and HostName stay the same;
(2) Stable persistent storage: backed by PVCs, a pod can still access the same persistent data after being rescheduled;
(3) Stable, ordered creation and scale-up: pods are created in ascending ordinal order (i.e. from 0 to N-1), and before the next pod is created, all pods before it must be Running and Ready;
(4) Stable, ordered deletion and scale-down: pods are deleted in descending ordinal order (i.e. from N-1 to 0), and before the next pod is terminated and deleted, all pods processed before it (those with higher ordinals) must already be gone;
(5) Stable, ordered rolling update: pods are updated in descending ordinal order (i.e. from N-1 to 0), each one deleted first and then recreated, and the pod at the current ordinal must be recreated and Ready before the next pod is updated.
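
To make these properties concrete, below is a minimal sketch of a statefulset object built with the Go API types. It is an illustration only: the names web and www, the nginx image and the field values are hypothetical placeholders, not something taken from the controller source.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newExampleStatefulSet builds a hypothetical statefulset that exercises the
// properties listed above: stable names, per-pod PVCs, ordered management and
// a rolling update strategy with a Partition.
func newExampleStatefulSet() *appsv1.StatefulSet {
	replicas := int32(3)
	partition := int32(0)
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: appsv1.StatefulSetSpec{
			ServiceName:         "web",                            // headless service backing the stable network identity
			Replicas:            &replicas,                        // pods web-0, web-1, web-2, created in ascending order
			PodManagementPolicy: appsv1.OrderedReadyPodManagement, // serial create/delete (the default)
			Selector:            &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
			UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
				Type:          appsv1.RollingUpdateStatefulSetStrategyType,
				RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{Partition: &partition},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "nginx", Image: "nginx:1.17"}},
				},
			},
			// one PVC per pod (named www-web-<ordinal>), giving each pod stable storage
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "www"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
					},
				},
			}},
		},
	}
}

func main() {
	set := newExampleStatefulSet()
	fmt.Printf("statefulset %s/%s, replicas=%d\n", set.Namespace, set.Name, *set.Spec.Replicas)
}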

statefulset controller overview

The statefulset controller is one of the many controllers in the kube-controller-manager component and is the controller for statefulset resource objects. It watches statefulset and pod resources; when these resources change, the statefulset controller is triggered to reconcile the corresponding statefulset object, through which it performs pod creation, deletion, update and scaling for the statefulset, as well as statefulset rolling updates, statefulset status updates and cleanup of old statefulset revisions.

statefulset controller architecture diagram

The rough composition and processing flow of the statefulset controller are shown in the diagram below. The statefulset controller registers event handlers for statefulset and pod objects; when an event occurs it is watched and the corresponding statefulset object is put into a queue. The syncStatefulSet method holds the core logic with which the statefulset controller reconciles statefulset objects: statefulset objects are taken out of the queue and reconciled.
[statefulset controller architecture diagram]

statefulset pod naming rules, pod creation and deletion

If you create a statefulset object named web with replicas set to 3, its pods are named web-0, web-1 and web-2.

statefulset pods are created in order from 0 to n, and before the next pod is created, the previous pod must have been created and be in the ready state.

Using the same example: after the web statefulset is created, the 3 pods are created in the order web-0, web-1, web-2. web-1 will not be created before web-0 is ready, and web-2 will likewise not be created before web-1 is ready. If web-0 drops out of the ready state after web-1 becomes ready but before web-2 is created, web-2 will not be created until web-0 is ready again.

During a statefulset rolling update or scale-down, pods are deleted in order from n to 0, and before the next pod is deleted, the previous pod must have been completely deleted.

In addition, when statefulset.Spec.VolumeClaimTemplates defines the PVCs a pod needs, the statefulset controller creates the corresponding PVCs when it creates the pod; when a pod is deleted, however, its PVCs are not deleted and have to be removed manually.
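
For reference, the PVC the controller creates for a pod is named from the volume claim template name, the statefulset name and the pod ordinal (e.g. www-web-0). Below is a simplified stand-in for that naming helper; the real one, getPersistentVolumeClaimName, lives in pkg/controller/statefulset/stateful_set_utils.go, so treat this as a sketch rather than the exact source.

package main

import "fmt"

// claimName is a simplified stand-in for the controller's PVC naming helper:
// the PVC for a pod is "<claimTemplateName>-<setName>-<ordinal>".
func claimName(claimTemplateName, setName string, ordinal int) string {
	return fmt.Sprintf("%s-%s-%d", claimTemplateName, setName, ordinal)
}

func main() {
	// for a "web" statefulset with a "www" volume claim template:
	fmt.Println(claimName("www", "web", 0)) // www-web-0
	fmt.Println(claimName("www", "web", 2)) // www-web-2
}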

statefulset update strategies

(1) OnDelete: with the OnDelete update strategy, after the statefulset pod template has been updated, new statefulset pods are only created automatically once you manually delete the old statefulset pods.
(2) RollingUpdate: with the RollingUpdate update strategy, after the statefulset pod template has been updated, the old statefulset pods are deleted and new statefulset pods are created automatically according to the rolling update configuration. During a rolling update there is at most one statefulset pod per ordinal, and before the next pod is updated, the previous one must have finished updating and be ready. Unlike pod creation, which runs from 0 to n, a rolling update deletes and recreates pods in reverse order (i.e. from n to 0).

Rolling updates of a statefulset also support a Partition setting. With partition set, during a rolling update only the pods whose ordinal is greater than or equal to partition are rolled; the remaining pods are left untouched and are not updated.
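
A small sketch of what Partition means in practice, mirroring the updateMin bound used later in updateStatefulSet; ordinalsToUpdate is a hypothetical helper written for illustration, not part of the controller source.

package main

import "fmt"

// ordinalsToUpdate returns, highest ordinal first, the pods a rolling update will
// touch: only ordinals greater than or equal to partition are updated.
func ordinalsToUpdate(replicas, partition int) []int {
	var ords []int
	for ord := replicas - 1; ord >= partition; ord-- {
		ords = append(ords, ord)
	}
	return ords
}

func main() {
	// with replicas=5 and partition=3, only web-4 and web-3 are rolled;
	// web-0, web-1 and web-2 keep running the old revision
	fmt.Println(ordinalsToUpdate(5, 3)) // [4 3]
}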

The statefulset controller analysis is split into two parts:
(1) statefulset controller initialization and startup analysis;
(2) statefulset controller processing logic analysis.

1. statefulset controller initialization and startup analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

Let's go straight to the startStatefulSetController function, the entry point for the statefulset controller initialization and startup analysis.

startStatefulSetController

Main logic of startStatefulSetController:
(1) call statefulset.NewStatefulSetController to create and initialize a StatefulSetController;
(2) start a goroutine that runs the StatefulSetController's Run method.

// cmd/kube-controller-manager/app/apps.go
func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
		return nil, false, nil
	}
	go statefulset.NewStatefulSetController(
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Apps().V1().StatefulSets(),
		ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
		ctx.InformerFactory.Apps().V1().ControllerRevisions(),
		ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
	).Run(int(ctx.ComponentConfig.StatefulSetController.ConcurrentStatefulSetSyncs), ctx.Stop)
	return nil, true, nil
}
1.1 statefulset.NewStatefulSetController

The statefulset.NewStatefulSetController function shows that the statefulset controller registers EventHandlers for statefulset and pod objects, that is, it listens for events on these objects and puts them into the event queue for processing; the function also initializes the statefulset controller.

// pkg/controller/statefulset/stateful_set.go
func NewStatefulSetController(
	podInformer coreinformers.PodInformer,
	setInformer appsinformers.StatefulSetInformer,
	pvcInformer coreinformers.PersistentVolumeClaimInformer,
	revInformer appsinformers.ControllerRevisionInformer,
	kubeClient clientset.Interface,
) *StatefulSetController {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
	recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "statefulset-controller"})

	ssc := &StatefulSetController{
		kubeClient: kubeClient,
		control: NewDefaultStatefulSetControl(
			NewRealStatefulPodControl(
				kubeClient,
				setInformer.Lister(),
				podInformer.Lister(),
				pvcInformer.Lister(),
				recorder),
			NewRealStatefulSetStatusUpdater(kubeClient, setInformer.Lister()),
			history.NewHistory(kubeClient, revInformer.Lister()),
			recorder,
		),
		pvcListerSynced: pvcInformer.Informer().HasSynced,
		queue:           workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "statefulset"),
		podControl:      controller.RealPodControl{KubeClient: kubeClient, Recorder: recorder},

		revListerSynced: revInformer.Informer().HasSynced,
	}

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		// lookup the statefulset and enqueue
		AddFunc: ssc.addPod,
		// lookup current and old statefulset if labels changed
		UpdateFunc: ssc.updatePod,
		// lookup statefulset accounting for deletion tombstones
		DeleteFunc: ssc.deletePod,
	})
	ssc.podLister = podInformer.Lister()
	ssc.podListerSynced = podInformer.Informer().HasSynced

	setInformer.Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc: ssc.enqueueStatefulSet,
			UpdateFunc: func(old, cur interface{}) {
				oldPS := old.(*apps.StatefulSet)
				curPS := cur.(*apps.StatefulSet)
				if oldPS.Status.Replicas != curPS.Status.Replicas {
					klog.V(4).Infof("Observed updated replica count for StatefulSet: %v, %d->%d", curPS.Name, oldPS.Status.Replicas, curPS.Status.Replicas)
				}
				ssc.enqueueStatefulSet(cur)
			},
			DeleteFunc: ssc.enqueueStatefulSet,
		},
	)
	ssc.setLister = setInformer.Lister()
	ssc.setListerSynced = setInformer.Informer().HasSynced

	// TODO: Watch volumes
	return ssc
}
1.2 Run

The key part is the for loop: based on the value of workers (configurable via the kube-controller-manager startup parameter concurrent-statefulset-syncs, default 5), it starts that many goroutines, each running the ssc.worker method, which calls the statefulset controller's core processing method ssc.sync to reconcile statefulset objects.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer ssc.queue.ShutDown()

	klog.Infof("Starting stateful set controller")
	defer klog.Infof("Shutting down statefulset controller")

	if !cache.WaitForNamedCacheSync("stateful set", stopCh, ssc.podListerSynced, ssc.setListerSynced, ssc.pvcListerSynced, ssc.revListerSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(ssc.worker, time.Second, stopCh)
	}

	<-stopCh
}
1.2.1 ssc.worker

The worker takes an event key out of the queue and calls ssc.sync (analyzed in detail below) to reconcile the statefulset object. As mentioned earlier, the events in the queue come from the EventHandlers the statefulset controller registered for statefulset and pod objects: their change events are watched and put into the queue.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) worker() {
	for ssc.processNextWorkItem() {
	}
}

func (ssc *StatefulSetController) processNextWorkItem() bool {
	key, quit := ssc.queue.Get()
	if quit {
		return false
	}
	defer ssc.queue.Done(key)
	if err := ssc.sync(key.(string)); err != nil {
		utilruntime.HandleError(fmt.Errorf("Error syncing StatefulSet %v, requeuing: %v", key.(string), err))
		ssc.queue.AddRateLimited(key)
	} else {
		ssc.queue.Forget(key)
	}
	return true
}

2. statefulset controller core processing logic analysis

sync

Let's go straight to the statefulset controller's core processing entry point, the sync method.

Main logic:
(1) record the current time at the start of the method and define a defer function to compute the total execution time, i.e. how long one reconciliation of a statefulset takes;
(2) get the statefulset object by its namespace and name;
(3) call ssc.adoptOrphanRevisions to check for orphan controllerrevision objects (i.e. those whose metadata ownerReferences contain no entry with controller set to true); any that match the statefulset's selector get an ownerReference added so they are adopted;
(4) call ssc.getPodsForStatefulSet to look up the pod list by the statefulset's selector; orphan pods whose labels match the statefulset's selector are adopted, and already-associated pods whose labels no longer match the selector are released;
(5) call ssc.syncStatefulSet to reconcile the statefulset object.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) sync(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing statefulset %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	set, err := ssc.setLister.StatefulSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.Infof("StatefulSet has been deleted %v", key)
		return nil
	}
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to retrieve StatefulSet %v from store: %v", key, err))
		return err
	}

	selector, err := metav1.LabelSelectorAsSelector(set.Spec.Selector)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("error converting StatefulSet %v selector: %v", key, err))
		// This is a non-transient error, so don't retry.
		return nil
	}

	if err := ssc.adoptOrphanRevisions(set); err != nil {
		return err
	}

	pods, err := ssc.getPodsForStatefulSet(set, selector)
	if err != nil {
		return err
	}

	return ssc.syncStatefulSet(set, pods)
}
2.1 ssc.getPodsForStatefulSet

The main job of ssc.getPodsForStatefulSet is to get and return the list of pods belonging to the statefulset object, checking orphan pods and already-matched pods to decide whether the association between the statefulset and its pods needs to change.

Main logic:
(1) list all pods in the statefulset's namespace;
(2) define the filter function that selects pods belonging to the statefulset, i.e. the isMemberOf function (it matches pods to the statefulset by pod name and statefulset name; a sketch of this idea is shown after the code below);
(3) call cm.ClaimPods to filter out the pods belonging to this statefulset; orphan pods whose labels match the statefulset's selector are adopted, and already-associated pods whose labels no longer match the selector are released.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) getPodsForStatefulSet(set *apps.StatefulSet, selector labels.Selector) ([]*v1.Pod, error) {
	// List all pods to include the pods that don't match the selector anymore but
	// has a ControllerRef pointing to this StatefulSet.
	pods, err := ssc.podLister.Pods(set.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}

	filter := func(pod *v1.Pod) bool {
		// Only claim if it matches our StatefulSet name. Otherwise release/ignore.
		return isMemberOf(set, pod)
	}

	// If any adoptions are attempted, we should first recheck for deletion with
	// an uncached quorum read sometime after listing Pods (see #42639).
	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
		fresh, err := ssc.kubeClient.AppsV1().StatefulSets(set.Namespace).Get(set.Name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if fresh.UID != set.UID {
			return nil, fmt.Errorf("original StatefulSet %v/%v is gone: got uid %v, wanted %v", set.Namespace, set.Name, fresh.UID, set.UID)
		}
		return fresh, nil
	})

	cm := controller.NewPodControllerRefManager(ssc.podControl, set, selector, controllerKind, canAdoptFunc)
	return cm.ClaimPods(pods, filter)
}
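
For reference, here is a simplified sketch of the idea behind isMemberOf: the statefulset name and the pod ordinal are parsed out of the pod name, and a pod is considered a member when the parsed parent name equals the statefulset name. The real helpers (getParentNameAndOrdinal, isMemberOf) live in pkg/controller/statefulset/stateful_set_utils.go; the code below is an approximation written for readability, not a verbatim copy.

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statefulset pod names look like "<setName>-<ordinal>"
var statefulPodRegex = regexp.MustCompile("(.*)-([0-9]+)$")

// parentNameAndOrdinal splits a pod name into the owning statefulset's name and
// the pod's ordinal; the ordinal is -1 when the name does not match the pattern.
func parentNameAndOrdinal(podName string) (string, int) {
	matches := statefulPodRegex.FindStringSubmatch(podName)
	if len(matches) < 3 {
		return "", -1
	}
	ord, err := strconv.Atoi(matches[2])
	if err != nil {
		return "", -1
	}
	return matches[1], ord
}

// isMember mimics isMemberOf: a pod belongs to the statefulset when the parent
// name parsed from its pod name equals the statefulset name.
func isMember(setName, podName string) bool {
	parent, _ := parentNameAndOrdinal(podName)
	return parent == setName
}

func main() {
	fmt.Println(isMember("web", "web-0"))  // true
	fmt.Println(isMember("web", "web2-0")) // false
}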
2.2 ssc.syncStatefulSet

The ssc.syncStatefulSet method is arguably where the statefulset controller's core processing logic lives; the key call to look at is ssc.control.UpdateStatefulSet.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) syncStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {
	klog.V(4).Infof("Syncing StatefulSet %v/%v with %d pods", set.Namespace, set.Name, len(pods))
	// TODO: investigate where we mutate the set during the update as it is not obvious.
	if err := ssc.control.UpdateStatefulSet(set.DeepCopy(), pods); err != nil {
		return err
	}
	klog.V(4).Infof("Successfully synced StatefulSet %s/%s successful", set.Namespace, set.Name)
	return nil
}

Main logic of ssc.control.UpdateStatefulSet:
(1) get all ControllerRevisions of the statefulset and sort them from oldest to newest;
(2) call ssc.getStatefulSetRevisions to get the existing current revision and compute a new updated revision;
(3) call ssc.updateStatefulSet to perform the statefulset's pod creation, deletion, update and scaling operations;
(4) call ssc.updateStatefulSetStatus to update the statefulset object's status;
(5) call ssc.truncateHistory to clean up, in the sort order from step (1), historical statefulset revisions that no longer own any pods, according to the statefulset's configured revision history limit.

// pkg/controller/statefulset/stateful_set_control.go
func (ssc *defaultStatefulSetControl) UpdateStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {

	// list all revisions and sort them
	revisions, err := ssc.ListRevisions(set)
	if err != nil {
		return err
	}
	history.SortControllerRevisions(revisions)

	// get the current, and update revisions
	currentRevision, updateRevision, collisionCount, err := ssc.getStatefulSetRevisions(set, revisions)
	if err != nil {
		return err
	}

	// perform the main update function and get the status
	status, err := ssc.updateStatefulSet(set, currentRevision, updateRevision, collisionCount, pods)
	if err != nil {
		return err
	}

	// update the set's status
	err = ssc.updateStatefulSetStatus(set, status)
	if err != nil {
		return err
	}

	klog.V(4).Infof("StatefulSet %s/%s pod status replicas=%d ready=%d current=%d updated=%d",
		set.Namespace,
		set.Name,
		status.Replicas,
		status.ReadyReplicas,
		status.CurrentReplicas,
		status.UpdatedReplicas)

	klog.V(4).Infof("StatefulSet %s/%s revisions current=%s update=%s",
		set.Namespace,
		set.Name,
		status.CurrentRevision,
		status.UpdateRevision)

	// maintain the set's revision history limit
	return ssc.truncateHistory(set, pods, revisions, currentRevision, updateRevision)
}
2.2.1 ssc.updateStatefulSet

The updateStatefulSet method is the core of statefulset reconciliation: it performs pod creation, deletion, update and scaling for the statefulset object. The method is fairly long, so let's work through it step by step.

Main logic:
(1) The first for loop partitions all of the statefulset's pods into the replicas and condemned slices based on ord (the ordinal parsed from the pod name): pods whose ordinal is smaller than the statefulset's desired replica count go into replicas (ordinals start at 0, hence strictly smaller), and pods whose ordinal is greater than or equal to the desired replica count go into condemned. replicas holds the normal, usable pods; condemned holds the pods that need to be deleted. While iterating over the pods, it also computes the statefulset's status from the pod states;
(2) The second for loop: for each ordinal smaller than the desired replica count whose pod does not yet exist, build a pod object with that ordinal from the statefulset's pod template (no create request is sent to the apiserver yet; only the pod struct is constructed);
(3) The third and fourth for loops walk the replicas and condemned slices to find the unhealthy pod with the smallest ordinal, recording both the pod and its ordinal;
(4) If the statefulset object's DeletionTimestamp is not nil, return the newly computed status immediately and skip the rest of the method;
(5) Compute monotonic: when statefulset.Spec.PodManagementPolicy is Parallel, monotonic is false, otherwise true (Parallel means the statefulset controller may handle the pods of one statefulset in parallel; serial means it must wait for the previous pod to become ready, or to be deleted, before launching or terminating the next one);
(6) The fifth for loop walks the replicas slice and handles pod creation (including creating the PVCs defined in statefulset.Spec.VolumeClaimTemplates):
(6.1) if a pod is failed (pod.Status.Phase is Failed), delete it via the apiserver (its PVCs are not deleted here) and build a new pod struct for that ordinal in replicas (used to recreate the pod in the next step);
(6.2) if the pod for an ordinal has not been created yet, create it via the apiserver (including its PVCs); if monotonic is true (the statefulset is not configured as Parallel), return immediately, ending updateStatefulSet;
(6.3) the remaining logic handles the non-Parallel case, processing pods serially: before launching or terminating the next pod, it waits for the previous pod to become ready or be deleted; we won't expand on it here;
(7) The sixth for loop walks the condemned slice in reverse (pod ordinal from high to low) and deletes the excess pods; the deletion logic is also affected by Parallel and is not expanded here.
(8) Check the statefulset's update strategy; if it is OnDelete, return immediately (with this strategy, a pod is only recreated at its ordinal after someone deletes it manually);
(9) Get the Partition value from the rolling update configuration; during a rolling update, pods whose ordinal is smaller than Partition are not updated;
(10) The seventh for loop handles updates for statefulsets whose update strategy is RollingUpdate. A statefulset rolling update processes pods in descending ordinal order, deleting and then recreating each one, and the pod at the current ordinal must be recreated and Ready before the next pod is updated.

// pkg/controller/statefulset/stateful_set_control.go
func (ssc *defaultStatefulSetControl) updateStatefulSet(
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error) {
	// get the current and update revisions of the set.
	currentSet, err := ApplyRevision(set, currentRevision)
	if err != nil {
		return nil, err
	}
	updateSet, err := ApplyRevision(set, updateRevision)
	if err != nil {
		return nil, err
	}

	// set the generation, and revisions in the returned status
	status := apps.StatefulSetStatus{}
	status.ObservedGeneration = set.Generation
	status.CurrentRevision = currentRevision.Name
	status.UpdateRevision = updateRevision.Name
	status.CollisionCount = new(int32)
	*status.CollisionCount = collisionCount

	replicaCount := int(*set.Spec.Replicas)
	// slice that will contain all Pods such that 0 <= getOrdinal(pod) < set.Spec.Replicas
	replicas := make([]*v1.Pod, replicaCount)
	// slice that will contain all Pods such that set.Spec.Replicas <= getOrdinal(pod)
	condemned := make([]*v1.Pod, 0, len(pods))
	unhealthy := 0
	firstUnhealthyOrdinal := math.MaxInt32
	var firstUnhealthyPod *v1.Pod
    
    // first for loop: partition the statefulset's pods into the replicas and condemned slices; pods in condemned are the ones to be deleted
	// First we partition pods into two lists valid replicas and condemned Pods
	for i := range pods {
		status.Replicas++

		// count the number of running and ready replicas
		if isRunningAndReady(pods[i]) {
			status.ReadyReplicas++
		}

		// count the number of current and update replicas
		if isCreated(pods[i]) && !isTerminating(pods[i]) {
			if getPodRevision(pods[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(pods[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}
		}

		if ord := getOrdinal(pods[i]); 0 <= ord && ord < replicaCount {
			// if the ordinal of the pod is within the range of the current number of replicas,
			// insert it at the indirection of its ordinal
			replicas[ord] = pods[i]

		} else if ord >= replicaCount {
			// if the ordinal is greater than the number of replicas add it to the condemned list
			condemned = append(condemned, pods[i])
		}
		// If the ordinal could not be parsed (ord < 0), ignore the Pod.
	}
    
    // second for loop: for any ordinal below the desired replica count whose pod does not yet exist, build a pod object with that ordinal from the pod template (no create request is sent to the apiserver yet; only the struct is built)
	// for any empty indices in the sequence [0,set.Spec.Replicas) create a new Pod at the correct revision
	for ord := 0; ord < replicaCount; ord++ {
		if replicas[ord] == nil {
			replicas[ord] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name, ord)
		}
	}

	// sort the condemned Pods by their ordinals
	sort.Sort(ascendingOrdinal(condemned))
    
    // third and fourth for loops: walk replicas and condemned to find and record the unhealthy pod with the smallest ordinal
	// find the first unhealthy Pod
	for i := range replicas {
		if !isHealthy(replicas[i]) {
			unhealthy++
			if ord := getOrdinal(replicas[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = replicas[i]
			}
		}
	}
	for i := range condemned {
		if !isHealthy(condemned[i]) {
			unhealthy++
			if ord := getOrdinal(condemned[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = condemned[i]
			}
		}
	}

	if unhealthy > 0 {
		klog.V(4).Infof("StatefulSet %s/%s has %d unhealthy Pods starting with %s",
			set.Namespace,
			set.Name,
			unhealthy,
			firstUnhealthyPod.Name)
	}
    
    // if the statefulset's DeletionTimestamp is not nil, return the newly computed status and skip the rest of the method
	// If the StatefulSet is being deleted, don't do anything other than updating
	// status.
	if set.DeletionTimestamp != nil {
		return &status, nil
	}
    
    // compute monotonic: false when statefulset.Spec.PodManagementPolicy is Parallel, true otherwise
	monotonic := !allowsBurst(set)
    
    // fifth for loop: walk the replicas slice and handle pod creation
	// Examine each replica with respect to its ordinal
	for i := range replicas {
		// delete and recreate failed pods
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
		// If we find a Pod that has not been created we create the Pod
		if !isCreated(replicas[i]) {
			if err := ssc.podControl.CreateStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			status.Replicas++
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}

			// if the set does not allow bursting, return immediately
			if monotonic {
				return &status, nil
			}
			// pod created, no more work possible for this round
			continue
		}
		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not running and ready we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Running and Ready.
		if !isRunningAndReady(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// Enforce the StatefulSet invariants
		if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) {
			continue
		}
		// Make a deep copy so we don't mutate the shared cache
		replica := replicas[i].DeepCopy()
		if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
			return &status, err
		}
	}
    
    // sixth for loop: walk the condemned slice in reverse (ordinal from high to low) and delete the excess pods
	// At this point, all of the current Replicas are Running and Ready, we can consider termination.
	// We will wait for all predecessors to be Running and Ready prior to attempting a deletion.
	// We will terminate Pods in a monotonically decreasing order over [len(pods),set.Spec.Replicas).
	// Note that we do not resurrect Pods in this interval. Also note that scaling will take precedence over
	// updates.
	for target := len(condemned) - 1; target >= 0; target-- {
		// wait for terminating pods to expire
		if isTerminating(condemned[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate prior to scale down",
				set.Namespace,
				set.Name,
				condemned[target].Name)
			// block if we are in monotonic mode
			if monotonic {
				return &status, nil
			}
			continue
		}
		// if we are in monotonic mode and the condemned target is not the first unhealthy Pod block
		if !isRunningAndReady(condemned[target]) && monotonic && condemned[target] != firstUnhealthyPod {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready prior to scale down",
				set.Namespace,
				set.Name,
				firstUnhealthyPod.Name)
			return &status, nil
		}
		klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for scale down",
			set.Namespace,
			set.Name,
			condemned[target].Name)

		if err := ssc.podControl.DeleteStatefulPod(set, condemned[target]); err != nil {
			return &status, err
		}
		if getPodRevision(condemned[target]) == currentRevision.Name {
			status.CurrentReplicas--
		}
		if getPodRevision(condemned[target]) == updateRevision.Name {
			status.UpdatedReplicas--
		}
		if monotonic {
			return &status, nil
		}
	}
    
    // if the update strategy is OnDelete, return immediately (pods are only recreated after being deleted manually)
	// for the OnDelete strategy we short circuit. Pods will be updated when they are manually deleted.
	if set.Spec.UpdateStrategy.Type == apps.OnDeleteStatefulSetStrategyType {
		return &status, nil
	}
    
    // get Partition from the rolling update config; during a rolling update, pods whose ordinal is smaller than Partition are not updated
	// we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
	updateMin := 0
	if set.Spec.UpdateStrategy.RollingUpdate != nil {
		updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)
	}
	
	// seventh for loop: handle updates for statefulsets with the RollingUpdate strategy
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
			status.CurrentReplicas--
			return &status, err
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			return &status, nil
		}

	}
	return &status, nil
}
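
The loops above lean on a handful of small pod predicates (isCreated, isFailed, isTerminating, isRunningAndReady, isHealthy) plus allowsBurst. Simplified sketches of them are shown below for readability; the real versions live in pkg/controller/statefulset/stateful_set_utils.go, so treat these as approximations rather than the exact source (the package aliases v1, apps and podutil are assumed to match the surrounding excerpts).

// pkg/controller/statefulset/stateful_set_utils.go (simplified sketch, not verbatim)
// assumed aliases: v1 = k8s.io/api/core/v1, apps = k8s.io/api/apps/v1, podutil = k8s.io/kubernetes/pkg/api/v1/pod

// isCreated: the pod object has been created (its phase has been set)
func isCreated(pod *v1.Pod) bool { return pod.Status.Phase != "" }

// isFailed: the pod has reached the Failed phase
func isFailed(pod *v1.Pod) bool { return pod.Status.Phase == v1.PodFailed }

// isTerminating: the pod has been marked for deletion
func isTerminating(pod *v1.Pod) bool { return pod.DeletionTimestamp != nil }

// isRunningAndReady: the pod is Running and its Ready condition is true
func isRunningAndReady(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodRunning && podutil.IsPodReady(pod)
}

// isHealthy: running, ready and not terminating
func isHealthy(pod *v1.Pod) bool { return isRunningAndReady(pod) && !isTerminating(pod) }

// allowsBurst: true when PodManagementPolicy is Parallel, i.e. pods may be
// created/deleted without waiting for their predecessors
func allowsBurst(set *apps.StatefulSet) bool {
	return set.Spec.PodManagementPolicy == apps.ParallelPodManagement
}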

Combining the analysis above, here is how the steps of this method map to pod creation, deletion, scaling and updating for the statefulset object:
1. creation: mainly (6), the fifth for loop;
2. deletion: mainly (7), the sixth for loop;
3. scaling: (1)~(7);
4. updating: mainly (8), (9) and (10) (the seventh for loop), where (8) handles the OnDelete update strategy and (9)(10) handle the rolling update strategy.

Summary

statefulset controller architecture diagram

The rough composition and processing flow of the statefulset controller are shown in the diagram below. The statefulset controller registers event handlers for statefulset and pod objects; when an event occurs it is watched and the corresponding statefulset object is put into a queue. The syncStatefulSet method holds the core logic with which the statefulset controller reconciles statefulset objects: statefulset objects are taken out of the queue and reconciled.
[statefulset controller architecture diagram]

statefulset controller core processing logic

The core processing logic of the statefulset controller is reconciling statefulset objects, through which it performs pod creation, deletion, update and scaling for the statefulset, statefulset rolling updates, statefulset status updates and cleanup of old statefulset revisions.
[statefulset controller core processing logic diagram]

statefulset update strategies

(1) OnDelete: with the OnDelete update strategy, after the statefulset pod template has been updated, new statefulset pods are only created automatically once you manually delete the old statefulset pods.
(2) RollingUpdate: with the RollingUpdate update strategy, after the statefulset pod template has been updated, the old statefulset pods are deleted and new statefulset pods are created automatically according to the rolling update configuration. During a rolling update there is at most one statefulset pod per ordinal, and before the next pod is updated, the previous one must have finished updating and be ready. Unlike pod creation, which runs from 0 to n, a rolling update deletes and recreates pods in reverse order (i.e. from n to 0).

Rolling updates of a statefulset also support a Partition setting. With partition set, during a rolling update only the pods whose ordinal is greater than or equal to partition are rolled; the remaining pods are left untouched and are not updated.

statefulset pod naming rules, pod creation and deletion

If you create a statefulset object named web with replicas set to 3, its pods are named web-0, web-1 and web-2.

statefulset pods are created in order from 0 to n, and before the next pod is created, the previous pod must have been created and be in the ready state.

Using the same example: after the web statefulset is created, the 3 pods are created in the order web-0, web-1, web-2. web-1 will not be created before web-0 is ready, and web-2 will likewise not be created before web-1 is ready. If web-0 drops out of the ready state after web-1 becomes ready but before web-2 is created, web-2 will not be created until web-0 is ready again.

During a statefulset rolling update or scale-down, pods are deleted in order from n to 0, and before the next pod is deleted, the previous pod must have been completely deleted.

In addition, when statefulset.Spec.VolumeClaimTemplates defines the PVCs a pod needs, the statefulset controller creates the corresponding PVCs when it creates the pod; when a pod is deleted, however, its PVCs are not deleted and have to be removed manually.
