Example

Field meanings

- podAffinity: defines affinity between Pods
- podAntiAffinity: defines anti-affinity between Pods
- requiredDuringSchedulingIgnoredDuringExecution: hard requirement; the condition must be satisfied, which is the form to use when replicas must be spread apart. Sub-fields: labelSelector, topologyKey
- preferredDuringSchedulingIgnoredDuringExecution: soft requirement; it may be only partially satisfied, so several replicas may still end up on the same node. Sub-fields: weight, podAffinityTerm (labelSelector, topologyKey)
topologyKey can be thought of as a Node label: Nodes that carry the same value for that label belong to the same topology domain.
For example, label three nodes as follows:
- Node1 —— zone:beijing
- Node2 —— zone:shanghai
- Node3 —— zone:beijing
Then Node1 and Node3 form one topology domain, while Node2 forms another.
topologyKey: kubernetes.io/hostname
The line above is the most common configuration. Running kubectl get nodes --show-labels shows that every node
carries the kubernetes.io/hostname label,
so using it as the topologyKey makes every node its own topology domain.
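To make the grouping concrete, here is a minimal, self-contained Go sketch (not scheduler code; node names and labels are hypothetical) that groups nodes into topology domains by a topology key:

package main

import "fmt"

func main() {
	// Hypothetical node labels, matching the zone example above.
	nodeLabels := map[string]map[string]string{
		"node1": {"zone": "beijing"},
		"node2": {"zone": "shanghai"},
		"node3": {"zone": "beijing"},
	}
	topologyKey := "zone"

	// Nodes that share the same value for topologyKey land in the same domain.
	domains := map[string][]string{}
	for name, labels := range nodeLabels {
		if v, ok := labels[topologyKey]; ok {
			domains[v] = append(domains[v], name)
		}
	}
	fmt.Println(domains) // e.g. map[beijing:[node1 node3] shanghai:[node2]] (slice order may vary)
}

The full Pod example below puts these fields together.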
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  affinity:
    # First, labelSelector selects all Pods carrying the label service.cpaas.io/name: deployment-nginx
    # Then, per podAffinity, this Pod is scheduled onto a Node that shares the topologyKey value with the Nodes of the selected Pods
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            service.cpaas.io/name: deployment-nginx
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchLabels:
            service.cpaas.io/name: deployment-busybox
        topologyKey: kubernetes.io/hostname
    # First, labelSelector selects Pods carrying a label with key a and value b or c
    # Then, per podAntiAffinity (a soft preference here), the scheduler tries to place this Pod on a Node different from those of the selected Pods; Nodes are compared by their topologyKey label
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: a
              operator: In
              values: ["b", "c"]
          topologyKey: kubernetes.io/hostname
  containers:
  - name: test-pod
    image: nginx:1.18


Code analysis

Code path: pkg/scheduler/framework/plugins/interpodaffinity


Looking at the source through the scheduler framework, the plugin implements the following four extension points (see the interface assertions sketched right after the list):

  1. PreFilter
  2. Filter
  3. PreScore
  4. Score
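
The upstream plugin file contains compile-time assertions along these lines (a sketch; only the interface names matter here), which make the four extension points explicit:

var _ framework.PreFilterPlugin = &InterPodAffinity{}
var _ framework.FilterPlugin = &InterPodAffinity{}
var _ framework.PreScorePlugin = &InterPodAffinity{}
var _ framework.ScorePlugin = &InterPodAffinity{}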

First, two points to keep in mind:

  • The plugin reasons about affinity and anti-affinity between Pods (i.e. the relationship between the new Pod and existing Pods)
  • But the final result is placing the Pod onto a suitable Node (so Node information has to be recorded)

1 | PreFilter

This step does the following:

  1. Work out which existing Pods "dislike" the new Pod, and record the Node information of those matching existing Pods as existingPodAntiAffinityMap
  2. Work out which existing Pods the new Pod "likes", and record the Node information of those matching existing Pods as incomingPodAffinityMap
  3. Work out which existing Pods the new Pod "dislikes", and record the Node information of those matching existing Pods as incomingPodAntiAffinityMap

A quick summary:

  • Nodes recorded in existingPodAntiAffinityMap and incomingPodAntiAffinityMap are nodes the new Pod should avoid
  • Nodes recorded in incomingPodAffinityMap are nodes the new Pod prefers

Question: why not also work out which existing Pods "like" the new Pod?

  • Because we are scheduling the new Pod: as long as it is not rejected by, and does not disturb, the existing Pods, we are fine; so we only need to consider the existing Pods whose rules could affect the placement

Note that the "conditions" above refer to the hard requirement requiredDuringSchedulingIgnoredDuringExecution, which is why they are examined in such detail

// Only the important PreFilter functions are excerpted here
// pkg/scheduler/framework/plugins/interpodaffinity/filtering.go

// Consider the anti-affinity of existing Pods
// Simply put: evaluate the anti-affinity Terms of existing Pods against the new Pod, and record the Nodes where they match; those nodes cannot be used (the existing Pods there reject the new Pod)
// The anti-affinity Terms here are the hard requirements defined by requiredDuringSchedulingIgnoredDuringExecution
// Question: why not consider the affinity of existing Pods? Because an existing Pod's affinity (towards Pods that existed before it) was already honored when it was scheduled; now only what it rejects matters
// At the code level:
// 1. Iterate over all existing Pods that have anti-affinity
// 2. If the new Pod about to be scheduled matches that Pod's anti-affinity Terms,
// 3. record it in existingPodAntiAffinityMap, keyed by the node info of that Pod's node (topologyKey, topologyValue), with the number of matching Terms as the value
// e.g. map{(hostname:node01):1}
// existingPodAntiAffinityMap will be used later for efficient check on existing pods' anti-affinity
existingPodAntiAffinityMap := getTPMapMatchingExistingAntiAffinity(pod, nodesWithRequiredAntiAffinityPods)

// Consider the affinity and anti-affinity of the new Pod
// Simply put: evaluate the new Pod's affinity and anti-affinity Terms against the existing Pods, and record the Nodes where they match
// incomingPodAffinityMap will be used later for efficient check on incoming pod's affinity
// incomingPodAntiAffinityMap will be used later for efficient check on incoming pod's anti-affinity
incomingPodAffinityMap, incomingPodAntiAffinityMap := getTPMapMatchingIncomingAffinityAntiAffinity(podInfo, allNodes)
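
For reference, the maps built here are keyed by (topologyKey, topologyValue) pairs. A minimal sketch of that structure, modeled on the topologyToMatchedTermCount type in filtering.go (treat the names as illustrative):

// Illustrative sketch of the map structure used by PreFilter.
type topologyPair struct {
	key   string // e.g. "kubernetes.io/hostname"
	value string // e.g. "node01"
}

// For each topology domain, how many required (anti-)affinity terms matched.
type topologyToMatchedTermCount map[topologyPair]int64

// Example: one existing Pod on node01 has a hard anti-affinity term that the
// new Pod matches -> {(kubernetes.io/hostname, node01): 1}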

2 | Filter

  • *framework.CycleState carries over the statistics gathered above
  • The work here is:
    • a Node is handed in
    • check that Node against the existingPodAntiAffinityMap, incomingPodAntiAffinityMap and incomingPodAffinityMap built above
    • if the Node satisfies the conditions, it moves on to the scoring phase below
// pkg/scheduler/framework/plugins/interpodaffinity/filtering.go
func (pl *InterPodAffinity) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo.Node() == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}

	state, err := getPreFilterState(cycleState)
	if err != nil {
		return framework.NewStatus(framework.Error, err.Error())
	}

	if !satisfyPodAffinity(state, nodeInfo) {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonAffinityNotMatch, ErrReasonAffinityRulesNotMatch)
	}

	if !satisfyPodAntiAffinity(state, nodeInfo) {
		return framework.NewStatus(framework.Unschedulable, ErrReasonAffinityNotMatch, ErrReasonAntiAffinityRulesNotMatch)
	}

	if !satisfyExistingPodsAntiAffinity(state, nodeInfo) {
		return framework.NewStatus(framework.Unschedulable, ErrReasonAffinityNotMatch, ErrReasonExistingAntiAffinityRulesNotMatch)
	}

	return nil
}
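
To make the satisfy* checks concrete, here is a minimal sketch (not the upstream implementation) of the anti-affinity case, reusing the topologyToMatchedTermCount sketch from the PreFilter section:

// A node is rejected if any of its labels forms a (topologyKey, value) pair
// that appears in the anti-affinity map with a positive count, i.e. some
// matching Pod already lives in that topology domain.
func nodeBlockedByAntiAffinity(nodeLabels map[string]string, m topologyToMatchedTermCount) bool {
	for pair, count := range m {
		if count > 0 && nodeLabels[pair.key] == pair.value {
			return true
		}
	}
	return false
}

The affinity check is roughly the mirror image: every hard affinity term of the new Pod must find a match in the candidate node's topology domain.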

3 | PreScore

This part is mainly about the processExistingPod function

  • Nodes are scored according to the soft requirements (preferredDuringSchedulingIgnoredDuringExecution) of both the existing Pods and the new Pod (plus, as the code below shows, the hard affinity terms of existing Pods weighted by HardPodAffinityWeight)
// pkg/scheduler/framework/plugins/interpodaffinity/scoring.go
// PreScore builds and writes cycle state used by Score and NormalizeScore.
func (pl *InterPodAffinity) PreScore(
	pCtx context.Context,
	cycleState *framework.CycleState,
	pod *v1.Pod,
	nodes []*v1.Node,
) *framework.Status {
  
  // ... ...
	topoScores := make([]scoreMap, len(allNodes))
	index := int32(-1)
	processNode := func(i int) {
		nodeInfo := allNodes[i]
		if nodeInfo.Node() == nil {
			return
		}
		// Unless the pod being scheduled has affinity terms, we only
		// need to process pods with affinity in the node.
		podsToProcess := nodeInfo.PodsWithAffinity
		if hasAffinityConstraints || hasAntiAffinityConstraints {
			// We need to process all the pods.
			podsToProcess = nodeInfo.Pods
		}

		topoScore := make(scoreMap)
		for _, existingPod := range podsToProcess {
			pl.processExistingPod(state, existingPod, nodeInfo, pod, topoScore)
		}
		if len(topoScore) > 0 {
			topoScores[atomic.AddInt32(&index, 1)] = topoScore
		}
	}
	parallelize.Until(context.Background(), len(allNodes), processNode)

	for i := 0; i <= int(index); i++ {
		state.topologyScore.append(topoScores[i])
	}

	cycleState.Write(preScoreStateKey, state)
	return nil
}

func (pl *InterPodAffinity) processExistingPod(
	state *preScoreState,
	existingPod *framework.PodInfo,
	existingPodNodeInfo *framework.NodeInfo,
	incomingPod *v1.Pod,
	topoScore scoreMap,
) {
	existingPodNode := existingPodNodeInfo.Node()

	// For every soft pod affinity term of <pod>, if <existingPod> matches the term,
	// increment <p.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPods>`s node by the term`s weight.
	topoScore.processTerms(state.podInfo.PreferredAffinityTerms, existingPod.Pod, existingPodNode, 1)

	// For every soft pod anti-affinity term of <pod>, if <existingPod> matches the term,
	// decrement <p.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPod>`s node by the term`s weight.
	topoScore.processTerms(state.podInfo.PreferredAntiAffinityTerms, existingPod.Pod, existingPodNode, -1)

	// For every hard pod affinity term of <existingPod>, if <pod> matches the term,
	// increment <p.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPod>'s node by the constant <args.hardPodAffinityWeight>
	if pl.args.HardPodAffinityWeight > 0 {
		for _, term := range existingPod.RequiredAffinityTerms {
			t := framework.WeightedAffinityTerm{AffinityTerm: term, Weight: pl.args.HardPodAffinityWeight}
			topoScore.processTerm(&t, incomingPod, existingPodNode, 1)
		}
	}

	// For every soft pod affinity term of <existingPod>, if <pod> matches the term,
	// increment <p.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPod>'s node by the term's weight.
	topoScore.processTerms(existingPod.PreferredAffinityTerms, incomingPod, existingPodNode, 1)

	// For every soft pod anti-affinity term of <existingPod>, if <pod> matches the term,
	// decrement <pm.counts> for every node in the cluster with the same <term.TopologyKey>
	// value as that of <existingPod>'s node by the term's weight.
	topoScore.processTerms(existingPod.PreferredAntiAffinityTerms, incomingPod, existingPodNode, -1)
}
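
The scoreMap used above accumulates weights per topology domain. A minimal, self-contained sketch of the idea (the upstream type and its processTerm/processTerms methods live in scoring.go; this version is illustrative):

// topologyKey -> topologyValue -> accumulated weight.
type scoreMap map[string]map[string]int64

// add records the term weight (negative for anti-affinity) for the topology
// domain that the existing Pod's node belongs to, assuming the checked Pod
// matched the term's labelSelector.
func (s scoreMap) add(topologyKey, topologyValue string, weight int64) {
	if s[topologyKey] == nil {
		s[topologyKey] = make(map[string]int64)
	}
	s[topologyKey][topologyValue] += weight
}

Score then simply sums s[tpKey][node.Labels[tpKey]] over the topology keys the candidate node carries, which is exactly the loop in the Score function below.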

4 | Score

Here the accumulated per-topology scores are summed to produce the score of the node being evaluated

  • Note: this function is called for every node that passed filtering, and each node receives its own score
// pkg/scheduler/framework/plugins/interpodaffinity/scoring.go
// Score invoked at the Score extension point.
// The "score" returned in this function is the sum of weights got from cycleState which have its topologyKey matching with the node's labels.
// it is normalized later.
// Note: the returned "score" is positive for pod-affinity, and negative for pod-antiaffinity.
func (pl *InterPodAffinity) Score(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.sharedLister.NodeInfos().Get(nodeName)
	if err != nil || nodeInfo.Node() == nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v, node is nil: %v", nodeName, err, nodeInfo.Node() == nil))
	}
	node := nodeInfo.Node()

	s, err := getPreScoreState(cycleState)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, err.Error())
	}
	var score int64
	for tpKey, tpValues := range s.topologyScore {
		if v, exist := node.Labels[tpKey]; exist {
			score += tpValues[v]
		}
	}

	return score, nil
}
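A small, self-contained sketch of that summation with made-up topology scores and node labels:

package main

import "fmt"

func main() {
	// Hypothetical weights accumulated in PreScore, keyed by topology.
	topologyScore := map[string]map[string]int64{
		"kubernetes.io/hostname": {"node01": 100, "node02": -100},
		"zone":                   {"beijing": 50},
	}
	nodeLabels := map[string]string{"kubernetes.io/hostname": "node01", "zone": "beijing"}

	// Same loop as in Score: add the weight for every topology pair the node carries.
	var score int64
	for tpKey, tpValues := range topologyScore {
		if v, ok := nodeLabels[tpKey]; ok {
			score += tpValues[v]
		}
	}
	fmt.Println(score) // 150 for this node; raw scores are then normalized across nodes
}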