Background:
This article covers the principles, methods, and practice of Kubernetes scheduling, followed by some source code analysis. All source code in this article is based on version 1.17.
References:
https://www.cnblogs.com/xzkzzz/p/9963511.html (thanks to the author for an excellent summary)
Official documentation, scheduling section:
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Introduction to the scheduler
When the scheduler learns about a newly created Pod through the API server's watch interface, it examines the list of Nodes that could satisfy the Pod's requirements and runs its scheduling logic; once scheduling succeeds, the Pod is bound to the target node. The scheduler sits in the middle of the system: upstream, it receives newly created Pods and finds each one a home (a Node); downstream, once placement is done, the kubelet on the target Node takes over and manages the rest of the Pod's lifecycle. Concretely, the scheduler binds each pending Pod to a suitable Node in the cluster according to a specific scheduling algorithm and policy, and writes the binding back through the API server into etcd. The whole process involves three things: the list of Pods waiting to be scheduled, the list of available Nodes, and the scheduling algorithms and policies.
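To make the watch-and-bind flow concrete, here is a heavily simplified, hypothetical sketch of a toy "scheduler" written against client-go (assuming client-go v0.17.x, where Watch and Bind do not yet take a context; the kubeconfig path and the node name are placeholders). It only watches for pods with no node assigned and binds each one to a hard-coded node, skipping all predicate and priority logic:

package main

import (
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (assumes we run outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Watch pods that have not been scheduled yet (spec.nodeName is empty).
	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(metav1.ListOptions{
		FieldSelector: "spec.nodeName=",
	})
	if err != nil {
		log.Fatal(err)
	}

	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok || event.Type != "ADDED" {
			continue
		}
		// A real scheduler runs predicates and priorities here to pick the node.
		target := "node-1" // placeholder node name

		// The Binding object is what "scheduling" ultimately writes back through the API server.
		binding := &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
			Target:     v1.ObjectReference{Kind: "Node", Name: target},
		}
		if err := clientset.CoreV1().Pods(pod.Namespace).Bind(binding); err != nil {
			log.Printf("bind %s/%s failed: %v", pod.Namespace, pod.Name, err)
		}
	}
}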

The Kubernetes scheduler runs in three steps (a toy sketch of the whole flow follows this list):

Predicates (predicate): walk the node list and keep only the candidate nodes that meet the Pod's requirements; Kubernetes ships with a number of built-in predicate rules to choose from.
Priorities (priority): score each remaining candidate with the priority rules and keep the highest-scoring node.
Select: if several nodes share the highest score, pick one of them at random.
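To make these three phases concrete, here is an illustrative toy in plain Go (not the real kube-scheduler code; filterNodes, scoreNodes, selectNode and the hard-coded node list are all made up), filtering on memory, scoring by free memory, and breaking ties randomly:

package main

import (
	"fmt"
	"math/rand"
)

// Toy stand-ins for the scheduler's Pod and Node data.
type node struct {
	name          string
	allocatableMi int64 // allocatable memory in MiB
}

type pod struct {
	requestMi int64 // requested memory in MiB
}

// Predicate phase: keep only the nodes that can satisfy the pod at all.
func filterNodes(p pod, nodes []node) []node {
	var fit []node
	for _, n := range nodes {
		if n.allocatableMi >= p.requestMi {
			fit = append(fit, n)
		}
	}
	return fit
}

// Priority phase: give every remaining node a score (here: most free memory wins).
func scoreNodes(p pod, nodes []node) map[string]int64 {
	scores := make(map[string]int64, len(nodes))
	for _, n := range nodes {
		scores[n.name] = n.allocatableMi - p.requestMi
	}
	return scores
}

// Select phase: pick the highest score, breaking ties randomly.
func selectNode(scores map[string]int64) string {
	var best []string
	bestScore := int64(-1)
	for name, s := range scores {
		switch {
		case s > bestScore:
			bestScore, best = s, []string{name}
		case s == bestScore:
			best = append(best, name)
		}
	}
	if len(best) == 0 {
		return "" // no feasible node
	}
	return best[rand.Intn(len(best))]
}

func main() {
	nodes := []node{{"node-1", 2048}, {"node-2", 4096}, {"node-3", 1024}}
	p := pod{requestMi: 1500}

	fit := filterNodes(p, nodes)    // predicates: node-3 is filtered out
	scores := scoreNodes(p, fit)    // priorities: node-2 has the most headroom
	fmt.Println(selectNode(scores)) // select: prints node-2
}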

Predicates
CheckNodeConditionPred checks whether the node itself is healthy.
GeneralPred covers HostName (if the pod sets pod.spec.hostname, check that the node matches it) and PodFitsHostPorts (check that the hostPorts the pod wants to expose, pod.spec.containers.ports.hostPort, are not already taken; a toy version of this check follows the list).
MatchNodeSelector checks whether the node's labels satisfy the pod's pod.spec.nodeSelector.
PodFitsResources checks whether the node can satisfy the pod's resource requests (e.g. if a pod requests at least 2C4G, nodes with less than that are not considered; use kubectl describe node NODE_NAME to see resource usage).
NoDiskConflict checks whether the volumes the pod defines conflict with volumes already in use on the node (not enabled by default).
PodToleratesNodeTaints checks whether the pod's tolerations (pod.spec.tolerations) can tolerate the node's taints.
CheckNodeLabelPresence checks for the presence of certain labels on the node (not enabled by default).
CheckServiceAffinity tries to place pods that belong to the same Service onto the same node (not enabled by default).
CheckVolumeBinding checks whether the pod's volumes can be bound (not enabled by default).
NoVolumeZoneConflict checks whether the pod's volumes and the node are in the same zone (not enabled by default).
CheckNodeMemoryPressure checks whether the node is under memory pressure.
CheckNodeDiskPressure checks whether the node is under disk I/O pressure.
CheckNodePIDPressure checks whether the node is under PID pressure.
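To make one of these entries concrete, here is a minimal, made-up version of a PodFitsHostPorts-style check (toy types, not the real 1.17 predicate signature): the node fails as soon as one of the pod's hostPorts is already taken.

package main

import "fmt"

// usedHostPorts represents hostPorts already claimed by pods on a node.
type usedHostPorts map[int32]bool

// fitsHostPorts mimics the idea behind PodFitsHostPorts: the node fails the
// predicate if any hostPort the pod wants to expose is already in use there.
func fitsHostPorts(wanted []int32, used usedHostPorts) bool {
	for _, p := range wanted {
		if used[p] {
			return false
		}
	}
	return true
}

func main() {
	nodePorts := usedHostPorts{8080: true, 443: true}
	fmt.Println(fitsHostPorts([]int32{80, 9090}, nodePorts)) // true: node stays in the candidate list
	fmt.Println(fitsHostPorts([]int32{8080}, nodePorts))     // false: node is filtered out
}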
Predicate (filtering) source code

Priorities
least_requested prefers the node with the lowest consumption, scored by the free ratio, e.g. cpu((capacity - sum(requested)) * MaxNodeScore / capacity) (see the sketch after this list).
balanced_resource_allocation prefers the node whose resource usage (CPU and memory) is most balanced.
node_prefer_avoid_pods honors the node's preference to avoid certain pods.
taint_toleration checks the pod's spec.tolerations against the node's taints; the more taints the pod cannot tolerate, the lower the node's score.
selector_spreading tries to keep pods of the same Service off the same node; the fewer pods of that Service already on the node, the higher its score.
interpod_affinity walks the pod-affinity terms that match the node; the more matches, the higher the score.
most_requested prefers the node with the highest consumption (try to fill a node before using the next one).
node_label scores nodes by labels: a node with the label scores, one without does not; more labels, higher score.
image_locality gives a higher score to nodes that already have the images the pod needs, based on the total size of the images already present.
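To make the least_requested formula concrete, here is a small sketch of the calculation (the function name and the sample numbers are illustrative; the multiplier in the 1.17 tree is framework.MaxNodeScore, i.e. 100):

package main

import "fmt"

const maxNodeScore = 100 // framework.MaxNodeScore in the 1.17 tree

// leastRequestedScore mirrors the idea of the least_requested priority:
// the more capacity is left free after the pod's request, the higher the score.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * maxNodeScore / capacity
}

func main() {
	// Example: a node with 4000m of CPU capacity.
	fmt.Println(leastRequestedScore(1000, 4000)) // 75: a lightly loaded node scores high
	fmt.Println(leastRequestedScore(3500, 4000)) // 12: a nearly full node scores low
}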
Priority source code:
https://github.com/kubernetes/kubernetes/tree/v1.17.0/pkg/scheduler/algorithm/priorities

All of the priority policies defined in the code:

package priorities

const (
	// EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
	EqualPriority = "EqualPriority"
	// MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
	MostRequestedPriority = "MostRequestedPriority"
	// RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
	RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
	// SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
	// the number of pods (belonging to the same service or replication controller) on the same node.
	SelectorSpreadPriority = "SelectorSpreadPriority"
	// ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
	ServiceSpreadingPriority = "ServiceSpreadingPriority"
	// InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
	// should not be placed in the same topological domain as some other pods.
	InterPodAffinityPriority = "InterPodAffinityPriority"
	// LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
	// requested utilization.
	LeastRequestedPriority = "LeastRequestedPriority"
	// BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
	// to help achieve balanced resource usage.
	BalancedResourceAllocation = "BalancedResourceAllocation"
	// NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
	// the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
	NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
	// NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
	// matching NodeAffinity.
	NodeAffinityPriority = "NodeAffinityPriority"
	// TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
	// with taint which pod can tolerate.
	TaintTolerationPriority = "TaintTolerationPriority"
	// ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
	// requested by the pod present.
	ImageLocalityPriority = "ImageLocalityPriority"
	// ResourceLimitsPriority defines the nodes of prioritizer function ResourceLimitsPriority.
	ResourceLimitsPriority = "ResourceLimitsPriority"
	// EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
	// which have pods and labels matching the incoming pod's topologySpreadConstraints.
	EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)

Let's take image_locality as an example and see how its score is calculated:

// calculatePriority returns the priority of a node. Given the sumScores of requested images on the node, the node's
// priority is obtained by scaling the maximum priority value with a ratio proportional to the sumScores.
func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}

	return int(int64(framework.MaxNodeScore) * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

// sumImageScores returns the sum of image scores of all the containers that are already on the node.
// Each image receives a raw score of its size, scaled by scaledImageScore. The raw scores are later used to calculate
// the final score. Note that the init containers are not considered for it's rare for users to deploy huge init containers.
func sumImageScores(nodeInfo *schedulernodeinfo.NodeInfo, containers []v1.Container, totalNumNodes int) int64 {
	var sum int64
	imageStates := nodeInfo.ImageStates()

	for _, container := range containers {
		if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
	}

	return sum
}

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *schedulernodeinfo.ImageStateSummary, totalNumNodes int) int64 {
	spread := float64(imageState.NumNodes) / float64(totalNumNodes)
	return int64(float64(imageState.Size) * spread)
}

As the snippet below shows, the scoring walks the pod's container images and adds a scaled score for every image already present on the node; the more (and the larger) the images already there, the higher the score. A rough worked example follows the snippet.

if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
	sum += scaledImageScore(state, totalNumNodes)
}
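As a rough worked example of the two functions above (I read minThreshold and maxThreshold as 23 MB and 1000 MB in the 1.17 tree; treat the exact values as an assumption): a pod that needs a 500 MB image already pulled on 2 of 10 nodes contributes sumScores of roughly 500 MB * 2/10 = 100 MB, and calculatePriority maps that linearly onto the 0-100 range:

package main

import "fmt"

const (
	mb           int64 = 1024 * 1024
	minThreshold       = 23 * mb   // assumed 1.17 value
	maxThreshold       = 1000 * mb // assumed 1.17 value
	maxNodeScore int64 = 100       // framework.MaxNodeScore
)

// calculatePriority has the same shape as the image_locality function above.
func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}
	return int(maxNodeScore * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

func main() {
	// 500 MB image, already present on 2 of 10 nodes -> scaled sum of 100 MB.
	sum := int64(float64(500*mb) * (2.0 / 10.0))
	fmt.Println(calculatePriority(sum)) // prints 7 (out of 100)
}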

Now let's look at how MostRequestedPriority scores nodes:

package priorities

import framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"

var (
	mostRequestedRatioResources = DefaultRequestedRatioResources
	mostResourcePriority        = &ResourceAllocationPriority{"MostResourceAllocation", mostResourceScorer, mostRequestedRatioResources}

	// MostRequestedPriorityMap is a priority function that favors nodes with most requested resources.
	// It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
	// based on the maximum of the average of the fraction of requested to capacity.
	// Details: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
	MostRequestedPriorityMap = mostResourcePriority.PriorityMap
)

func mostResourceScorer(requested, allocable ResourceToValueMap, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64 {
	var nodeScore, weightSum int64
	for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)

}

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUsedScore.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		return 0
	}

	return (requested * framework.MaxNodeScore) / capacity
}

For the workloads already placed on a node, the scorer keeps two numbers per resource: how much the pods have requested and how much the node can actually allocate.

resourceToWeightMap ResourceToWeightMap // maps each resource name to its weight

Only CPU and memory are considered here:

var DefaultRequestedRatioResources = ResourceToWeightMap{v1.ResourceMemory: 1, v1.ResourceCPU: 1}

For volumes (this is where PVs come in), the volume counts can also feed into the score when the pod uses volumes, gated by the features.BalanceAttachedNodeVolumes feature gate:

if len(pod.Spec.Volumes) >= 0 && utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes) && nodeInfo.TransientInfo != nil {
		score = r.scorer(requested, allocatable, true, nodeInfo.TransientInfo.TransNodeInfo.RequestedVolumes, nodeInfo.TransientInfo.TransNodeInfo.AllocatableVolumesCount)
	} else {
		score = r.scorer(requested, allocatable, false, 0, 0)
	}

type ResourceToWeightMap map[v1.ResourceName]int64

The allocatable and requested amounts for each resource are collected into allocatable[resource] and requested[resource]:

requested := make(ResourceToValueMap, len(r.resourceToWeightMap))
	allocatable := make(ResourceToValueMap, len(r.resourceToWeightMap))
	for resource := range r.resourceToWeightMap {
		allocatable[resource], requested[resource] = calculateResourceAllocatableRequest(nodeInfo, pod, resource)
	}

Then the score is computed from those requested amounts and the per-resource weights:

for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)

In short, the more resources the pods already placed on a node have requested, the higher its mostRequestedScore, so this policy packs workloads onto the busiest nodes first. A quick worked example:
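With the default weights (CPU 1, memory 1) and made-up capacities and requests: a node with 4000m CPU and 8 GiB of memory whose scheduled pods already request 3000m CPU and 4 GiB of memory ends up with (75 + 50) / 2 = 62, while an idle node scores 0, so the busier node wins under this policy.

package main

import "fmt"

const maxNodeScore = 100 // framework.MaxNodeScore

// mostRequestedScore: the same formula as in the snippet above.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return requested * maxNodeScore / capacity
}

func main() {
	cpu := mostRequestedScore(3000, 4000)   // 75
	mem := mostRequestedScore(4<<30, 8<<30) // 50
	fmt.Println((cpu*1 + mem*1) / (1 + 1))  // 62: weighted average with weights 1 and 1
}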
Other priority policies will be covered in a follow-up post.

Finally, the priority phase combines the individual scores into a total for each node, and the node with the highest total wins this scheduling round.

Advanced scheduling: affinity and anti-affinity
When we want a pod to land on specific nodes, we can use the advanced scheduling mechanisms:
Node selectors: nodeSelector, nodeName
Node affinity: nodeAffinity
Pod affinity: podAffinity
Pod anti-affinity: podAntiAffinity
nodeAffinity
kubectl explain pod.spec.affinity.nodeAffinity

requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the rule must be satisfied.
matchExpressions: match expressions; for example, a pod that defines key zone, operator In, and values foo and bar will only be scheduled onto nodes whose zone label is foo or bar.
matchFields: like matchExpressions, but matches node fields (e.g. metadata.name) rather than label values.
preferredDuringSchedulingIgnoredDuringExecution: soft affinity; satisfied if possible, not required.
preference: the preferred node selector term.
weight: a weight in the range 1-100; for every node that meets all the scheduling requirements, the scheduler iterates over these terms and adds the weight to the node's score whenever the node matches the corresponding term.
The supported operators are In, NotIn, Exists, DoesNotExist, Gt and Lt; NotIn and DoesNotExist can be used to express node anti-affinity. A sketch of these fields in Go follows.
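For reference, this is roughly how the zone In [foo, bar] example looks when built with the k8s.io/api/core/v1 types (a sketch; the label keys, values and the weight of 50 are placeholders):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	affinity := &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			// Hard requirement: only nodes whose "zone" label is foo or bar qualify.
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "zone",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"foo", "bar"},
					}},
				}},
			},
			// Soft preference: among qualifying nodes, prefer those with an "ssd" label.
			PreferredDuringSchedulingIgnoredDuringExecution: []v1.PreferredSchedulingTerm{{
				Weight: 50, // 1-100, added to the node's score when the term matches
				Preference: v1.NodeSelectorTerm{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "ssd",
						Operator: v1.NodeSelectorOpExists,
					}},
				},
			}},
		},
	}
	fmt.Printf("%+v\n", affinity.NodeAffinity)
}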
podAffinity
A typical pod-affinity scenario: the nodes of a cluster are spread across different zones or machine rooms, and service A and service B are required to run in the same zone or room; that is where pod affinity scheduling comes in.

kubectl explain pod.spec.affinity.podAffinity works the same way as nodeAffinity: there is hard affinity and soft affinity.

Hard affinity fields (a sketch in Go follows this list):

labelSelector selects which group of pods to be affine with
namespaces selects which namespaces to look in
topologyKey specifies which node key defines the topology domain
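A sketch of these three fields built with the core/v1 types (the app=cache label, the prod namespace and the topology key are just examples):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	affinity := &v1.Affinity{
		PodAffinity: &v1.PodAffinity{
			// Hard requirement: land in the same topology domain (here: the same node,
			// via kubernetes.io/hostname) as a pod labeled app=cache in namespace "prod".
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "cache"},
				},
				Namespaces:  []string{"prod"},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
	fmt.Printf("%+v\n", affinity.PodAffinity)
}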
Taint and toleration scheduling (Taints and Tolerations)
The previous two mechanisms are the pod choosing its node (or its neighbouring pods); taint scheduling is the node choosing its pods. A taint is a key/value attribute defined on a node, and its main purpose is to let the node reject pods, i.e. pods that do not satisfy the node's rules. Taints and tolerations work together to keep pods off unsuitable nodes: one or more taints can be applied to a node, and pods that cannot tolerate those taints will not be accepted by that node.

Taint
A taint is an attribute of a node; let's look at how taints are defined:

kubectl explain node.spec.taints (a list of objects)

key defines the taint key
value defines the taint value
effect describes what happens to pods that cannot tolerate this taint. There are three effects: NoSchedule only affects scheduling and leaves existing pods alone; PreferNoSchedule means the system tries to avoid placing intolerant pods on the node but does not guarantee it, i.e. a soft NoSchedule; NoExecute affects both scheduling and existing pods, and pods that do not tolerate the taint are evicted. A Go sketch of a taint and a matching toleration follows.
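Expressed with the core/v1 types, a taint on a node and a matching toleration on a pod look roughly like this (the dedicated=gpu key/value pair is a placeholder):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Taint set on the node (node.spec.taints): reject pods that cannot tolerate dedicated=gpu.
	taint := v1.Taint{
		Key:    "dedicated",
		Value:  "gpu",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Matching toleration declared on the pod (pod.spec.tolerations).
	toleration := v1.Toleration{
		Key:      "dedicated",
		Operator: v1.TolerationOpEqual,
		Value:    "gpu",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Println(toleration.ToleratesTaint(&taint)) // true: the pod may land on the tainted node
}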

Source code analysis of the affinity plugins will continue in a later post.
My knowledge is limited and mistakes are hard to avoid; if you find one, please leave a comment.
