Analysis of pod scheduling in the k8s scheduler
Based on the Kubernetes 1.9 release source code.
Source tree layout:
plugin/cmd/kube-scheduler
├── app
│   ├── BUILD
│   ├── server.go // scheduler initialization and startup
├── BUILD
├── OWNERS
├── scheduler.go // scheduler main function
plugin/pkg
├── plugin/pkg/admission
├── plugin/pkg/auth // authentication
├── plugin/pkg/scheduler // core scheduling logic: predicate/priority algorithms, metrics, etc.
├── plugin/pkg/scheduler/algorithm // predicate and priority algorithms
├── ...
├── plugin/pkg/scheduler/schedulercache // scheduler cache backing the scheduling logic
├── plugin/pkg/scheduler/metrics // metrics
├── BUILD
├── testutil.go
├── OWNERS
├── scheduler.go // entry point of the scheduler logic; scheduleOne lives here
├── scheduler_test.go
In app/server.go, look at NewSchedulerCommand(). Its Run function obtains the scheduler configuration via SchedulerConfig; the registered informers cover node, persistentVolume, persistentVolumeClaim, replicationController, replicaSet, service, and so on. They watch adds, updates, and deletes of those resources and keep the NodeInfo entries in the scheduler cache up to date. The scheduling policies are initialized from the algorithm directory at the same time.
Once initialization completes, leader election runs; when the current instance is the leader, it enters the main pod-scheduling logic, sched.Run().
After the caches have synced, it runs bindVolumesWorker and scheduleOne forever. bindVolumesWorker binds PVs and PVCs for pods; scheduleOne takes one pod at a time from podQueue and finds a suitable node for each pod that has no assigned node. scheduleOne runs serially. The rest of this post walks through how the scheduler assigns a node to a pod.
Reading through scheduleOne, the main calls are:
NextPod() // take one pod from podQueue
schedule() // run the scheduling algorithm and return the best node
metrics.SchedulingAlgorithmLatency.Observe // record scheduling latency
// if schedule() cannot find a suitable node, preempt() is called and the pod is retried in a later cycle
assumeAndBindVolumes // bind PVs and PVCs
assume // set the pod's nodeName to the selected node
bind // write the assumed pod back to the apiserver
metrics.E2eSchedulingLatency.Observe // scheduling algorithm + binding latency
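The call sequence above can be condensed into a toy scheduleOne. Everything here (the Pod and scheduler types, the pick callback) is a hypothetical simplification that only mirrors the control flow, not the real kube-scheduler implementation:

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-ins for the real scheduler types.
type Pod struct{ Name, NodeName string }

type scheduler struct {
	podQueue []*Pod
}

// scheduleOne mirrors the flow described above: take one pod, pick a node,
// assume the binding locally, then bind it.
func (s *scheduler) scheduleOne(pick func(*Pod) (string, error)) {
	if len(s.podQueue) == 0 {
		return
	}
	pod := s.podQueue[0] // NextPod()
	s.podQueue = s.podQueue[1:]

	node, err := pick(pod) // schedule(): predicates + priorities
	if err != nil {
		// On failure the real scheduler records the error and may preempt.
		fmt.Printf("no node fits %s, will retry later\n", pod.Name)
		return
	}
	pod.NodeName = node                            // assume(): set NodeName in the local cache
	fmt.Printf("bound %s to %s\n", pod.Name, node) // bind(): POST the binding to the apiserver
}

func main() {
	s := &scheduler{podQueue: []*Pod{{Name: "web-1"}}}
	s.scheduleOne(func(p *Pod) (string, error) { return "node-a", nil })
}
```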
schedule() calls Schedule() to pick the best node. Schedule() works as follows.
1. UpdateNodeNameToInfoMap refreshes the map from the node cache: nodes that have been removed are deleted from the map, and nodes missing from the map get their info added. This per-node information feeds the later scheduling decisions; for each node it includes:
the node's resource information;
the sum of pod requests and the allocatable resources on the node, including CPU, memory, GPU, the maximum number of pods, storage, and so on;
memory and disk pressure conditions;
ports already in use on the node;
pod affinity;
node taints and tolerations;
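As a mental model, the cached per-node information can be pictured as a struct like the one below. This is a hypothetical, heavily trimmed sketch; the real schedulercache.NodeInfo carries more fields and accessor methods:

```go
package main

import "fmt"

// Resource is a trimmed-down resource vector (illustrative only).
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// NodeInfo sketches the per-node cache entry described above.
type NodeInfo struct {
	Requested      Resource     // sum of requests of pods already on the node
	Allocatable    Resource     // resources the node can hand out to pods
	UsedPorts      map[int]bool // host ports already taken on the node
	MemoryPressure bool
	DiskPressure   bool
}

// fitsRequest is an illustrative helper: does the node have room for req?
func (n *NodeInfo) fitsRequest(req Resource) bool {
	return n.Requested.MilliCPU+req.MilliCPU <= n.Allocatable.MilliCPU &&
		n.Requested.Memory+req.Memory <= n.Allocatable.Memory
}

func main() {
	node := &NodeInfo{
		Requested:   Resource{MilliCPU: 3000, Memory: 4 << 30},
		Allocatable: Resource{MilliCPU: 4000, Memory: 8 << 30},
		UsedPorts:   map[int]bool{8080: true},
	}
	fmt.Println(node.fitsRequest(Resource{MilliCPU: 500, Memory: 1 << 30}))  // true
	fmt.Println(node.fitsRequest(Resource{MilliCPU: 2000, Memory: 1 << 30})) // false
}
```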
2. findNodesThatFit is the predicate (filtering) phase: it returns the set of nodes that satisfy the configured Predicates Policies, which then becomes the input to the priority phase. If exactly one node survives filtering, its name is returned directly; if more than one survives, the priority phase runs.
priorityMetaProducer collects the pod's tolerations whose effect is empty or PreferNoSchedule into a toleration list, and collects the selectors of the RCs, RSs, and services related to the pod into a selector list. It also sums the CPU and memory requested by the pod's containers; if a container sets no request, CPU defaults to 100 (millicores) and memory to 209715200 (bytes, i.e. 200 MB). Finally it records the pod's affinity.
PreferNoSchedule, NoSchedule, and NoExecute work as follows:
Take all the taints on a node and ignore those matched by a toleration in the pod; the remaining, unignored taints determine the effect on the pod.
(a) If at least one unignored taint has effect NoSchedule, Kubernetes will not schedule the pod onto this node.
(b) Otherwise, if at least one unignored taint has effect PreferNoSchedule, Kubernetes will try not to schedule the pod onto this node.
(c) If at least one unignored taint has effect NoExecute, Kubernetes evicts the pod immediately if it is already running on the node, and will not schedule it there if it is not.
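The three rules above can be sketched as a small decision function. The Taint/Toleration types and effectOnPod below are hypothetical simplifications; the scheduler's real matching also handles toleration operators such as Exists:

```go
package main

import "fmt"

// Minimal illustrative taint/toleration types (the real ones live in k8s.io/api/core/v1).
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// tolerates: a toleration matches a taint by key/value; an empty effect tolerates any effect.
func tolerates(tol Toleration, t Taint) bool {
	return tol.Key == t.Key && tol.Value == t.Value &&
		(tol.Effect == "" || tol.Effect == t.Effect)
}

// effectOnPod implements rules (a)-(c): ignore tolerated taints, then the
// strongest remaining effect decides what happens to the pod on this node.
func effectOnPod(taints []Taint, tolerations []Toleration) string {
	result := "Schedule"
	for _, t := range taints {
		ignored := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				ignored = true
				break
			}
		}
		if ignored {
			continue
		}
		switch t.Effect {
		case "NoExecute":
			return "Evict" // (c) evict if running, never schedule otherwise
		case "NoSchedule":
			result = "NoSchedule" // (a) do not schedule
		case "PreferNoSchedule":
			if result == "Schedule" {
				result = "PreferNoSchedule" // (b) try to avoid
			}
		}
	}
	return result
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(effectOnPod(taints, nil)) // NoSchedule
}
```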
3. PrioritizeNodes is the priority (scoring) phase: it scores each filtered node according to the configured Priorities Policies, picks the node with the highest score, and binds the pod to it.
findNodesThatFit works as follows.
metadataProducer returns the predicate metadata of the current pod, which includes:
the pod's resource information;
the pod's QoS class;
the pod's detailed resource requests, including CPU, memory, GPU, the maximum number of pods, storage, and so on;
the pod's host ports;
the pod's affinity.
The affinity part of this metadata is computed by processing each node with a pool of 16 concurrent workers.
checkNode is then also run with a concurrency of 16. Its core is podFitsOnNode, which checks each node's NodeInfo against the predicate functions to decide whether the pod can run on that node.
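The 16-way fan-out can be sketched with a plain worker pool. parallelizeChecks and its fits callback are illustrative stand-ins for the scheduler's parallelization helper and podFitsOnNode:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelizeChecks runs fits for every node with at most 16 workers,
// mirroring the fixed fan-out findNodesThatFit uses for checkNode.
func parallelizeChecks(nodes []string, fits func(string) bool) []string {
	const workers = 16
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		filtered []string
	)
	jobs := make(chan string)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				if fits(n) { // stand-in for podFitsOnNode
					mu.Lock()
					filtered = append(filtered, n)
					mu.Unlock()
				}
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return filtered
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	fit := parallelizeChecks(nodes, func(n string) bool { return n != "node-2" })
	fmt.Println(len(fit)) // 2
}
```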
predicateFuncs holds the predicate functions, which include the following.
(1) NoDiskConflict checks for volume conflicts on the host: if an existing pod uses the same volume, the new pod cannot be scheduled onto the node. For GCE PD, AWS EBS, Ceph RBD, and ISCSI volumes the rules are:
1. GCE PD allows multiple pods to mount the same volume only in read-only mode;
2. AWS EBS forbids different pods from mounting the same volume;
3. Ceph RBD forbids any two pods from sharing the same monitor, pool, and image;
4. ISCSI forbids any two pods from sharing the same IQN, LUN, and Target.
func NoDiskConflict(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    for _, v := range pod.Spec.Volumes {
        for _, ev := range nodeInfo.Pods() {
            if isVolumeConflict(v, ev) {
                return false, []algorithm.PredicateFailureReason{ErrDiskConflict}, nil
            }
        }
    }
    return true, nil, nil
}
(2) NoVolumeZoneConflict checks, given the zone constraints, whether the pod's volumes conflict with the node; the objects checked are PVs.
func (c *VolumeZoneChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    // If a pod doesn't have any volume attached to it, the predicate will always be true.
    // Thus we make a fast path for it, to avoid unnecessary computations in this case.
    if len(pod.Spec.Volumes) == 0 {
        return true, nil, nil
    }
    …
    namespace := pod.Namespace
    manifest := &(pod.Spec)
    for i := range manifest.Volumes {
        volume := &manifest.Volumes[i]
        if volume.PersistentVolumeClaim != nil {
            pvcName := volume.PersistentVolumeClaim.ClaimName
            if pvcName == "" {
                return false, nil, fmt.Errorf("PersistentVolumeClaim had no name")
            }
            pvc, err := c.pvcInfo.GetPersistentVolumeClaimInfo(namespace, pvcName)
            …
            pvName := pvc.Spec.VolumeName
            if pvName == "" {
                if utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
                    scName := pvc.Spec.StorageClassName
                    if scName != nil && len(*scName) > 0 {
                        class, _ := c.classInfo.GetStorageClassInfo(*scName)
                        if class != nil {
                            if class.VolumeBindingMode == nil {
                                return false, nil, fmt.Errorf("VolumeBindingMode not set for StorageClass %q", scName)
                            }
                            if *class.VolumeBindingMode == storagev1.VolumeBindingWaitForFirstConsumer {
                                // Skip unbound volumes
                                continue
                            }
                        }
                    }
                }
                return false, nil, fmt.Errorf("PersistentVolumeClaim is not bound: %q", pvcName)
            }
            pv, err := c.pvInfo.GetPersistentVolumeInfo(pvName)
            …
        }
    }
    return true, nil, nil
}
(3) CheckNodeCondition checks the node's conditions: it returns true if the node is not OutOfDisk, its status is Ready, and its network is reachable.
func CheckNodeConditionPredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    reasons := []algorithm.PredicateFailureReason{}
    if nodeInfo == nil || nodeInfo.Node() == nil {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnknownCondition}, nil
    }
    node := nodeInfo.Node()
    for _, cond := range node.Status.Conditions {
        // We consider the node for scheduling only when its:
        // - NodeReady condition status is ConditionTrue,
        // - NodeOutOfDisk condition status is ConditionFalse,
        // - NodeNetworkUnavailable condition status is ConditionFalse.
        if cond.Type == v1.NodeReady && cond.Status != v1.ConditionTrue {
            reasons = append(reasons, ErrNodeNotReady)
        } else if cond.Type == v1.NodeOutOfDisk && cond.Status != v1.ConditionFalse {
            reasons = append(reasons, ErrNodeOutOfDisk)
        } else if cond.Type == v1.NodeNetworkUnavailable && cond.Status != v1.ConditionFalse {
            reasons = append(reasons, ErrNodeNetworkUnavailable)
        }
    }
    if node.Spec.Unschedulable {
        reasons = append(reasons, ErrNodeUnschedulable)
    }
    return len(reasons) == 0, reasons, nil
}
(4) CheckNodeDiskPressure checks whether the node is under disk pressure; if so, the pod is not scheduled there.
func CheckNodeDiskPressurePredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    // check if node is under disk pressure
    if nodeInfo.DiskPressureCondition() == v1.ConditionTrue {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnderDiskPressure}, nil
    }
    return true, nil, nil
}
(5) CheckVolumeBinding checks the PVCs requested by the pod: for a bound PVC it checks whether the associated PV is compatible with the given node, and for an unbound PVC it tries to find a PV that satisfies the claim and can be bound on the node.
func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    if !utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
        return true, nil, nil
    }
    node := nodeInfo.Node()
    if node == nil {
        return false, nil, fmt.Errorf("node not found")
    }
    unboundSatisfied, boundSatisfied, err := c.binder.Binder.FindPodVolumes(pod, node.Name)
    if err != nil {
        return false, nil, err
    }
    failReasons := []algorithm.PredicateFailureReason{}
    if !boundSatisfied {
        glog.V(5).Infof("Bound PVs not satisfied for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
        failReasons = append(failReasons, ErrVolumeNodeConflict)
    }
    if !unboundSatisfied {
        glog.V(5).Infof("Couldn't find matching PVs for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
        failReasons = append(failReasons, ErrVolumeBindConflict)
    }
    if len(failReasons) > 0 {
        return false, failReasons, nil
    }
    // All volumes bound or matching PVs found for all unbound PVCs
    glog.V(5).Infof("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
    return true, nil, nil
}
(6) MatchInterPodAffinity checks whether the pod should be scheduled onto the node according to its affinity/anti-affinity rules.
func (c *PodAffinityChecker) InterPodAffinityMatches(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    node := nodeInfo.Node()
    if node == nil {
        return false, nil, fmt.Errorf("node not found")
    }
    if failedPredicates, error := c.satisfiesExistingPodsAntiAffinity(pod, meta, nodeInfo); failedPredicates != nil {
        failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
        return false, failedPredicates, error
    }
    // Now check if <pod> requirements will be satisfied on this node.
    affinity := pod.Spec.Affinity
    if affinity == nil || (affinity.PodAffinity == nil && affinity.PodAntiAffinity == nil) {
        return true, nil, nil
    }
    if failedPredicates, error := c.satisfiesPodsAffinityAntiAffinity(pod, nodeInfo, affinity); failedPredicates != nil {
        failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
        return false, failedPredicates, error
    }
    if glog.V(10) {
        // We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
        // not logged. There is visible performance gain from it.
        glog.Infof("Schedule Pod %+v on Node %+v is allowed, pod (anti)affinity constraints satisfied",
            podName(pod), node.Name)
    }
    return true, nil, nil
}
(7) MaxGCEPDVolumeCount, MaxEBSVolumeCount, and MaxAzureDiskVolumeCount check whether the number of attached volumes of the corresponding type exceeds the maximum, by default 16, 39, and 16 respectively. They count volumes used directly as well as those used indirectly through PVCs of that type; if scheduling the new pod would push the total past the configured maximum, the pod cannot be scheduled onto the node.
func NewMaxPDVolumeCountPredicate(filterName string, pvInfo PersistentVolumeInfo, pvcInfo PersistentVolumeClaimInfo) algorithm.FitPredicate {
    var filter VolumeFilter
    var maxVolumes int
    switch filterName {
    case EBSVolumeFilterType:
        filter = EBSVolumeFilter
        maxVolumes = getMaxVols(aws.DefaultMaxEBSVolumes)
    case GCEPDVolumeFilterType:
        filter = GCEPDVolumeFilter
        maxVolumes = getMaxVols(DefaultMaxGCEPDVolumes)
    case AzureDiskVolumeFilterType:
        filter = AzureDiskVolumeFilter
        maxVolumes = getMaxVols(DefaultMaxAzureDiskVolumes)
    default:
        glog.Fatalf("Wrong filterName, Only Support %v %v %v ", EBSVolumeFilterType,
            GCEPDVolumeFilterType, AzureDiskVolumeFilterType)
        return nil
    }
    c := &MaxPDVolumeCountChecker{
        filter:               filter,
        maxVolumes:           maxVolumes,
        pvInfo:               pvInfo,
        pvcInfo:              pvcInfo,
        randomVolumeIDPrefix: rand.String(32),
    }
    return c.predicate
}
(8) CheckNodeMemoryPressure checks whether the node is under memory pressure, taking the pod's QoS class into account.
func CheckNodeMemoryPressurePredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var podBestEffort bool
    if predicateMeta, ok := meta.(*predicateMetadata); ok {
        podBestEffort = predicateMeta.podBestEffort
    } else {
        // We couldn't parse metadata - fallback to computing it.
        podBestEffort = isPodBestEffort(pod)
    }
    // pod is not BestEffort pod
    if !podBestEffort {
        return true, nil, nil
    }
    // check if node is under memory pressure
    if nodeInfo.MemoryPressureCondition() == v1.ConditionTrue {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnderMemoryPressure}, nil
    }
    return true, nil, nil
}
(9) GeneralPredicates uses noncriticalPredicates to check whether the node has enough resources (CPU, memory, GPU, and so on) to run the pod. EssentialPredicates uses PodFitsHost to check whether the pod's nodeName matches the node, PodFitsHostPorts to check whether the pod's hostPorts are still free, and PodMatchNodeSelector to check whether the pod's nodeSelector matches the node's labels.
// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason
    fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
    ...
    return len(predicateFails) == 0, predicateFails, nil
}

// EssentialPredicates are the predicates that all pods, including critical pods, need
func EssentialPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason
    fit, reasons, err := PodFitsHost(pod, meta, nodeInfo)
    ...
    // TODO: PodFitsHostPorts is essential for now, but kubelet should ideally
    // preempt pods to free up host ports too
    fit, reasons, err = PodFitsHostPorts(pod, meta, nodeInfo)
    ...
    fit, reasons, err = PodMatchNodeSelector(pod, meta, nodeInfo)
    ...
    return len(predicateFails) == 0, predicateFails, nil
}
(10) PodToleratesNodeTaints checks whether the pod's tolerations match the node's taints.
func PodToleratesNodeTaints(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    return podToleratesNodeTaints(pod, nodeInfo, func(t *v1.Taint) bool {
        // PodToleratesNodeTaints is only interested in NoSchedule and NoExecute taints.
        return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
    })
}
PrioritizeNodes works as follows.
PrioritizeNodes runs each priority function against every node, multiplies each score by that priority's weight, sums the weighted scores per node, and returns the result.
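The weighted aggregation can be sketched as follows. HostPriority mirrors the shape of the real schedulerapi type, while PriorityConfig and prioritizeNodes are simplified stand-ins for the scheduler's priority plumbing:

```go
package main

import "fmt"

// HostPriority pairs a node name with its accumulated score.
type HostPriority struct {
	Host  string
	Score int
}

// PriorityConfig is an illustrative stand-in: one priority function plus its weight.
type PriorityConfig struct {
	Weight int
	Map    func(node string) int // per-node score from one priority function
}

// prioritizeNodes sums weight*score over all priority functions for each node,
// as the real PrioritizeNodes does after running each scorer.
func prioritizeNodes(nodes []string, configs []PriorityConfig) []HostPriority {
	result := make([]HostPriority, 0, len(nodes))
	for _, n := range nodes {
		total := 0
		for _, c := range configs {
			total += c.Weight * c.Map(n)
		}
		result = append(result, HostPriority{Host: n, Score: total})
	}
	return result
}

func main() {
	configs := []PriorityConfig{
		{Weight: 1, Map: func(n string) int { return len(n) }}, // toy scorer
		{Weight: 2, Map: func(n string) int { return 5 }},      // toy scorer
	}
	for _, hp := range prioritizeNodes([]string{"node-a"}, configs) {
		fmt.Printf("%s scored %d\n", hp.Host, hp.Score)
	}
}
```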
The priority algorithms are as follows.
(1) BalancedResourceAllocation
Prefers the node whose resource usage is most balanced after the pod is placed. It must be used together with LeastRequestedPriority: it computes the CPU and memory utilization fractions and ranks nodes by the difference between the two. The formula is
score = 10 - abs(cpuFraction - memoryFraction) * 10
where cpuFraction is totalRequestedCPU/totalCapacityCPU and memoryFraction is totalRequestedMemory/totalCapacityMemory.
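Plugging numbers into the formula, a minimal sketch (balancedScore is an illustrative helper, not the scheduler's calculateBalancedResourceAllocation, and the inputs are made-up request/capacity pairs):

```go
package main

import (
	"fmt"
	"math"
)

// balancedScore computes score = 10 - abs(cpuFraction - memoryFraction)*10,
// where each fraction is requested/capacity.
func balancedScore(reqCPU, capCPU, reqMem, capMem float64) int {
	cpuFraction := reqCPU / capCPU
	memFraction := reqMem / capMem
	return int(10 - math.Abs(cpuFraction-memFraction)*10)
}

func main() {
	// 50% CPU and 50% memory used: perfectly balanced, full score.
	fmt.Println(balancedScore(2000, 4000, 4096, 8192)) // 10
	// 90% CPU but only ~10% memory used: very unbalanced, low score.
	fmt.Println(balancedScore(3600, 4000, 819, 8192))
}
```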
func BalancedResourceAllocationMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var nonZeroRequest *schedulercache.Resource
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        nonZeroRequest = priorityMeta.nonZeroRequest
    } else {
        // We couldn't parse metadata - fallback to computing it.
        nonZeroRequest = getNonZeroRequests(pod)
    }
    return calculateBalancedResourceAllocation(pod, nonZeroRequest, nodeInfo)
}
(2) InterPodAffinityPriority
Iterates over the weightedPodAffinityTerm elements and, for each PodAffinityTerm the node satisfies, adds its weight to a running sum; the node with the highest sum is preferred.
(3) LeastRequestedPriority
Ranks nodes by the ratio of the pod's requested CPU and memory to the node's CPU and memory capacity: the smaller the ratio, the higher the score. The formula is
(cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
where capacity is the node's capacity and sum(requested) is the total requested by the pod's containers.
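The same formula as a sketch in integer arithmetic (leastRequestedScore is an illustrative helper and the inputs are made up):

```go
package main

import "fmt"

// leastRequestedScore evaluates the formula above:
// ((cap-req)*10/cap for CPU + the same for memory) / 2, using integer division.
func leastRequestedScore(reqCPU, capCPU, reqMem, capMem int64) int64 {
	cpuScore := (capCPU - reqCPU) * 10 / capCPU
	memScore := (capMem - reqMem) * 10 / capMem
	return (cpuScore + memScore) / 2
}

func main() {
	// Pod requests 1000m of a 4000m-CPU node and 2GiB of an 8GiB node.
	fmt.Println(leastRequestedScore(1000, 4000, 2<<30, 8<<30)) // 7
	// An idle node scores the full 10.
	fmt.Println(leastRequestedScore(0, 4000, 0, 8<<30)) // 10
}
```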
func LeastRequestedPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var nonZeroRequest *schedulercache.Resource
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        nonZeroRequest = priorityMeta.nonZeroRequest
    } else {
        // We couldn't parse metadata - fallback to computing it.
        nonZeroRequest = getNonZeroRequests(pod)
    }
    return calculateUnusedPriority(pod, nonZeroRequest, nodeInfo)
}
(4) NodeAffinityPriority
The node-affinity mechanism in Kubernetes scheduling. Node selectors constrain which nodes a pod may land on and support several operators (In, NotIn, Exists, DoesNotExist, Gt, Lt), going beyond exact matching on node labels. Kubernetes supports two kinds of selectors. The "hard" one (requiredDuringSchedulingIgnoredDuringExecution) guarantees the chosen node satisfies all of the pod's rules; it behaves like the older nodeSelector with a richer expression syntax. The "soft" one (preferredDuringSchedulingIgnoredDuringExecution) is a hint: the scheduler tries, but does not guarantee, to satisfy all of its requirements.
func CalculateNodeAffinityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    var affinity *v1.Affinity
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        affinity = priorityMeta.affinity
    } else {
        // We couldn't parse metadata - fallback to the podspec.
        affinity = pod.Spec.Affinity
    }
    var count int32
    // A nil element of PreferredDuringSchedulingIgnoredDuringExecution matches no objects.
    // An element of PreferredDuringSchedulingIgnoredDuringExecution that refers to an
    // empty PreferredSchedulingTerm matches all objects.
    if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
        // Match PreferredDuringSchedulingIgnoredDuringExecution term by term.
        for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
            preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
            if preferredSchedulingTerm.Weight == 0 {
                continue
            }
            // TODO: Avoid computing it for all nodes if this becomes a performance problem.
            nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
            if err != nil {
                return schedulerapi.HostPriority{}, err
            }
            if nodeSelector.Matches(labels.Set(node.Labels)) {
                count += preferredSchedulingTerm.Weight
            }
        }
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: int(count),
    }, nil
}
(5) NodePreferAvoidPodsPriority (weight 10000)
If the pod is not owned by a ReplicationController or a ReplicaSet, the node scores 10. Otherwise the node's scheduler.alpha.kubernetes.io/preferAvoidPods annotation is read; the node scores 0 if any avoid entry satisfies
avoid.PodSignature.PodController.Kind == controllerRef.Kind && avoid.PodSignature.PodController.UID == controllerRef.UID,
and 10 otherwise.
func CalculateNodePreferAvoidPodsPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    controllerRef := priorityutil.GetControllerRef(pod)
    if controllerRef != nil {
        // Ignore pods that are owned by other controller than ReplicationController
        // or ReplicaSet.
        if controllerRef.Kind != "ReplicationController" && controllerRef.Kind != "ReplicaSet" {
            controllerRef = nil
        }
    }
    if controllerRef == nil {
        return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
    }
    avoids, err := v1helper.GetAvoidPodsFromNodeAnnotations(node.Annotations)
    if err != nil {
        // If we cannot get annotation, assume it's schedulable there.
        return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
    }
    for i := range avoids.PreferAvoidPods {
        avoid := &avoids.PreferAvoidPods[i]
        if avoid.PodSignature.PodController.Kind == controllerRef.Kind && avoid.PodSignature.PodController.UID == controllerRef.UID {
            return schedulerapi.HostPriority{Host: node.Name, Score: 0}, nil
        }
    }
    return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
}
(6) SelectorSpreadPriority
Spreads pods belonging to the same service, RC, RS, or StatefulSet across nodes: the fewer matching pods a node already runs, the higher its score.
func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var selectors []labels.Selector
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    priorityMeta, ok := meta.(*priorityMetadata)
    if ok {
        selectors = priorityMeta.podSelectors
    } else {
        selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
    }
    if len(selectors) == 0 {
        return schedulerapi.HostPriority{
            Host:  node.Name,
            Score: int(0),
        }, nil
    }
    count := int(0)
    for _, nodePod := range nodeInfo.Pods() {
        if pod.Namespace != nodePod.Namespace {
            continue
        }
        ...
        if nodePod.DeletionTimestamp != nil {
            glog.V(4).Infof("skipping pending-deleted pod: %s/%s", nodePod.Namespace, nodePod.Name)
            continue
        }
        matches := false
        for _, selector := range selectors {
            if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
                matches = true
                break
            }
        }
        if matches {
            count++
        }
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: int(count),
    }, nil
}
(7) TaintTolerationPriority
Matches the pod's toleration list against the node's taints. The raw score counts the PreferNoSchedule taints the pod does not tolerate; during aggregation this count is inverted, so nodes with more intolerable taints end up with a lower priority. It complements the PodToleratesNodeTaints predicate, steering pods toward nodes whose taints they tolerate.
func ComputeTaintTolerationPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    // To hold all the tolerations with Effect PreferNoSchedule
    var tolerationsPreferNoSchedule []v1.Toleration
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        tolerationsPreferNoSchedule = priorityMeta.podTolerations
    } else {
        tolerationsPreferNoSchedule = getAllTolerationPreferNoSchedule(pod.Spec.Tolerations)
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: countIntolerableTaintsPreferNoSchedule(node.Spec.Taints, tolerationsPreferNoSchedule),
    }, nil
}
(8) The scheduler ships several more priority algorithms that are not enabled by default; they are not covered here.