Analysis of pod scheduling in the k8s scheduler
Based on the Kubernetes 1.9 release source code.
Source tree layout:
plugin/cmd/kube-scheduler
├── app
│   ├── BUILD
│   ├── server.go // scheduler initialization and startup
├── BUILD
├── OWNERS
├── scheduler.go // scheduler main function
plugin/pkg
├── plugin/pkg/admission
├── plugin/pkg/auth // authentication
├── plugin/pkg/scheduler // core scheduling logic: predicate/priority algorithms, metrics, etc.
├── plugin/pkg/scheduler/algorithm // predicate and priority algorithms
├── ...
├── plugin/pkg/scheduler/schedulercache // scheduler cache backing the scheduling logic
├── plugin/pkg/scheduler/metrics // metrics
├── BUILD
├── testutil.go
├── OWNERS
├── scheduler.go // entry point of the scheduler logic; scheduleOne lives here
├── scheduler_test.go
In app/server.go, look at NewSchedulerCommand(). Its Run function obtains the scheduler configuration via SchedulerConfig; the registered informers cover node, persistentVolume, persistentVolumeClaim, replicationController, replicaSet, service, and so on. They watch adds, updates, and deletes of those resources and keep the NodeInfo entries in the scheduler cache up to date. The scheduling policies are initialized from the algorithm directory at the same time.
Once initialization completes, leader election runs; when the current instance is the leader, it enters the main pod-scheduling logic, sched.Run().
After the caches have synced, it runs bindVolumesWorker and scheduleOne forever. bindVolumesWorker binds PVs and PVCs for pods; scheduleOne takes one pod at a time from podQueue and finds a suitable node for each pod that has no assigned node. scheduleOne runs serially. The rest of this post walks through how the scheduler assigns a node to a pod.
Reading through scheduleOne, the main calls are:
NextPod() // take one pod from podQueue
schedule() // run the scheduling algorithm and return the best node
metrics.SchedulingAlgorithmLatency.Observe // record scheduling latency
// if schedule() cannot find a suitable node, preempt() is called and the pod is retried in a later cycle
assumeAndBindVolumes // bind PVs and PVCs
assume // set the pod's nodeName to the selected node
bind // write the assumed pod back to the apiserver
metrics.E2eSchedulingLatency.Observe // scheduling algorithm + binding latency
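The call sequence above can be condensed into a toy scheduleOne. Everything here (the Pod and scheduler types, the pick callback) is a hypothetical simplification that only mirrors the control flow, not the real kube-scheduler implementation:

```go
package main

import "fmt"

// Hypothetical, heavily simplified stand-ins for the real scheduler types.
type Pod struct{ Name, NodeName string }

type scheduler struct {
	podQueue []*Pod
}

// scheduleOne mirrors the flow described above: take one pod, pick a node,
// assume the binding locally, then bind it.
func (s *scheduler) scheduleOne(pick func(*Pod) (string, error)) {
	if len(s.podQueue) == 0 {
		return
	}
	pod := s.podQueue[0] // NextPod()
	s.podQueue = s.podQueue[1:]

	node, err := pick(pod) // schedule(): predicates + priorities
	if err != nil {
		// On failure the real scheduler records the error and may preempt.
		fmt.Printf("no node fits %s, will retry later\n", pod.Name)
		return
	}
	pod.NodeName = node                            // assume(): set NodeName in the local cache
	fmt.Printf("bound %s to %s\n", pod.Name, node) // bind(): POST the binding to the apiserver
}

func main() {
	s := &scheduler{podQueue: []*Pod{{Name: "web-1"}}}
	s.scheduleOne(func(p *Pod) (string, error) { return "node-a", nil })
}
```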
schedule() calls Schedule() to pick the best node. Schedule() works as follows.
1. UpdateNodeNameToInfoMap refreshes the map from the node cache: nodes that have been removed are deleted from the map, and nodes missing from the map get their info added. This per-node information feeds the later scheduling decisions; for each node it includes:
the node's resource information;
the sum of pod requests and the allocatable resources on the node, including CPU, memory, GPU, the maximum number of pods, storage, and so on;
memory and disk pressure conditions;
ports already in use on the node;
pod affinity;
node taints and tolerations;
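As a mental model, the cached per-node information can be pictured as a struct like the one below. This is a hypothetical, heavily trimmed sketch; the real schedulercache.NodeInfo carries more fields and accessor methods:

```go
package main

import "fmt"

// Resource is a trimmed-down resource vector (illustrative only).
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// NodeInfo sketches the per-node cache entry described above.
type NodeInfo struct {
	Requested      Resource     // sum of requests of pods already on the node
	Allocatable    Resource     // resources the node can hand out to pods
	UsedPorts      map[int]bool // host ports already taken on the node
	MemoryPressure bool
	DiskPressure   bool
}

// fitsRequest is an illustrative helper: does the node have room for req?
func (n *NodeInfo) fitsRequest(req Resource) bool {
	return n.Requested.MilliCPU+req.MilliCPU <= n.Allocatable.MilliCPU &&
		n.Requested.Memory+req.Memory <= n.Allocatable.Memory
}

func main() {
	node := &NodeInfo{
		Requested:   Resource{MilliCPU: 3000, Memory: 4 << 30},
		Allocatable: Resource{MilliCPU: 4000, Memory: 8 << 30},
		UsedPorts:   map[int]bool{8080: true},
	}
	fmt.Println(node.fitsRequest(Resource{MilliCPU: 500, Memory: 1 << 30}))  // true
	fmt.Println(node.fitsRequest(Resource{MilliCPU: 2000, Memory: 1 << 30})) // false
}
```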
2. findNodesThatFit is the predicate (filtering) phase: it returns the set of nodes that satisfy the configured Predicates Policies, which then becomes the input to the priority phase. If exactly one node survives filtering, its name is returned directly; if more than one survives, the priority phase runs.
priorityMetaProducer collects the pod's tolerations whose effect is empty or PreferNoSchedule into a toleration list, and collects the selectors of the RCs, RSs, and services related to the pod into a selector list. It also sums the CPU and memory requested by the pod's containers; if a container sets no request, CPU defaults to 100 (millicores) and memory to 209715200 (bytes, i.e. 200 MB). Finally it records the pod's affinity.
PreferNoSchedule, NoSchedule, and NoExecute work as follows:
Take all the taints on a node and ignore those matched by a toleration in the pod; the remaining, unignored taints determine the effect on the pod.
(a) If at least one unignored taint has effect NoSchedule, Kubernetes will not schedule the pod onto this node.
(b) Otherwise, if at least one unignored taint has effect PreferNoSchedule, Kubernetes will try not to schedule the pod onto this node.
(c) If at least one unignored taint has effect NoExecute, Kubernetes evicts the pod immediately if it is already running on the node, and will not schedule it there if it is not.
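The three rules above can be sketched as a small decision function. The Taint/Toleration types and effectOnPod below are hypothetical simplifications; the scheduler's real matching also handles toleration operators such as Exists:

```go
package main

import "fmt"

// Minimal illustrative taint/toleration types (the real ones live in k8s.io/api/core/v1).
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// tolerates: a toleration matches a taint by key/value; an empty effect tolerates any effect.
func tolerates(tol Toleration, t Taint) bool {
	return tol.Key == t.Key && tol.Value == t.Value &&
		(tol.Effect == "" || tol.Effect == t.Effect)
}

// effectOnPod implements rules (a)-(c): ignore tolerated taints, then the
// strongest remaining effect decides what happens to the pod on this node.
func effectOnPod(taints []Taint, tolerations []Toleration) string {
	result := "Schedule"
	for _, t := range taints {
		ignored := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				ignored = true
				break
			}
		}
		if ignored {
			continue
		}
		switch t.Effect {
		case "NoExecute":
			return "Evict" // (c) evict if running, never schedule otherwise
		case "NoSchedule":
			result = "NoSchedule" // (a) do not schedule
		case "PreferNoSchedule":
			if result == "Schedule" {
				result = "PreferNoSchedule" // (b) try to avoid
			}
		}
	}
	return result
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(effectOnPod(taints, nil)) // NoSchedule
}
```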
3. PrioritizeNodes is the priority (scoring) phase: it scores each filtered node according to the configured Priorities Policies, picks the node with the highest score, and binds the pod to it.
findNodesThatFit works as follows.
metadataProducer returns the predicate metadata of the current pod, which includes:
the pod's resource information;
the pod's QoS class;
the pod's detailed resource requests, including CPU, memory, GPU, the maximum number of pods, storage, and so on;
the pod's host ports;
the pod's affinity.
The affinity part of this metadata is computed by processing each node with a pool of 16 concurrent workers.
checkNode is then also run with a concurrency of 16. Its core is podFitsOnNode, which checks each node's NodeInfo against the predicate functions to decide whether the pod can run on that node.
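The 16-way fan-out can be sketched with a plain worker pool. parallelizeChecks and its fits callback are illustrative stand-ins for the scheduler's parallelization helper and podFitsOnNode:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelizeChecks runs fits for every node with at most 16 workers,
// mirroring the fixed fan-out findNodesThatFit uses for checkNode.
func parallelizeChecks(nodes []string, fits func(string) bool) []string {
	const workers = 16
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		filtered []string
	)
	jobs := make(chan string)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				if fits(n) { // stand-in for podFitsOnNode
					mu.Lock()
					filtered = append(filtered, n)
					mu.Unlock()
				}
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return filtered
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	fit := parallelizeChecks(nodes, func(n string) bool { return n != "node-2" })
	fmt.Println(len(fit)) // 2
}
```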
predicateFuncs holds the predicate functions, which include the following.
(1) NoDiskConflict checks for volume conflicts on the host: if an existing pod uses the same volume, the new pod cannot be scheduled onto the node. For GCE PD, AWS EBS, Ceph RBD, and ISCSI volumes the rules are:
1. GCE PD allows multiple pods to mount the same volume only in read-only mode;
2. AWS EBS forbids different pods from mounting the same volume;
3. Ceph RBD forbids any two pods from sharing the same monitor, pool, and image;
4. ISCSI forbids any two pods from sharing the same IQN, LUN, and Target.
func NoDiskConflict(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    for _, v := range pod.Spec.Volumes {
        for _, ev := range nodeInfo.Pods() {
            if isVolumeConflict(v, ev) {
                return false, []algorithm.PredicateFailureReason{ErrDiskConflict}, nil
            }
        }
    }
    return true, nil, nil
}
(2) NoVolumeZoneConflict checks, given the zone constraints, whether the pod's volumes conflict with the node; the objects checked are PVs.
func (c *VolumeZoneChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    // If a pod doesn't have any volume attached to it, the predicate will always be true.
    // Thus we make a fast path for it, to avoid unnecessary computations in this case.
    if len(pod.Spec.Volumes) == 0 {
        return true, nil, nil
    }
    …
    namespace := pod.Namespace
    manifest := &(pod.Spec)
    for i := range manifest.Volumes {
        volume := &manifest.Volumes[i]
        if volume.PersistentVolumeClaim != nil {
            pvcName := volume.PersistentVolumeClaim.ClaimName
            if pvcName == "" {
                return false, nil, fmt.Errorf("PersistentVolumeClaim had no name")
            }
            pvc, err := c.pvcInfo.GetPersistentVolumeClaimInfo(namespace, pvcName)
            …
            pvName := pvc.Spec.VolumeName
            if pvName == "" {
                if utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
                    scName := pvc.Spec.StorageClassName
                    if scName != nil && len(*scName) > 0 {
                        class, _ := c.classInfo.GetStorageClassInfo(*scName)
                        if class != nil {
                            if class.VolumeBindingMode == nil {
                                return false, nil, fmt.Errorf("VolumeBindingMode not set for StorageClass %q", scName)
                            }
                            if *class.VolumeBindingMode == storagev1.VolumeBindingWaitForFirstConsumer {
                                // Skip unbound volumes
                                continue
                            }
                        }
                    }
                }
                return false, nil, fmt.Errorf("PersistentVolumeClaim is not bound: %q", pvcName)
            }
            pv, err := c.pvInfo.GetPersistentVolumeInfo(pvName)
            …
        }
    }
    return true, nil, nil
}
(3) CheckNodeCondition checks the node's conditions: it returns true if the node is not OutOfDisk, its status is Ready, and its network is reachable.
func CheckNodeConditionPredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    reasons := []algorithm.PredicateFailureReason{}
    if nodeInfo == nil || nodeInfo.Node() == nil {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnknownCondition}, nil
    }
    node := nodeInfo.Node()
    for _, cond := range node.Status.Conditions {
        // We consider the node for scheduling only when its:
        // - NodeReady condition status is ConditionTrue,
        // - NodeOutOfDisk condition status is ConditionFalse,
        // - NodeNetworkUnavailable condition status is ConditionFalse.
        if cond.Type == v1.NodeReady && cond.Status != v1.ConditionTrue {
            reasons = append(reasons, ErrNodeNotReady)
        } else if cond.Type == v1.NodeOutOfDisk && cond.Status != v1.ConditionFalse {
            reasons = append(reasons, ErrNodeOutOfDisk)
        } else if cond.Type == v1.NodeNetworkUnavailable && cond.Status != v1.ConditionFalse {
            reasons = append(reasons, ErrNodeNetworkUnavailable)
        }
    }
    if node.Spec.Unschedulable {
        reasons = append(reasons, ErrNodeUnschedulable)
    }
    return len(reasons) == 0, reasons, nil
}
(4) CheckNodeDiskPressure checks whether the node is under disk pressure; if so, the pod is not scheduled there.
func CheckNodeDiskPressurePredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    // check if node is under disk pressure
    if nodeInfo.DiskPressureCondition() == v1.ConditionTrue {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnderDiskPressure}, nil
    }
    return true, nil, nil
}
(5) CheckVolumeBinding checks the PVCs requested by the pod: for a bound PVC it checks whether the associated PV is compatible with the given node, and for an unbound PVC it tries to find a PV that satisfies the claim and can be bound on the node.
func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    if !utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
        return true, nil, nil
    }
    node := nodeInfo.Node()
    if node == nil {
        return false, nil, fmt.Errorf("node not found")
    }
    unboundSatisfied, boundSatisfied, err := c.binder.Binder.FindPodVolumes(pod, node.Name)
    if err != nil {
        return false, nil, err
    }
    failReasons := []algorithm.PredicateFailureReason{}
    if !boundSatisfied {
        glog.V(5).Infof("Bound PVs not satisfied for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
        failReasons = append(failReasons, ErrVolumeNodeConflict)
    }
    if !unboundSatisfied {
        glog.V(5).Infof("Couldn't find matching PVs for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
        failReasons = append(failReasons, ErrVolumeBindConflict)
    }
    if len(failReasons) > 0 {
        return false, failReasons, nil
    }
    // All volumes bound or matching PVs found for all unbound PVCs
    glog.V(5).Infof("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
    return true, nil, nil
}
(6) MatchInterPodAffinity checks whether the pod should be scheduled onto the node according to its affinity/anti-affinity rules.
func (c *PodAffinityChecker) InterPodAffinityMatches(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    node := nodeInfo.Node()
    if node == nil {
        return false, nil, fmt.Errorf("node not found")
    }
    if failedPredicates, error := c.satisfiesExistingPodsAntiAffinity(pod, meta, nodeInfo); failedPredicates != nil {
        failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
        return false, failedPredicates, error
    }
    // Now check if <pod> requirements will be satisfied on this node.
    affinity := pod.Spec.Affinity
    if affinity == nil || (affinity.PodAffinity == nil && affinity.PodAntiAffinity == nil) {
        return true, nil, nil
    }
    if failedPredicates, error := c.satisfiesPodsAffinityAntiAffinity(pod, nodeInfo, affinity); failedPredicates != nil {
        failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
        return false, failedPredicates, error
    }
    if glog.V(10) {
        // We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
        // not logged. There is visible performance gain from it.
        glog.Infof("Schedule Pod %+v on Node %+v is allowed, pod (anti)affinity constraints satisfied",
            podName(pod), node.Name)
    }
    return true, nil, nil
}
(7) MaxGCEPDVolumeCount, MaxEBSVolumeCount, and MaxAzureDiskVolumeCount check whether the number of attached volumes of the corresponding type exceeds the maximum, by default 16, 39, and 16 respectively. They count volumes used directly as well as those used indirectly through PVCs of that type; if scheduling the new pod would push the total past the configured maximum, the pod cannot be scheduled onto the node.
func NewMaxPDVolumeCountPredicate(filterName string, pvInfo PersistentVolumeInfo, pvcInfo PersistentVolumeClaimInfo) algorithm.FitPredicate {
    var filter VolumeFilter
    var maxVolumes int
    switch filterName {
    case EBSVolumeFilterType:
        filter = EBSVolumeFilter
        maxVolumes = getMaxVols(aws.DefaultMaxEBSVolumes)
    case GCEPDVolumeFilterType:
        filter = GCEPDVolumeFilter
        maxVolumes = getMaxVols(DefaultMaxGCEPDVolumes)
    case AzureDiskVolumeFilterType:
        filter = AzureDiskVolumeFilter
        maxVolumes = getMaxVols(DefaultMaxAzureDiskVolumes)
    default:
        glog.Fatalf("Wrong filterName, Only Support %v %v %v ", EBSVolumeFilterType,
            GCEPDVolumeFilterType, AzureDiskVolumeFilterType)
        return nil
    }
    c := &MaxPDVolumeCountChecker{
        filter:               filter,
        maxVolumes:           maxVolumes,
        pvInfo:               pvInfo,
        pvcInfo:              pvcInfo,
        randomVolumeIDPrefix: rand.String(32),
    }
    return c.predicate
}
(8) CheckNodeMemoryPressure checks whether the node is under memory pressure, taking the pod's QoS class into account.
func CheckNodeMemoryPressurePredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var podBestEffort bool
    if predicateMeta, ok := meta.(*predicateMetadata); ok {
        podBestEffort = predicateMeta.podBestEffort
    } else {
        // We couldn't parse metadata - fallback to computing it.
        podBestEffort = isPodBestEffort(pod)
    }
    // pod is not BestEffort pod
    if !podBestEffort {
        return true, nil, nil
    }
    // check if node is under memory pressure
    if nodeInfo.MemoryPressureCondition() == v1.ConditionTrue {
        return false, []algorithm.PredicateFailureReason{ErrNodeUnderMemoryPressure}, nil
    }
    return true, nil, nil
}
(9) GeneralPredicates uses noncriticalPredicates to check whether the node has enough resources (CPU, memory, GPU, and so on) to run the pod. EssentialPredicates uses PodFitsHost to check whether the pod's nodeName matches the node, PodFitsHostPorts to check whether the pod's hostPorts are still free, and PodMatchNodeSelector to check whether the pod's nodeSelector matches the node's labels.
// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason
    fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
    ...
    return len(predicateFails) == 0, predicateFails, nil
}

// EssentialPredicates are the predicates that all pods, including critical pods, need
func EssentialPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason
    fit, reasons, err := PodFitsHost(pod, meta, nodeInfo)
    ...
    // TODO: PodFitsHostPorts is essential for now, but kubelet should ideally
    // preempt pods to free up host ports too
    fit, reasons, err = PodFitsHostPorts(pod, meta, nodeInfo)
    ...
    fit, reasons, err = PodMatchNodeSelector(pod, meta, nodeInfo)
    ...
    return len(predicateFails) == 0, predicateFails, nil
}
(10) PodToleratesNodeTaints checks whether the pod's tolerations match the node's taints.
func PodToleratesNodeTaints(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    return podToleratesNodeTaints(pod, nodeInfo, func(t *v1.Taint) bool {
        // PodToleratesNodeTaints is only interested in NoSchedule and NoExecute taints.
        return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
    })
}
PrioritizeNodes works as follows.
PrioritizeNodes runs each priority function against every node, multiplies each score by that priority's weight, sums the weighted scores per node, and returns the result.
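The weighted aggregation can be sketched as follows. HostPriority mirrors the shape of the real schedulerapi type, while PriorityConfig and prioritizeNodes are simplified stand-ins for the scheduler's priority plumbing:

```go
package main

import "fmt"

// HostPriority pairs a node name with its accumulated score.
type HostPriority struct {
	Host  string
	Score int
}

// PriorityConfig is an illustrative stand-in: one priority function plus its weight.
type PriorityConfig struct {
	Weight int
	Map    func(node string) int // per-node score from one priority function
}

// prioritizeNodes sums weight*score over all priority functions for each node,
// as the real PrioritizeNodes does after running each scorer.
func prioritizeNodes(nodes []string, configs []PriorityConfig) []HostPriority {
	result := make([]HostPriority, 0, len(nodes))
	for _, n := range nodes {
		total := 0
		for _, c := range configs {
			total += c.Weight * c.Map(n)
		}
		result = append(result, HostPriority{Host: n, Score: total})
	}
	return result
}

func main() {
	configs := []PriorityConfig{
		{Weight: 1, Map: func(n string) int { return len(n) }}, // toy scorer
		{Weight: 2, Map: func(n string) int { return 5 }},      // toy scorer
	}
	for _, hp := range prioritizeNodes([]string{"node-a"}, configs) {
		fmt.Printf("%s scored %d\n", hp.Host, hp.Score)
	}
}
```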
The priority algorithms are as follows.
(1) BalancedResourceAllocation
Prefers the node whose resource usage is most balanced after the pod is placed. It must be used together with LeastRequestedPriority: it computes the CPU and memory utilization fractions and ranks nodes by the difference between the two. The formula is
score = 10 - abs(cpuFraction - memoryFraction) * 10
where cpuFraction is totalRequestedCPU/totalCapacityCPU and memoryFraction is totalRequestedMemory/totalCapacityMemory.
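Plugging numbers into the formula, a minimal sketch (balancedScore is an illustrative helper, not the scheduler's calculateBalancedResourceAllocation, and the inputs are made-up request/capacity pairs):

```go
package main

import (
	"fmt"
	"math"
)

// balancedScore computes score = 10 - abs(cpuFraction - memoryFraction)*10,
// where each fraction is requested/capacity.
func balancedScore(reqCPU, capCPU, reqMem, capMem float64) int {
	cpuFraction := reqCPU / capCPU
	memFraction := reqMem / capMem
	return int(10 - math.Abs(cpuFraction-memFraction)*10)
}

func main() {
	// 50% CPU and 50% memory used: perfectly balanced, full score.
	fmt.Println(balancedScore(2000, 4000, 4096, 8192)) // 10
	// 90% CPU but only ~10% memory used: very unbalanced, low score.
	fmt.Println(balancedScore(3600, 4000, 819, 8192))
}
```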
func BalancedResourceAllocationMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var nonZeroRequest *schedulercache.Resource
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        nonZeroRequest = priorityMeta.nonZeroRequest
    } else {
        // We couldn't parse metadata - fallback to computing it.
        nonZeroRequest = getNonZeroRequests(pod)
    }
    return calculateBalancedResourceAllocation(pod, nonZeroRequest, nodeInfo)
}
(2) InterPodAffinityPriority
Iterates over the weightedPodAffinityTerm elements and, for each PodAffinityTerm the node satisfies, adds its weight to a running sum; the node with the highest sum is preferred.
(3) LeastRequestedPriority
Ranks nodes by the ratio of the pod's requested CPU and memory to the node's CPU and memory capacity: the smaller the ratio, the higher the score. The formula is
(cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
where capacity is the node's capacity and sum(requested) is the total requested by the pod's containers.
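The same formula as a sketch in integer arithmetic (leastRequestedScore is an illustrative helper and the inputs are made up):

```go
package main

import "fmt"

// leastRequestedScore evaluates the formula above:
// ((cap-req)*10/cap for CPU + the same for memory) / 2, using integer division.
func leastRequestedScore(reqCPU, capCPU, reqMem, capMem int64) int64 {
	cpuScore := (capCPU - reqCPU) * 10 / capCPU
	memScore := (capMem - reqMem) * 10 / capMem
	return (cpuScore + memScore) / 2
}

func main() {
	// Pod requests 1000m of a 4000m-CPU node and 2GiB of an 8GiB node.
	fmt.Println(leastRequestedScore(1000, 4000, 2<<30, 8<<30)) // 7
	// An idle node scores the full 10.
	fmt.Println(leastRequestedScore(0, 4000, 0, 8<<30)) // 10
}
```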
func LeastRequestedPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var nonZeroRequest *schedulercache.Resource
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        nonZeroRequest = priorityMeta.nonZeroRequest
    } else {
        // We couldn't parse metadata - fallback to computing it.
        nonZeroRequest = getNonZeroRequests(pod)
    }
    return calculateUnusedPriority(pod, nonZeroRequest, nodeInfo)
}
(4) NodeAffinityPriority
The node-affinity mechanism in Kubernetes scheduling. Node selectors constrain which nodes a pod may land on and support several operators (In, NotIn, Exists, DoesNotExist, Gt, Lt), going beyond exact matching on node labels. Kubernetes supports two kinds of selectors. The "hard" one (requiredDuringSchedulingIgnoredDuringExecution) guarantees the chosen node satisfies all of the pod's rules; it behaves like the older nodeSelector with a richer expression syntax. The "soft" one (preferredDuringSchedulingIgnoredDuringExecution) is a hint: the scheduler tries, but does not guarantee, to satisfy all of its requirements.
func CalculateNodeAffinityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    var affinity *v1.Affinity
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        affinity = priorityMeta.affinity
    } else {
        // We couldn't parse metadata - fallback to the podspec.
        affinity = pod.Spec.Affinity
    }
    var count int32
    // A nil element of PreferredDuringSchedulingIgnoredDuringExecution matches no objects.
    // An element of PreferredDuringSchedulingIgnoredDuringExecution that refers to an
    // empty PreferredSchedulingTerm matches all objects.
    if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
        // Match PreferredDuringSchedulingIgnoredDuringExecution term by term.
        for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
            preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
            if preferredSchedulingTerm.Weight == 0 {
                continue
            }
            // TODO: Avoid computing it for all nodes if this becomes a performance problem.
            nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
            if err != nil {
                return schedulerapi.HostPriority{}, err
            }
            if nodeSelector.Matches(labels.Set(node.Labels)) {
                count += preferredSchedulingTerm.Weight
            }
        }
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: int(count),
    }, nil
}
(5) NodePreferAvoidPodsPriority (weight 10000)
If the pod is not owned by a ReplicationController or a ReplicaSet, the node scores 10. Otherwise the node's scheduler.alpha.kubernetes.io/preferAvoidPods annotation is read; the node scores 0 if any avoid entry satisfies
avoid.PodSignature.PodController.Kind == controllerRef.Kind && avoid.PodSignature.PodController.UID == controllerRef.UID,
and 10 otherwise.
func CalculateNodePreferAvoidPodsPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    controllerRef := priorityutil.GetControllerRef(pod)
    if controllerRef != nil {
        // Ignore pods that are owned by other controller than ReplicationController
        // or ReplicaSet.
        if controllerRef.Kind != "ReplicationController" && controllerRef.Kind != "ReplicaSet" {
            controllerRef = nil
        }
    }
    if controllerRef == nil {
        return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
    }
    avoids, err := v1helper.GetAvoidPodsFromNodeAnnotations(node.Annotations)
    if err != nil {
        // If we cannot get annotation, assume it's schedulable there.
        return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
    }
    for i := range avoids.PreferAvoidPods {
        avoid := &avoids.PreferAvoidPods[i]
        if avoid.PodSignature.PodController.Kind == controllerRef.Kind && avoid.PodSignature.PodController.UID == controllerRef.UID {
            return schedulerapi.HostPriority{Host: node.Name, Score: 0}, nil
        }
    }
    return schedulerapi.HostPriority{Host: node.Name, Score: schedulerapi.MaxPriority}, nil
}
(6) SelectorSpreadPriority
Spreads pods belonging to the same service, RC, RS, or StatefulSet across nodes: the fewer matching pods a node already runs, the higher its score.
func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    var selectors []labels.Selector
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    priorityMeta, ok := meta.(*priorityMetadata)
    if ok {
        selectors = priorityMeta.podSelectors
    } else {
        selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
    }
    if len(selectors) == 0 {
        return schedulerapi.HostPriority{
            Host:  node.Name,
            Score: int(0),
        }, nil
    }
    count := int(0)
    for _, nodePod := range nodeInfo.Pods() {
        if pod.Namespace != nodePod.Namespace {
            continue
        }
        ...
        if nodePod.DeletionTimestamp != nil {
            glog.V(4).Infof("skipping pending-deleted pod: %s/%s", nodePod.Namespace, nodePod.Name)
            continue
        }
        matches := false
        for _, selector := range selectors {
            if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
                matches = true
                break
            }
        }
        if matches {
            count++
        }
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: int(count),
    }, nil
}
(7) TaintTolerationPriority
Matches the pod's toleration list against the node's taints. The raw score counts the PreferNoSchedule taints the pod does not tolerate; during aggregation this count is inverted, so nodes with more intolerable taints end up with a lower priority. It complements the PodToleratesNodeTaints predicate, steering pods toward nodes whose taints they tolerate.
func ComputeTaintTolerationPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
    node := nodeInfo.Node()
    if node == nil {
        return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
    }
    // To hold all the tolerations with Effect PreferNoSchedule
    var tolerationsPreferNoSchedule []v1.Toleration
    if priorityMeta, ok := meta.(*priorityMetadata); ok {
        tolerationsPreferNoSchedule = priorityMeta.podTolerations
    } else {
        tolerationsPreferNoSchedule = getAllTolerationPreferNoSchedule(pod.Spec.Tolerations)
    }
    return schedulerapi.HostPriority{
        Host:  node.Name,
        Score: countIntolerableTaintsPreferNoSchedule(node.Spec.Taints, tolerationsPreferNoSchedule),
    }, nil
}
(8) The scheduler ships several more priority algorithms that are not enabled by default; they are not covered here.