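When the nodes in a cluster are configured differently, pods that serve a particular purpose often need to run on particular nodes. Kubernetes offers several ways to control where a pod is scheduled, described below. The examples assume a cluster with one master node and two worker nodes, k8s-node01 and k8s-node02. Label k8s-node01 with prod=dev and k8s-node02 with prod=sit: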

kubectl label node k8s-node01 prod=dev
kubectl label node k8s-node02 prod=sit

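I. Scheduling pods onto a fixed node

1. Scheduling a pod onto specific nodes with a label selector

A pod manifest can specify which nodes the pod may be scheduled onto, as in pod1.yaml below: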

apiVersion: v1
kind: Pod
metadata:
  name: node-selector-pod
spec:
  nodeSelector:
    prod: sit    # schedule this pod only onto nodes labeled prod=sit
  containers:
  - image: busybox
    name: busybox
    command: ["sh", "-c", "sleep 3600"]
    imagePullPolicy: IfNotPresent

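After the pod is created from this YAML file, it is scheduled onto k8s-node02 (the node labeled prod=sit). Deleting the pod and creating it again places it on k8s-node02 once more, as shown below. If no node carries the label specified in the manifest, the pod stays in the Pending state until a node with a matching label exists, and only then moves to Running.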

[root@k8s-master01 affinity_work]# kubectl get pod -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
node-selector-pod                       1/1     Running   0          10s   10.244.2.164   k8s-node02   <none>           <none>
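
nodeSelector can carry more than one key/value pair, in which case a node has to match all of them. As a minimal sketch, the manifest below adds a second label, disk=ssd. That label and the pod name are hypothetical; neither is part of the setup used in this article.

```yaml
# Sketch only: "disk: ssd" and the pod name are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: node-selector-multi
spec:
  nodeSelector:
    prod: sit    # the node must carry prod=sit ...
    disk: ssd    # ... and also disk=ssd; every listed label is required
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    imagePullPolicy: IfNotPresent
```

With only the labels set earlier, no node matches both pairs, so this pod would be expected to stay Pending. Adding the second label to k8s-node02 (kubectl label node k8s-node02 disk=ssd) should let it be scheduled there.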

2. Pinning pods to a node by node name

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-replica
spec:
  replicas: 3
  selector:              # required by apps/v1; must match the template labels
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      nodeName: k8s-node01   # all 3 replica pods are placed on k8s-node01
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent

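With nodeName: k8s-node01 set, every replica of the pod is placed on k8s-node01. Because nodeName binds the pod directly to the named node, the scheduler is bypassed for these pods.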

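II. Scheduling pods with node affinity

Originally, a label selector (nodeSelector) in the pod spec was the only way to steer a pod toward particular nodes. As pod scheduling evolved, new mechanisms were added. One of them is node affinity (nodeAffinity), covered in this section, which states what kind of nodes a pod prefers to run on. Node affinity comes in two forms:

  • requiredDuringSchedulingIgnoredDuringExecution: hard affinity. The pod must be placed on a node with the specified labels; if no such node exists, the pod stays in the Pending state.
  • preferredDuringSchedulingIgnoredDuringExecution: soft affinity. The pod prefers nodes with the specified labels; if no such node exists, it is placed on another node. Even when matching nodes exist and many pods share the preference, a few pods may still land on other nodes; otherwise all pods would sit on one node, and a failure of that node would affect every one of them. Soft affinity is therefore a preference, not a requirement.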
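
Both forms can appear in the same nodeAffinity stanza. In both names, the IgnoredDuringExecution suffix means that a pod already running on a node is left in place if the node's labels change later. The sketch below is an illustration, not a manifest from this article: the pod name affinity-combined and the weight of 50 are assumptions. It requires a node that has a prod label of any value and, among those, prefers prod=sit.

```yaml
# Sketch only: the pod name and the weight value are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: affinity-combined
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      # Hard rule: only nodes that have a "prod" label (any value) are eligible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: prod
            operator: Exists
      # Soft rule: among eligible nodes, favor those labeled prod=sit.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: prod
            operator: In
            values:
            - sit
```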

1. Using requiredDuringSchedulingIgnoredDuringExecution

Take pod2.yaml below as an example. It creates a pod and uses requiredDuringSchedulingIgnoredDuringExecution to require that the pod be scheduled onto a node labeled prod=dev.

apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  containers:
  - name: busybox
    image: busybox
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: prod    # affinity-pod must be scheduled onto a node labeled prod=dev
            operator: In
            values:
            - dev

After affinity-pod is created, checking its placement shows that it was scheduled onto k8s-node01, the node labeled prod=dev. Even after affinity-pod is deleted and created again, it is still scheduled onto k8s-node01.

[root@k8s-master01 affinity_work]# kubectl get pod -o wide
NAME                                    READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
affinity-pod                            1/1     Running   0          6s    10.244.1.190   k8s-node01   <none>           <none>
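
Besides In, matchExpressions accepts the operators NotIn, Exists, DoesNotExist, Gt, and Lt. Expressions inside one matchExpressions list must all hold (AND), while a node only needs to satisfy one of the entries under nodeSelectorTerms (OR). The fragment below sketches this; the cpu-cores label is a hypothetical node label that does not exist in the setup above.

```yaml
# Fragment of a pod spec. "cpu-cores" is a hypothetical label used only for illustration.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Term 1: prod is dev or sit AND cpu-cores is greater than 8
      - matchExpressions:
        - key: prod
          operator: In
          values:
          - dev
          - sit
        - key: cpu-cores
          operator: Gt
          values:
          - "8"        # Gt and Lt take a single integer, written as a string
      # Term 2 (ORed with term 1): the node has no prod label at all
      - matchExpressions:
        - key: prod
          operator: DoesNotExist
```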

2. Using preferredDuringSchedulingIgnoredDuringExecution

Take pod3.yaml as an example. It creates a ReplicaSet and uses preferredDuringSchedulingIgnoredDuringExecution to prefer nodes that do not carry the label prod=dev.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: node-affinity-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: prod
                operator: NotIn
                values:
                - dev

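After the ReplicaSet is created from this manifest, its pods lean toward k8s-node02. Although preferredDuringSchedulingIgnoredDuringExecution states a preference for nodes not labeled prod=dev, a small number of pods (just one in this example) still land on k8s-node01, the node labeled prod=dev.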

[root@k8s-master01 affinity_work]# kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
node-affinity-replicaset-bwm49   1/1     Running   0          12s   10.244.2.165   k8s-node02   <none>           <none>
node-affinity-replicaset-k96w6   1/1     Running   0          12s   10.244.2.166   k8s-node02   <none>           <none>
node-affinity-replicaset-www2m   1/1     Running   0          12s   10.244.1.191   k8s-node01   <none>           <none>

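With preferredDuringSchedulingIgnoredDuringExecution, the scheduler takes the configured weights (1 to 100) into account: for each node it adds up the weights of the preference terms that the node matches, so the higher a term's weight, the more strongly it pulls pods toward matching nodes. The example above has a single term with weight: 80. Several terms with different weights can also be set, as shown below with two terms; the scheduler then leans toward nodes according to how much weight they collect.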

 ..... omitted .....
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80   # higher weight
        preference:
          matchExpressions:
          - key: prod
            operator: NotIn
            values:
            - dev
      - weight: 20   # lower weight
        preference:
          matchExpressions:
          - key: prod
            operator: In
            values:
            - uat

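Note: pods were scheduled onto the higher-priority node, but some also landed on the lower-priority node. The reason is that node affinity is only one of several priority functions the scheduler uses to choose a node. Another one is SelectorSpreadPriority, which spreads pods belonging to the same ReplicaSet or Service across different nodes so that the failure of a single node does not take the whole service down.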
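
To make the weighting concrete, the sketch below wraps the two weighted terms in a complete ReplicaSet and annotates the weight each node would collect. The names weighted-affinity-demo and weighted-demo are illustrative, and the node labeled prod=uat is hypothetical since no such node exists in this cluster. The summed weight is only one input to the node's final score, which is why the result is a tendency and not a guarantee.

```yaml
# Sketch only: object names are illustrative assumptions.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: weighted-affinity-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: weighted-demo
  template:
    metadata:
      labels:
        app: weighted-demo
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80            # counted for nodes whose prod label is not dev
            preference:
              matchExpressions:
              - key: prod
                operator: NotIn
                values:
                - dev
          - weight: 20            # counted for nodes labeled prod=uat
            preference:
              matchExpressions:
              - key: prod
                operator: In
                values:
                - uat
# Summed weight of the matched terms per node:
#   k8s-node01 (prod=dev):                  0
#   k8s-node02 (prod=sit):                 80
#   a hypothetical node with prod=uat:    100  (80 + 20)
```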

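III. Scheduling pods with pod affinity

Node affinity, described above, schedules pods according to their relationship to nodes. Pod affinity, described in this section, schedules pods according to their relationship to other pods. Like node affinity, it comes in two forms: requiredDuringSchedulingIgnoredDuringExecution (hard affinity) and preferredDuringSchedulingIgnoredDuringExecution (soft affinity). Before looking at pod affinity, it helps to introduce topology domains (topologyKey).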

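1. Topology domains (topologyKey)

A topology domain is a scope: it can be a single node, a rack, a machine room, a region, and so on. In practice, domains are defined by node labels: topologyKey names a label key, and all nodes that carry the same value for that key belong to the same domain. For example, three nodes that all carry prod=dev form one topology domain. Likewise, if every node in the first rack is labeled pc=pc1 and every node in the second rack is labeled pc=pc2, the first rack is one topology domain and the second rack is another. In the figure, Node1, Node2, and Node3 belong to one topology domain, and Node4, Node5, and Node6 belong to another.
[Figure: six nodes grouped into two topology domains by node label]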
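
Any node label key can serve as a topologyKey. Besides custom labels such as prod, Kubernetes defines well-known keys such as kubernetes.io/hostname (one domain per node) and topology.kubernetes.io/zone (one domain per zone, usually set by a cloud provider). The excerpt below sketches how such labels might look on a Node object; the zone value zone-a is a hypothetical placeholder, and the hostname value is assumed to equal the node name.

```yaml
# Excerpt of a Node object; the label values are illustrative assumptions.
apiVersion: v1
kind: Node
metadata:
  name: k8s-node01
  labels:
    prod: dev                              # custom label used in this article
    kubernetes.io/hostname: k8s-node01     # well-known key: each node is its own domain
    topology.kubernetes.io/zone: zone-a    # well-known key: hypothetical zone name
```

In this article's cluster, k8s-node01 (prod=dev) and k8s-node02 (prod=sit) carry different values for prod, so with topologyKey: prod each node forms its own topology domain.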

2. Using requiredDuringSchedulingIgnoredDuringExecution

requiredDuringSchedulingIgnoredDuringExecution works the same way in pod affinity as in node affinity. In pod affinity it means that, at scheduling time, the pod must satisfy the stated relationship to pods that are already running on nodes, as shown below:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: affinity-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard affinity
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - busybox1
            topologyKey: prod

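Deploying affinity-replicaset creates 3 busybox pods, all subject to the hard affinity rule: each must be placed in a topology domain (a set of nodes sharing the same value of the prod label) that already runs a pod labeled app=busybox1. As shown below, the app=busybox1 pod runs on k8s-node02, and k8s-node02 is the only node with prod=sit, so all 3 new pods are scheduled onto k8s-node02.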

[root@k8s-master01 affinity_work]# kubectl get pod -o wide --show-labels
NAME                        READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES   LABELS
affinity-replicaset-4tdfn   1/1     Running   0          19s    10.244.2.168   k8s-node02   <none>           <none>            app=busybox
affinity-replicaset-lwx88   1/1     Running   0          19s    10.244.2.169   k8s-node02   <none>           <none>            app=busybox
affinity-replicaset-vldgc   1/1     Running   0          19s    10.244.2.170   k8s-node02   <none>           <none>            app=busybox
busybox-pod                 1/1     Running   0          2m2s   10.244.2.167   k8s-node02   <none>           <none>            app=busybox1
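
The output above includes busybox-pod, the existing pod labeled app=busybox1 that the affinity rule targets; its manifest is not shown in this article. A minimal sketch of such a pod might look as follows. Pinning it to prod=sit with nodeSelector is an assumption, made here only to reproduce its placement on k8s-node02.

```yaml
# Sketch only: the nodeSelector is an assumption about how busybox-pod reached k8s-node02.
apiVersion: v1
kind: Pod
metadata:
  name: busybox-pod
  labels:
    app: busybox1        # the label matched by the podAffinity labelSelector
spec:
  nodeSelector:
    prod: sit            # assumption: pins the pod to k8s-node02
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    imagePullPolicy: IfNotPresent
```

A pod like this has to be running, in the same namespace, before the ReplicaSet is created. With the hard rule and no matching pod anywhere, the replicas would be expected to stay Pending.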

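3. Using preferredDuringSchedulingIgnoredDuringExecution

preferredDuringSchedulingIgnoredDuringExecution works the same way in pod affinity as in node affinity: the pod prefers the indicated placement but does not require it. Modify the YAML above as follows: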

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: affinity-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft (preferred) affinity
          - weight: 20
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - busybox1
              topologyKey: prod

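The 3 pods created by affinity-replicaset are subject to the soft affinity rule: they prefer the topology domain (nodes sharing the same value of the prod label) that runs a pod labeled app=busybox1. As shown below, the app=busybox1 pod runs on k8s-node02. Of the 3 new pods, 2 were scheduled onto k8s-node02 and 1 onto k8s-node01, which reflects the preference for k8s-node02.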

[root@k8s-master01 affinity_work]# kubectl get pod -o wide --show-labels
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES   LABELS
affinity-replicaset-78jtc   1/1     Running   0          24s   10.244.1.192   k8s-node01   <none>           <none>            app=busybox
affinity-replicaset-flpwh   1/1     Running   0          24s   10.244.2.172   k8s-node02   <none>           <none>            app=busybox
affinity-replicaset-tlr2r   1/1     Running   0          24s   10.244.2.173   k8s-node02   <none>           <none>            app=busybox
busybox-pod                 1/1     Running   0          99s   10.244.2.171   k8s-node02   <none>           <none>            app=busybox1

IV. Scheduling pods with pod anti-affinity

Pod anti-affinity is written the same way as pod affinity; simply change podAffinity to podAntiAffinity in the YAML, as shown below:

...... omitted
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - busybox1
            topologyKey: prod

 ........ omitted
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 20
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - busybox1
          topologyKey: prod

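With requiredDuringSchedulingIgnoredDuringExecution, pod anti-affinity means the opposite of the affinity rule is enforced: the pod must not be placed in a topology domain that already runs a pod matching the labelSelector.
With preferredDuringSchedulingIgnoredDuringExecution, pod anti-affinity means the scheduler tries to avoid such topology domains, but a small number of pods may still be placed there.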
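
A common use of hard anti-affinity is to keep replicas of the same workload off the same node. The sketch below is an illustration, not a manifest from this article: the names spread-demo are assumptions, and it uses the well-known kubernetes.io/hostname label as the topologyKey so that each node counts as its own domain. Each replica refuses to share a node with another pod carrying its own app=spread-demo label.

```yaml
# Sketch only: object names are illustrative assumptions.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: spread-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spread-demo
  template:
    metadata:
      labels:
        app: spread-demo
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        imagePullPolicy: IfNotPresent
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - spread-demo             # the replicas repel each other
            topologyKey: kubernetes.io/hostname   # at most one replica per node
```

On the two worker nodes used here, the 2 replicas would be expected to land on k8s-node01 and k8s-node02 respectively. Assuming the master node does not accept regular pods, a third replica would stay Pending because no node without a spread-demo pod remains.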

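V. Comparison of affinity and anti-affinity scheduling policies

[Figure: comparison table of the affinity and anti-affinity scheduling policies]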
