Advanced Scheduling in Kubernetes

Overview of the Scheduling Process

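The scheduler places a Pod in three steps. First comes filtering (predicates): from all nodes, it keeps those that meet the Pod's basic requirements. Next comes scoring (priorities): it runs the priority functions to compute a score for each remaining node and compares the results. Finally, it picks one node at random from those with the highest score to run the Pod. This is the main job of the scheduler in the control plane.
In some scenarios we want to influence scheduling with our own settings, for example to run a Pod on particular nodes. Such settings act on the scheduler's filtering and scoring phases so that the outcome matches what we expect.
There are usually three kinds of such settings, commonly called advanced scheduling mechanisms:
Node selectors: nodeSelector, nodeName
Affinity scheduling: nodeAffinity, podAffinity, podAntiAffinity
Taints and tolerations: taints, tolerations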
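
The examples that follow cover nodeSelector but not nodeName. As a minimal sketch, a Pod can also be pinned to one node by name, in which case the scheduler is bypassed altogether. The Pod name here is hypothetical; the node name ubuntu31 is taken from the cluster used in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename-demo      # hypothetical name, not from the original examples
  namespace: default
spec:
  nodeName: ubuntu31           # bind directly to this node; no filtering or scoring takes place
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
```

Because no scheduling happens, nodeName is best kept for tests and debugging; nodeSelector and affinity rules are the more flexible choice.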

Example 1: nodeSelector

[root@spark32 manifests]# mkdir schedule
[root@spark32 manifests]# cd schedule/
[root@spark32 schedule]# vim pod-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-schedule-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  nodeSelector:
    disktype: ssd

[root@spark32 schedule]# kubectl get nodes --show-labels
NAME       STATUS   ROLES    AGE    VERSION   LABELS
hadoop16   Ready    <none>   40d    v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hadoop16,kubernetes.io/os=linux
spark17    Ready    <none>   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark17,kubernetes.io/os=linux
spark32    Ready    master   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark32,kubernetes.io/os=linux,node-role.kubernetes.io/master=
ubuntu31   Ready    <none>   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=ubuntu31,kubernetes.io/os=linux
[root@spark32 manifests]# kubectl get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
pod-schedule-demo   1/1     Running   0          2m19s   10.244.2.76   ubuntu31   <none>           <none>

The node ubuntu31 was labeled disktype=ssd beforehand, so the Pod is scheduled onto that node. Labels are added and removed as follows:

[root@spark32 manifests]# kubectl label node ubuntu31 disktype=ssd
node/ubuntu31 labeled
[root@spark32 manifests]# kubectl label node ubuntu31 disktype-
node/ubuntu31 labeled

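If the disktype=ssd label is removed from the node while the Pod is already running there, the Pod keeps running; it is not terminated because the label is gone.
Now change the nodeSelector value of this Pod. The Pod has to be deleted first, then the manifest is edited and applied again: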

[root@spark32 schedule]# kubectl delete -f pod-demo.yaml 
pod "pod-schedule-demo" deleted
[root@spark32 schedule]# vim pod-demo.yaml
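
The edited manifest is not shown here. Judging from the label disktype=harddisk that is applied to spark17 further down, the change is presumably limited to the nodeSelector value. A sketch of the relevant part under that assumption:

```yaml
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  nodeSelector:
    disktype: harddisk   # assumed new value; at this point no node carries this label
```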

[root@spark32 schedule]# kubectl apply -f pod-demo.yaml 
pod/pod-schedule-demo created
[root@spark32 schedule]# kubectl get pods
NAME                READY   STATUS    RESTARTS   AGE
pod-schedule-demo   0/1     Pending   0          8s

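The Pod stays in the Pending state. This shows that nodeSelector is a hard constraint: if no node satisfies it, the Pod already fails in the filtering phase.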

[root@spark32 schedule]# kubectl describe pod pod-schedule-demo
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  10s (x3 over 79s)  default-scheduler  0/4 nodes are available: 4 node(s) didn't match node selector.

As soon as one node in the cluster gets a matching label, the Pod is scheduled onto that node:

[root@spark32 schedule]# kubectl label node spark17 disktype=harddisk
node/spark17 labeled
[root@spark32 schedule]# kubectl get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP            NODE      NOMINATED NODE   READINESS GATES
pod-schedule-demo   1/1     Running   0          2m25s   10.244.1.36   spark17   <none>           <none>

[root@spark32 schedule]# kubectl delete -f pod-demo.yaml
pod "pod-schedule-demo" deleted

Example 2: nodeAffinity in pods.spec.affinity

[root@spark32 ~]# kubectl explain pods.spec.affinity
KIND:     Pod
VERSION:  v1
RESOURCE: affinity <Object>
DESCRIPTION:
     If specified, the pod's scheduling constraints
     Affinity is a group of affinity scheduling rules.
FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.
   podAffinity  <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).
   podAntiAffinity      <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).

[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity
KIND:     Pod
VERSION:  v1
RESOURCE: nodeAffinity <Object>
DESCRIPTION:
     Describes node affinity scheduling rules for the pod.
     Node affinity is a group of node affinity scheduling rules.
FIELDS:
   preferredDuringSchedulingIgnoredDuringExecution      <[]Object>
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.
   requiredDuringSchedulingIgnoredDuringExecution       <Object>
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to an update), the system may or may not try
     to eventually evict the pod from its node.

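preferredDuringSchedulingIgnoredDuringExecution <[]Object>
Soft affinity: a preference that the scheduler tries to satisfy but may ignore. The Pod is placed on a node that meets these affinity conditions whenever possible.
requiredDuringSchedulingIgnoredDuringExecution <Object>
Hard affinity: a condition that must be met at scheduling time; otherwise the Pod is not scheduled onto the node.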

[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>
DESCRIPTION:
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to an update), the system may or may not try
     to eventually evict the pod from its node.
     A node selector represents the union of the results of one or more label
     queries over a set of nodes; that is, it represents the OR of the selectors
     represented by the node selector terms.
FIELDS:
   nodeSelectorTerms    <[]Object> -required-
     Required. A list of node selector terms. The terms are ORed.
[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
KIND:     Pod
VERSION:  v1
RESOURCE: nodeSelectorTerms <[]Object>
DESCRIPTION:
     Required. A list of node selector terms. The terms are ORed.
     A null or empty node selector term matches no objects. The requirements of
     them are ANDed. The TopologySelectorTerm type implements a subset of the
     NodeSelectorTerm.
FIELDS:
   matchExpressions     <[]Object>
     A list of node selector requirements by node's labels.
   matchFields  <[]Object>
     A list of node selector requirements by node's fields.

matchExpressions <[]Object>
A list of node selector requirements by node's labels. Matches against node labels.
matchFields <[]Object>
A list of node selector requirements by node's fields. Matches against node fields.
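
The kubectl explain output above states that the nodeSelectorTerms are ORed while the requirements inside one term are ANDed. The following sketch illustrates that rule; the Pod name is hypothetical and the labels are the ones used elsewhere in this article. Besides In, node affinity expressions also accept the operators NotIn, Exists, DoesNotExist, Gt and Lt.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-terms   # hypothetical name
  namespace: default
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: both expressions must hold on the same node (AND)
        - matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
          - key: disktype
            operator: Exists       # the label only has to be present; no values are given
        # Term 2: an alternative; a node matching either term qualifies (OR)
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values: ["hadoop16"]
```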

Define an example YAML file:

[root@spark32 schedule]# vim pod-nodeaffinity.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
[root@spark32 schedule]# kubectl apply -f pod-nodeaffinity.yaml 
pod/pod-nodeaffinity-demo created

The manifest defines hard affinity. At this point no node has a zone label whose value is foo or bar, so the Pod cannot be scheduled.

[root@spark32 schedule]# kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-demo   0/1     Pending   0          5s

Pick a node in the cluster and label it zone=bar:

[root@spark32 schedule]# kubectl label node hadoop16 zone=bar
node/hadoop16 labeled
[root@spark32 schedule]# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-demo    1/1     Running   0          11m

Remove the label and delete the Pod:

[root@spark32 schedule]# kubectl label node hadoop16 zone-
node/hadoop16 labeled
[root@spark32 schedule]# kubectl delete -f pod-nodeaffinity.yaml 
pod "pod-nodeaffinity-demo" deleted

Next, define a manifest with soft affinity (even when the condition cannot be met, the scheduler still settles for some node to run the Pod):

[root@spark32 schedule]# vim pod-nodeaffinity-demo2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo2
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
        weight: 60
[root@spark32 schedule]# kubectl apply -f pod-nodeaffinity-demo2.yaml 
pod/pod-nodeaffinity-demo2 created
[root@spark32 schedule]# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-demo2   1/1     Running   0          6s
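
The manifest above has a single preference, so its weight has nothing to compete with. As an illustrative sketch (the second preference is an assumption, not part of the original example), two weighted preferences show how the per-node sum described in the kubectl explain output comes about. The weight must lie between 1 and 100.

```yaml
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 60
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
      - weight: 30                 # assumed second preference, for illustration only
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
```

A node matching both preferences gets an affinity sum of 90, a node matching only the zone preference gets 60, and a node matching neither gets 0 but remains eligible. This sum is one input to the scoring phase alongside the other priority functions.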

Delete this Pod:

[root@spark32 schedule]# kubectl delete -f pod-nodeaffinity-demo2.yaml
pod "pod-nodeaffinity-demo2" deleted

Example 3: podAffinity and podAntiAffinity in pods.spec.affinity

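Usually for the sake of efficient communication, we sometimes need to place certain Pods close to one another, for example on the same node, in the same rack, the same zone or the same region, so that Pod-to-Pod traffic is faster.
The goal is to run Pods together, or to keep them apart, and node affinity alone can achieve that. Take three Pods, N, M and P: give all three the same node label selector, and when labeling nodes make sure the nodes carrying that label sit in the same place, say the same rack. So why define Pod affinity and anti-affinity at all? Because constraining Pods through node affinity is not a good fit: it only works if the node labels are laid out very carefully, which is hard to do in practice.
A better approach is to let the scheduler place the first Pod anywhere and then schedule the second Pod relative to where the first one landed. Pod affinity does not require the same node; being nearby is enough. That means we must define what counts as the same location and what counts as a different one.
Which nodes are in the same location has to be defined explicitly. Suppose there are four healthy hosts and the first Pod runs on node 1: can nodes 2, 3 and 4 run a Pod that is affine to it? If N, M and P are affine and N is placed on node 1, must M go to node 1 as well, or is another node acceptable? Forcing N, M and P onto a single node is not ideal either, since that node may run short of resources. We might instead decide that N, M and P may each run on a different node as long as the three nodes are in the same cabinet, and treat that as satisfying the affinity. So Pod affinity always needs a criterion for deciding whether two Pods are in the same location. If the criterion is the node name, only identical node names count as the same location, so the four nodes are four different locations. In that case, if N runs on node 1, M and P will be scheduled onto node 1 as well.
Now change the criterion: nodes with the same value of the rack label are in the same location. Say two nodes carry rack=rack1 and therefore form one location, while the other two carry rack=rack2. If the first Pod runs on node 1, which is a rack=rack1 node, then M and P may run on either node 1 or node 2.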

podAffinity

Pod affinity also comes in a hard and a soft form.

[root@spark32 manifests]# kubectl explain pods.spec.affinity.podAffinity
KIND:     Pod
VERSION:  v1
RESOURCE: podAffinity <Object>
DESCRIPTION:
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).
     Pod affinity is a group of inter pod affinity scheduling rules.
FIELDS:
   preferredDuringSchedulingIgnoredDuringExecution      <[]Object>
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node has pods
     which matches the corresponding podAffinityTerm; the node(s) with the
     highest sum are the most preferred.
   requiredDuringSchedulingIgnoredDuringExecution       <[]Object>
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to a pod label update), the system may or
     may not try to eventually evict the pod from its node. When there are
     multiple elements, the lists of nodes corresponding to each podAffinityTerm
     are intersected, i.e. all terms must be satisfied.

[root@spark32 manifests]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>
DESCRIPTION:
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to a pod label update), the system may or
     may not try to eventually evict the pod from its node. When there are
     multiple elements, the lists of nodes corresponding to each podAffinityTerm
     are intersected, i.e. all terms must be satisfied.
     Defines a set of pods (namely those matching the labelSelector relative to
     the given namespace(s)) that this pod should be co-located (affinity) or
     not co-located (anti-affinity) with, where co-located is defined as running
     on a node whose value of the label with key <topologyKey> matches that of
     any node on which a pod of the set of pods is running
FIELDS:
   labelSelector        <Object>
     A label query over a set of resources, in this case pods.
   namespaces   <[]string>
     namespaces specifies which namespaces the labelSelector applies to (matches
     against); null or empty list means "this pod's namespace"
   topologyKey  <string> -required-
     This pod should be co-located (affinity) or not co-located (anti-affinity)
     with the pods matching the labelSelector in the specified namespaces, where
     co-located is defined as running on a node whose value of the label with
     key topologyKey matches that of any node on which any of the selected pods
     is running. Empty topologyKey is not allowed.

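labelSelector <Object>
A label query that selects the set of Pods this Pod should be affine to.
namespaces <[]string>
namespaces specifies which namespaces the labelSelector applies to (matches against); null or empty list means "this pod's namespace". In other words, it states which namespace the Pods matched by labelSelector belong to; if omitted, the namespace of the Pod being created is used.
topologyKey <string> -required-
The topology key: the node label key used to decide whether two nodes count as the same location.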
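
To connect topologyKey with the rack scenario described earlier, here is a sketch of a Pod that must land in the same rack as a Pod labeled app=myapp. It assumes that the nodes carry a rack label such as rack=rack1; that label and the Pod name are hypothetical and do not exist in the cluster shown in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-rack     # hypothetical name
  namespace: default
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: rack        # assumed node label; nodes with equal rack values are one location
```

With this rule, any node whose rack value equals that of the node running the app=myapp Pod is acceptable, not only that node itself.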

Next, define two Pods: the first serves as the reference and the second follows it.
Every node has a label named kubernetes.io/hostname.

[root@spark32 manifests]# kubectl get nodes --show-labels
NAME       STATUS   ROLES    AGE    VERSION   LABELS
hadoop16   Ready    <none>   40d    v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hadoop16,kubernetes.io/os=linux
spark17    Ready    <none>   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=harddisk,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark17,kubernetes.io/os=linux
spark32    Ready    master   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark32,kubernetes.io/os=linux,node-role.kubernetes.io/master=
ubuntu31   Ready    <none>   157d   v1.14.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=ubuntu31,kubernetes.io/os=linux

[root@spark32 schedule]# vim pod-podaffinity-demo.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-first
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-second
  namespace: default
  labels:
    app: backend
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          #- {key: app, operator: In, values: ["myapp"]}
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: kubernetes.io/hostname

[root@spark32 schedule]# kubectl apply -f pod-podaffinity-demo.yaml 
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
pod-podaffinity-first    1/1     Running   0          7s    10.244.2.77   ubuntu31   <none>           <none>
pod-podaffinity-second   1/1     Running   0          6s    10.244.2.78   ubuntu31   <none>           <none>

Both Pods run on the same node, ubuntu31.
Delete the two Pods:

[root@spark32 schedule]# kubectl delete -f pod-podaffinity-demo.yaml 
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

podAntiAffinity

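The difference between podAffinity and podAntiAffinity: with anti-affinity, the Pod must not be placed on a node whose topologyKey label value equals that of a node already running a matching Pod.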

[root@spark32 schedule]# vim pod-podantiaffinity-demo.yaml
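
The contents of pod-podantiaffinity-demo.yaml are not shown. Since the Pod names in the output are unchanged, the file is presumably the earlier pod-podaffinity-demo.yaml with podAffinity swapped for podAntiAffinity on the second Pod. A sketch of that second Pod under this assumption:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-second
  namespace: default
  labels:
    app: backend
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    podAntiAffinity:               # assumed: the only change from the podAffinity example
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: kubernetes.io/hostname
```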

[root@spark32 schedule]# kubectl apply -f pod-podantiaffinity-demo.yaml
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
pod-podaffinity-first    1/1     Running   0          7s    10.244.2.79   ubuntu31   <none>           <none>
pod-podaffinity-second   1/1     Running   0          7s    10.244.1.40   spark17    <none>           <none>

The two Pods are guaranteed not to run on the same node.

[root@spark32 schedule]# kubectl delete -f pod-podantiaffinity-demo.yaml
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

Give all three worker nodes in the cluster the same label, zone=foo (spark32 is the master node):

[root@spark32 schedule]# kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
hadoop16   Ready    <none>   40d    v1.14.1
spark17    Ready    <none>   157d   v1.14.1
spark32    Ready    master   157d   v1.14.1
ubuntu31   Ready    <none>   157d   v1.14.1
[root@spark32 schedule]# kubectl label node spark17 zone=foo
node/spark17 labeled
[root@spark32 schedule]# kubectl label node ubuntu31 zone=foo
node/ubuntu31 labeled
[root@spark32 schedule]# kubectl label node hadoop16 zone=foo
node/hadoop16 labeled

Change the value of topologyKey in pod-podantiaffinity-demo.yaml:
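
The edit itself is not shown. Given that the three worker nodes were just labeled zone=foo, the new topologyKey is presumably zone. A sketch of the changed fragment under that assumption:

```yaml
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: zone   # assumed new value; previously kubernetes.io/hostname
```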

[root@spark32 schedule]# kubectl apply -f pod-podantiaffinity-demo.yaml 
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide                                    
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
pod-podaffinity-first    1/1     Running   0          2s    10.244.2.81   ubuntu31   <none>           <none>
pod-podaffinity-second   0/1     Pending   0          2s    <none>        <none>     <none>           <none>

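The first Pod is running while the second stays Pending. All three worker nodes share zone=foo and therefore count as one location, so the anti-affinity rule leaves no eligible node for the second Pod.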

[root@spark32 schedule]# kubectl delete -f pod-podantiaffinity-demo.yaml     
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

Example 4: Taint-based scheduling

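A taint is key-value data defined on a node. There are three kinds of key-value data: labels, annotations and taints. Taints apply only to nodes, whereas labels and annotations can be attached to any resource.
Taints give the node a say in placement: once a node is tainted, a Pod that does not tolerate the taint cannot run there. The matching tolerations are defined on the Pod; tolerations are the third kind of key-value data on a Pod object.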

[root@spark32 schedule]# kubectl explain nodes
[root@spark32 schedule]# kubectl explain nodes.spec
[root@spark32 schedule]# kubectl explain nodes.spec.taints
KIND:     Node
VERSION:  v1
RESOURCE: taints <[]Object>
DESCRIPTION:
     If specified, the node's taints.
     The node this Taint is attached to has the "effect" on any pod that does
     not tolerate the Taint.
FIELDS:
   effect       <string> -required-
     Required. The effect of the taint on pods that do not tolerate the taint.
     Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
   key  <string> -required-
     Required. The taint key to be applied to a node.
   timeAdded    <string>
     TimeAdded represents the time at which the taint was added. It is only
     written for NoExecute taints.
   value        <string>
     Required. The taint value corresponding to the taint key.

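effect -required-
What happens to a Pod that cannot tolerate the taint. The effect defines how strongly the taint repels Pods:
NoSchedule: affects scheduling only; Pods already on the node are left alone.
NoExecute: affects scheduling and also Pods already on the node; Pods that do not tolerate the taint are evicted.
PreferNoSchedule: the soft form of NoSchedule. The scheduler tries to avoid the node, but a Pod that does not tolerate the taint may still be placed there if it has nowhere else to run.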

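A toleration on a Pod supports two kinds of match: equality and existence. An equality match requires key, value and effect to match exactly. An existence match requires key and effect to match while value is left empty, so only the presence of the taint is checked. A node can carry several taints and a Pod several tolerations; matching them follows the logic below.
Suppose a Pod defines three tolerations and a node has two taints. Can the Pod necessarily run on that node? No: the Pod may tolerate one taint and not the other. The node's taints are checked one by one, and in principle each of them has to be tolerated by the Pod. A taint matched by one of the Pod's tolerations passes, and the next one is checked. For a taint that no toleration matches, the effect decides: with PreferNoSchedule the Pod can in fact still run on the node, but with NoSchedule or NoExecute it cannot be scheduled there.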
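
The two kinds of match can be written as follows. This is a sketch only: the Pod name is hypothetical, and the key node-type with value production anticipates the taints applied later in this example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration-demo      # hypothetical name
  namespace: default
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  tolerations:
  # Equality match: key, value and effect must all equal the taint's
  - key: node-type
    operator: Equal
    value: production
    effect: NoSchedule
  # Existence match: key and effect must match; value is left out
  - key: node-type
    operator: Exists
    effect: NoExecute
```

A node tainted with both node-type=production:NoSchedule and node-type=production:NoExecute is acceptable to this Pod because each taint is covered by one of the two tolerations.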
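In the earlier examples no Pod was ever scheduled onto the master node. The master carries a taint by default, and none of the Pods we defined had a toleration matching it. The master exists to run the control-plane components; looking at those Pods, they must define tolerations that match this taint.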

The taint on the master node has an empty value, so it is matched by existence only.
Look at the tolerations defined in the api-server Pod:

[root@spark32 manifests]# kubectl get pod kube-apiserver-spark32 -n kube-system -o yaml

Managing node taints:

[root@spark32 schedule]# kubectl taint --help
Update the taints on one or more nodes.
  *  A taint consists of a key, value, and effect. As an argument here, it is expressed as key=value:effect.
  *  The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up to
253 characters.
  *  Optionally, the key can begin with a DNS subdomain prefix and a single '/', like example.com/my-app
  *  The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up
to  63 characters.
  *  The effect must be NoSchedule, PreferNoSchedule or NoExecute.
  *  Currently taint can only apply to node.
Examples:
  # Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoSchedule'.
  # If a taint with that key and effect already exists, its value is replaced as specified.
  kubectl taint nodes foo dedicated=special-user:NoSchedule
  
  # Remove from node 'foo' the taint with key 'dedicated' and effect 'NoSchedule' if one exists.
  kubectl taint nodes foo dedicated:NoSchedule-
  
  # Remove from node 'foo' all the taints with key 'dedicated'
  kubectl taint nodes foo dedicated-
  
  # Add a taint with key 'dedicated' on nodes having label mylabel=X
  kubectl taint node -l myLabel=X  dedicated=foo:PreferNoSchedule
Options:
      --all=false: Select all nodes in the cluster
      --allow-missing-template-keys=true: If true, ignore any errors in templates when a field or map key is missing in
the template. Only applies to golang and jsonpath output formats.
  -o, --output='': Output format. One of:
json|yaml|name|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-file.
      --overwrite=false: If true, allow taints to be overwritten, otherwise reject taint updates that overwrite existing
taints.
  -l, --selector='': Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)
      --template='': Template string or path to template file to use when -o=go-template, -o=go-template-file. The
template format is golang templates [http://golang.org/pkg/text/template/#pkg-overview].
      --validate=true: If true, use a schema to validate the input before sending it
Usage:
  kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]
Use "kubectl options" for a list of global command-line options (applies to all commands).

The cluster currently has three worker nodes. Taint two of them as follows:

[root@spark32 schedule]# kubectl taint node spark17 node-type=production:NoSchedule         
node/spark17 tainted
[root@spark32 schedule]# kubectl taint node ubuntu31 node-type=production:NoSchedule 
node/ubuntu31 tainted
[root@spark32 schedule]# kubectl get node ubuntu31 -o yaml
...
spec:
  podCIDR: 10.244.2.0/24
  taints:
  - effect: NoSchedule
    key: node-type
    value: production
...
[root@spark32 schedule]# kubectl describe node ubuntu31
...
Taints:             node-type=production:NoSchedule
...

Define a Deployment whose Pods have no tolerations:

[root@spark32 schedule]# vim deploy-taint-demo.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: canary
  template:
    metadata:
      labels:
        app: myapp
        release: canary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 80

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml 
deployment.apps/myapp-deploy created
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-675558bfc5-99xj9   1/1     Running   0          7s    10.244.3.18   hadoop16   <none>           <none>
myapp-deploy-675558bfc5-scqtz   1/1     Running   0          7s    10.244.3.19   hadoop16   <none>           <none>
myapp-deploy-675558bfc5-wsrdp   1/1     Running   0          7s    10.244.3.17   hadoop16   <none>           <none>

All Pods of this Deployment run on the one node that has no taint.

Now taint hadoop16, the untainted node, as well, this time with the effect NoExecute:

[root@spark32 schedule]# kubectl taint node hadoop16 node-type=production:NoExecute
node/hadoop16 tainted
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS        RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-675558bfc5-99xj9   1/1     Terminating   0          2m38s   10.244.3.18   hadoop16   <none>           <none>
myapp-deploy-675558bfc5-bl5nk   0/1     Pending       0          2s      <none>        <none>     <none>           <none>
myapp-deploy-675558bfc5-fdmq2   0/1     Pending       0          2s      <none>        <none>     <none>           <none>
myapp-deploy-675558bfc5-scqtz   1/1     Terminating   0          2m38s   10.244.3.19   hadoop16   <none>           <none>
myapp-deploy-675558bfc5-wsrdp   1/1     Terminating   0          2m38s   10.244.3.17   hadoop16   <none>           <none>
myapp-deploy-675558bfc5-xd98m   0/1     Pending       0          2s      <none>        <none>     <none>           <none>
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
myapp-deploy-675558bfc5-bl5nk   0/1     Pending   0          18s   <none>   <none>   <none>           <none>
myapp-deploy-675558bfc5-fdmq2   0/1     Pending   0          18s   <none>   <none>   <none>           <none>
myapp-deploy-675558bfc5-xd98m   0/1     Pending   0          18s   <none>   <none>   <none>           <none>

The Pods running on hadoop16 were evicted.

Next, let's see how to define tolerations on a Pod.

[root@spark32 schedule]# kubectl explain pods.spec.tolerations
KIND:     Pod
VERSION:  v1
RESOURCE: tolerations <[]Object>
DESCRIPTION:
     If specified, the pod's tolerations.
     The pod this Toleration is attached to tolerates any taint that matches the
     triple <key,value,effect> using the matching operator <operator>.
FIELDS:
   effect       <string>
     Effect indicates the taint effect to match. Empty means match all taint
     effects. When specified, allowed values are NoSchedule, PreferNoSchedule
     and NoExecute.
   key  <string>
     Key is the taint key that the toleration applies to. Empty means match all
     taint keys. If the key is empty, operator must be Exists; this combination
     means to match all values and all keys.
   operator     <string>
     Operator represents a key's relationship to the value. Valid operators are
     Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for
     value, so that a pod can tolerate all taints of a particular category.
   tolerationSeconds    <integer>
     TolerationSeconds represents the period of time the toleration (which must
     be of effect NoExecute, otherwise this field is ignored) tolerates the
     taint. By default, it is not set, which means tolerate the taint forever
     (do not evict). Zero and negative values will be treated as 0 (evict
     immediately) by the system.
   value        <string>
     Value is the taint value the toleration matches to. If the operator is
     Exists, the value should be empty, otherwise just a regular string.

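tolerationSeconds: how long the Pod may stay on the node after a NoExecute taint appears before it is evicted. The field is unset by default, which means the taint is tolerated forever and the Pod is not evicted; zero and negative values mean immediate eviction.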
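
As a small sketch of tolerationSeconds (the Pod name and the value of 3600 seconds are chosen for illustration only), the following Pod tolerates a NoExecute taint for one hour and is evicted after that:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration-seconds   # hypothetical name
  namespace: default
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  tolerations:
  - key: node-type
    operator: Equal
    value: production
    effect: NoExecute            # tolerationSeconds is only honored for NoExecute
    tolerationSeconds: 3600      # illustrative value: stay up to one hour after the taint is added
```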

Edit deploy-taint-demo.yaml and define tolerations:

[root@spark32 schedule]# vim deploy-taint-demo.yaml
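
The edited file is not shown. The Pods end up only on spark17 and ubuntu31, the two nodes tainted with node-type=production:NoSchedule, so the toleration presumably matches exactly that taint. A sketch of the Pod template under this assumption:

```yaml
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
      tolerations:
      - key: node-type
        operator: Equal
        value: production
        effect: NoSchedule   # assumed; does not cover the NoExecute taint on hadoop16
```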

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml 
deployment.apps/myapp-deploy configured
[root@spark32 schedule]# kubectl get pods -o wide    
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-6bc4494c9b-695lm   1/1     Running   0          7s    10.244.1.43   spark17    <none>           <none>
myapp-deploy-6bc4494c9b-qjg2h   1/1     Running   0          11s   10.244.2.82   ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-qsslj   1/1     Running   0          9s    10.244.1.42   spark17    <none>           <none>

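The Pods now run on spark17 and ubuntu31. Next, taint spark17 with the effect NoExecute as well. A taint is identified by its key and effect, so this adds a second taint next to the existing NoSchedule one: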

[root@spark32 schedule]# kubectl taint node spark17 node-type=production:NoExecute
node/spark17 tainted
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS              RESTARTS   AGE    IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-6bc4494c9b-695lm   1/1     Terminating         0          96s    10.244.1.43   spark17    <none>           <none>
myapp-deploy-6bc4494c9b-mtxp2   0/1     ContainerCreating   0          2s     <none>        ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-pl5j9   0/1     ContainerCreating   0          2s     <none>        ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-qjg2h   1/1     Running             0          100s   10.244.2.82   ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-qsslj   0/1     Terminating         0          98s    10.244.1.42   spark17    <none>           <none>
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-6bc4494c9b-mtxp2   1/1     Running   0          22s   10.244.2.84   ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-pl5j9   1/1     Running   0          22s   10.244.2.83   ubuntu31   <none>           <none>
myapp-deploy-6bc4494c9b-qjg2h   1/1     Running   0          2m    10.244.2.82   ubuntu31   <none>           <none>

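For the Pods to be scheduled onto spark17 and hadoop16 as well, the effect in the toleration must also match the NoExecute taints on those nodes.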
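
The change applied before the next command is not shown. One way to cover both effects, consistent with the kubectl explain text that an empty effect matches all taint effects, is to drop the effect from the toleration. The fragment below is a sketch under that assumption, not necessarily the exact edit that was made:

```yaml
spec:
  template:
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
      tolerations:
      - key: node-type
        operator: Exists   # match on the key alone, whatever the value
        # effect is omitted on purpose: an empty effect matches NoSchedule,
        # PreferNoSchedule and NoExecute taints with this key
```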

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml 
deployment.apps/myapp-deploy configured
[root@spark32 schedule]# kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
myapp-deploy-84dd787cff-nz899   1/1     Running   0          8s    10.244.1.44   spark17    <none>           <none>
myapp-deploy-84dd787cff-rcgwq   1/1     Running   0          5s    10.244.3.20   hadoop16   <none>           <none>
myapp-deploy-84dd787cff-zjsr9   1/1     Running   0          11s   10.244.2.85   ubuntu31   <none>           <none>

Remove the taints from the three nodes:

[root@spark32 manifests]# kubectl taint node spark17 node-type-
node/spark17 untainted
[root@spark32 manifests]# kubectl taint node hadoop16 node-type-
node/hadoop16 untainted
[root@spark32 manifests]# kubectl taint node ubuntu31 node-type-
node/ubuntu31 untainted