
Advanced Scheduling in Kubernetes

Overview of the scheduler's scheduling process

When scheduling a Pod, the scheduler works in three steps. First comes pre-selection (the predicate phase), which picks out the nodes that basically meet the Pod's requirements from all nodes in the cluster. Then the priority functions score each qualifying node and the scores are compared, and finally one node is chosen at random from among the highest-scoring nodes to run the Pod. This is the main job the scheduler performs in the control plane.
In some scheduling scenarios, however, we want to influence this behavior with our own presets, for example to run a Pod on certain specific nodes. By configuring such presets we can affect the scheduler's predicate and priority phases so that the scheduling result matches our expectations.
There are usually three such mechanisms, commonly called advanced scheduling settings:
Node selectors: nodeSelector, nodeName
Node affinity scheduling: nodeAffinity
Taints and tolerations: taints, tolerations

Example 1: nodeSelector

[root@spark32 manifests]# mkdir schedule
[root@spark32 manifests]# cd schedule/
[root@spark32 schedule]# vim pod-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-schedule-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  nodeSelector:
    disktype: ssd


[root@spark32 schedule]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
hadoop16 Ready <none> 40d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hadoop16,kubernetes.io/os=linux
spark17 Ready <none> 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark17,kubernetes.io/os=linux
spark32 Ready master 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark32,kubernetes.io/os=linux,node-role.kubernetes.io/master=
ubuntu31 Ready <none> 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=ubuntu31,kubernetes.io/os=linux
[root@spark32 manifests]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-schedule-demo 1/1 Running 0 2m19s 10.244.2.76 ubuntu31 <none> <none>

The node ubuntu31 in this cluster was labeled disktype=ssd earlier, so this Pod is scheduled onto that node. Labels are added and removed like this:

[root@spark32 manifests]# kubectl label node ubuntu31 disktype=ssd
node/ubuntu31 labeled
[root@spark32 manifests]# kubectl label node ubuntu31 disktype-
node/ubuntu31 labeled

When the disktype=ssd label is removed from the node, a Pod that was already running there keeps running; removing the label does not terminate it.
Now change the Pod's nodeSelector value. The Pod has to be deleted first, and after editing the manifest it is applied again:

[root@spark32 schedule]# kubectl delete -f pod-demo.yaml
pod "pod-schedule-demo" deleted
[root@spark32 schedule]# vim pod-demo.yaml
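The edited value itself is not shown in the transcript; judging by the label applied to spark17 a bit later (disktype=harddisk), the change is presumably just the nodeSelector value, for example:

  nodeSelector:
    disktype: harddisk    # no node carries this label yet, hence the Pod stays Pending below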


[root@spark32 schedule]# kubectl apply -f pod-demo.yaml
pod/pod-schedule-demo created
[root@spark32 schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-schedule-demo 0/1 Pending 0 8s

The Pod stays in the Pending state. This shows that nodeSelector is a hard constraint: it already fails in the pre-selection (predicate) phase when no node satisfies it.

[root@spark32 schedule]# kubectl describe pod pod-schedule-demo
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10s (x3 over 79s) default-scheduler 0/4 nodes are available: 4 node(s) didn't match node selector.

As soon as one of the cluster nodes is given the matching label, the Pod is immediately scheduled onto that node:

[root@spark32 schedule]# kubectl label node spark17 disktype=harddisk
node/spark17 labeled
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-schedule-demo 1/1 Running 0 2m25s 10.244.1.36 spark17 <none> <none>

[root@spark32 schedule]# kubectl delete -f pod-demo.yaml
pod "pod-schedule-demo" deleted

Example 2: nodeAffinity under pods.spec.affinity

[root@spark32 ~]# kubectl explain pods.spec.affinity
KIND: Pod
VERSION: v1
RESOURCE: affinity <Object>
DESCRIPTION:
If specified, the pod's scheduling constraints
Affinity is a group of affinity scheduling rules.
FIELDS:
nodeAffinity <Object>
Describes node affinity scheduling rules for the pod.
podAffinity <Object>
Describes pod affinity scheduling rules (e.g. co-locate this pod in the
same node, zone, etc. as some other pod(s)).
podAntiAffinity <Object>
Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
in the same node, zone, etc. as some other pod(s)).
[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity
KIND: Pod
VERSION: v1
RESOURCE: nodeAffinity <Object>
DESCRIPTION:
Describes node affinity scheduling rules for the pod.
Node affinity is a group of node affinity scheduling rules.
FIELDS:
preferredDuringSchedulingIgnoredDuringExecution <[]Object>
The scheduler will prefer to schedule pods to nodes that satisfy the
affinity expressions specified by this field, but it may choose a node that
violates one or more of the expressions. The node that is most preferred is
the one with the greatest sum of weights, i.e. for each node that meets all
of the scheduling requirements (resource request, requiredDuringScheduling
affinity expressions, etc.), compute a sum by iterating through the
elements of this field and adding "weight" to the sum if the node matches
the corresponding matchExpressions; the node(s) with the highest sum are
the most preferred.
requiredDuringSchedulingIgnoredDuringExecution <Object>
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to an update), the system may or may not try
to eventually evict the pod from its node.
  • preferredDuringSchedulingIgnoredDuringExecution <[]Object>
    Soft affinity: a preference the scheduler tries to satisfy but does not have to. The Pod is preferably placed on a node that matches the affinity terms defined here, but may run elsewhere.
  • requiredDuringSchedulingIgnoredDuringExecution
    Hard affinity: the conditions must be met. Using required behaves just like nodeSelector: if no node satisfies the affinity terms, the Pod will not run anywhere and stays Pending.
[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>
DESCRIPTION:
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to an update), the system may or may not try
to eventually evict the pod from its node.
A node selector represents the union of the results of one or more label
queries over a set of nodes; that is, it represents the OR of the selectors
represented by the node selector terms.
FIELDS:
nodeSelectorTerms <[]Object> -required-
Required. A list of node selector terms. The terms are ORed.
[root@spark32 ~]# kubectl explain pods.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
KIND: Pod
VERSION: v1
RESOURCE: nodeSelectorTerms <[]Object>
DESCRIPTION:
Required. A list of node selector terms. The terms are ORed.
A null or empty node selector term matches no objects. The requirements of
them are ANDed. The TopologySelectorTerm type implements a subset of the
NodeSelectorTerm.
FIELDS:
matchExpressions <[]Object>
A list of node selector requirements by node's labels.
matchFields <[]Object>
A list of node selector requirements by node's fields.
  • matchExpressions <[]Object>
    A list of node selector requirements by node's labels, i.e. terms that match against node labels.
  • matchFields <[]Object>
    A list of node selector requirements by node's fields, i.e. terms that match against node fields (see the sketch below).
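The examples that follow all use matchExpressions. Purely as a hypothetical illustration of matchFields (it is not used in this walkthrough), a term can match a node field such as metadata.name:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name    # matches the node object's name field
          operator: In
          values: ["ubuntu31"]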

Now define an example YAML file:

[root@spark32 schedule]# vim pod-nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
[root@spark32 schedule]# kubectl apply -f pod-nodeaffinity.yaml
pod/pod-nodeaffinity-demo created

The manifest defines hard affinity, and at the moment no node has a zone label whose value is foo or bar.

[root@spark32 schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-nodeaffinity-demo 0/1 Pending 0 5s

Pick a node in the cluster and label it zone=bar:

[root@spark32 schedule]# kubectl label node hadoop16 zone=bar
node/hadoop16 labeled
[root@spark32 schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-nodeaffinity-demo 1/1 Running 0 11m

Remove the label and delete the Pod:

[root@spark32 schedule]# kubectl label node hadoop16 zone-
node/hadoop16 labeled
[root@spark32 schedule]# kubectl delete -f pod-nodeaffinity.yaml
pod "pod-nodeaffinity-demo" deleted

Next, define a manifest that uses soft affinity (even if the stated conditions cannot be satisfied, the scheduler will still grudgingly find some node to run the Pod on):

[root@spark32 schedule]# vim pod-nodeaffinity-demo2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo2
  namespace: default
  labels:
    app: myapp
    tier: frontend
  annotations:
    wisedu.com/created-by: "cluster admin"
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["foo", "bar"]
        weight: 60
[root@spark32 schedule]# kubectl apply -f pod-nodeaffinity-demo2.yaml
pod/pod-nodeaffinity-demo2 created
[root@spark32 schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-nodeaffinity-demo2 1/1 Running 0 6s

Delete this Pod:

[root@spark32 schedule]# kubectl delete -f pod-nodeaffinity-demo2.yaml
pod "pod-nodeaffinity-demo2" deleted

Example 3: podAffinity and podAntiAffinity under pods.spec.affinity

Pod affinity usually comes from a need for efficient communication: occasionally a group of Pods should be placed close together, for example on the same node, in the same rack, in the same zone or in the same region, so that they can talk to each other more efficiently.
The main goal is to run certain Pods together, or deliberately apart, and node affinity alone can already achieve this. For example, with three Pods N, M and P, we could give all three the same node label selector and then make sure the nodes carrying that label really do sit in the same place, say the same rack. So why define Pod affinity and anti-affinity at all? Because constraining Pods through node affinity is not the better option here: it only works if we carefully plan in advance which labels go on which nodes, which makes it fairly hard to use.
The nicer approach is to let the scheduler place the first Pod wherever it likes, and then schedule the second Pod relative to wherever the first one landed. Pod affinity does not force Pods onto the very same node; being nearby is enough. That is why we have to define what counts as "the same location" and what counts as "a different location".
Which nodes count as the same location and which count as different ones has to be defined. Suppose there are four healthy nodes and the first Pod lands on node 1: how do we decide whether nodes 2, 3 and 4 may run Pods that are affine to it? Say N, M and P are affine to each other and N is placed on node 1. Must M go onto node 1 too, with every other node excluded? Pod affinity does not require the same node, and forcing N, M and P onto one node would be a poor choice if that node lacks the resources. We might instead decide that N, M and P each run on their own node, and as long as those three nodes sit in the same rack the affinity is considered satisfied. Defining Pod affinity therefore always needs a yardstick: by what standard are two Pods "in the same location" or "in different locations"? If the yardstick is the node name, these four nodes are clearly four different locations (same node name means same location, different names mean different locations), so if Pod N runs on node 1, M and P will also be scheduled onto node 1.
Change the yardstick: say nodes count as the same location when they carry the same value of the node label rack. If two nodes are labeled rack=rack1 they form one location, and the other two nodes labeled rack=rack2 form another. Now, if the first Pod runs on the first node (rack=rack1), M and P may run on either the first or the second node.
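As a hypothetical sketch of such a criterion (the node names, the rack label and the app=N label below are illustrative, not part of this cluster), the nodes would be grouped and the affinity expressed through topologyKey like this:

# illustrative only: group four hypothetical nodes into two rack "locations"
kubectl label node node1 node2 rack=rack1
kubectl label node node3 node4 rack=rack2

and then, in the Pods that should stay close to Pod N:

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["N"]          # assumes Pod N is labeled app=N
        topologyKey: rack          # nodes sharing the same rack value count as one location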

podAffinity

Pod affinity likewise comes in a hard and a soft form.

[root@spark32 manifests]# kubectl explain pods.spec.affinity.podAffinity
KIND: Pod
VERSION: v1
RESOURCE: podAffinity <Object>
DESCRIPTION:
Describes pod affinity scheduling rules (e.g. co-locate this pod in the
same node, zone, etc. as some other pod(s)).
Pod affinity is a group of inter pod affinity scheduling rules.
FIELDS:
preferredDuringSchedulingIgnoredDuringExecution <[]Object>
The scheduler will prefer to schedule pods to nodes that satisfy the
affinity expressions specified by this field, but it may choose a node that
violates one or more of the expressions. The node that is most preferred is
the one with the greatest sum of weights, i.e. for each node that meets all
of the scheduling requirements (resource request, requiredDuringScheduling
affinity expressions, etc.), compute a sum by iterating through the
elements of this field and adding "weight" to the sum if the node has pods
which matches the corresponding podAffinityTerm; the node(s) with the
highest sum are the most preferred.
requiredDuringSchedulingIgnoredDuringExecution <[]Object>
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to a pod label update), the system may or
may not try to eventually evict the pod from its node. When there are
multiple elements, the lists of nodes corresponding to each podAffinityTerm
are intersected, i.e. all terms must be satisfied.

[root@spark32 manifests]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>
DESCRIPTION:
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to a pod label update), the system may or
may not try to eventually evict the pod from its node. When there are
multiple elements, the lists of nodes corresponding to each podAffinityTerm
are intersected, i.e. all terms must be satisfied.
Defines a set of pods (namely those matching the labelSelector relative to
the given namespace(s)) that this pod should be co-located (affinity) or
not co-located (anti-affinity) with, where co-located is defined as running
on a node whose value of the label with key <topologyKey> matches that of
any node on which a pod of the set of pods is running
FIELDS:
labelSelector <Object>
A label query over a set of resources, in this case pods.
namespaces <[]string>
namespaces specifies which namespaces the labelSelector applies to (matches
against); null or empty list means "this pod's namespace"
topologyKey <string> -required-
This pod should be co-located (affinity) or not co-located (anti-affinity)
with the pods matching the labelSelector in the specified namespaces, where
co-located is defined as running on a node whose value of the label with
key topologyKey matches that of any node on which any of the selected pods
is running. Empty topologyKey is not allowed.
  • labelSelector
    A label query over a set of resources, in this case pods. Selects the group of Pods this Pod should be affine to.
  • namespaces <[]string>
    Specifies which namespaces the labelSelector is matched against; a null or empty list means the namespace of the Pod being created.
  • topologyKey <string> -required-
    The topology key used to decide what counts as the same location: nodes that share the same value for this label key are treated as one location.

Next, define two Pods: the first acts as the reference, and the second follows it.
Every node carries a label named kubernetes.io/hostname.

[root@spark32 manifests]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
hadoop16 Ready <none> 40d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hadoop16,kubernetes.io/os=linux
spark17 Ready <none> 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=harddisk,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark17,kubernetes.io/os=linux
spark32 Ready master 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=spark32,kubernetes.io/os=linux,node-role.kubernetes.io/master=
ubuntu31 Ready <none> 157d v1.14.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=ubuntu31,kubernetes.io/os=linux

[root@spark32 schedule]# vim pod-podaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-first
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-second
  namespace: default
  labels:
    app: backend
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "sleep 3600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          #- {key: app, operator: In, values: ["myapp"]}
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: kubernetes.io/hostname


[root@spark32 schedule]# kubectl apply -f pod-podaffinity-demo.yaml
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-first 1/1 Running 0 7s 10.244.2.77 ubuntu31 <none> <none>
pod-podaffinity-second 1/1 Running 0 6s 10.244.2.78 ubuntu31 <none> <none>

Both Pods run on the same node, ubuntu31.
Delete the two Pods:

[root@spark32 schedule]# kubectl delete -f pod-podaffinity-demo.yaml
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

podAntiAffinity

The difference between podAffinity and podAntiAffinity is that anti-affinity inverts the rule: the matched Pods must not share the same value of the topologyKey label, i.e. they must not end up in the same location.

[root@spark32 schedule]# vim pod-podantiaffinity-demo.yaml
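The file itself is not reproduced in the transcript. Judging by the Pod names in the output below, it is presumably the previous two-Pod manifest with podAffinity replaced by podAntiAffinity in the second Pod, roughly:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["myapp"]
        topologyKey: kubernetes.io/hostname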


[root@spark32 schedule]# kubectl apply -f pod-podantiaffinity-demo.yaml
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-first 1/1 Running 0 7s 10.244.2.79 ubuntu31 <none> <none>
pod-podaffinity-second 1/1 Running 0 7s 10.244.1.40 spark17 <none> <none>

The two Pods are guaranteed not to run on the same node.

[root@spark32 schedule]# kubectl delete -f pod-podantiaffinity-demo.yaml
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

Give all three worker nodes in the cluster the same label zone=foo (spark32 is the master node):

[root@spark32 schedule]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
hadoop16 Ready <none> 40d v1.14.1
spark17 Ready <none> 157d v1.14.1
spark32 Ready master 157d v1.14.1
ubuntu31 Ready <none> 157d v1.14.1
[root@spark32 schedule]# kubectl label node spark17 zone=foo
node/spark17 labeled
[root@spark32 schedule]# kubectl label node ubuntu31 zone=foo
node/ubuntu31 labeled
[root@spark32 schedule]# kubectl label node hadoop16 zone=foo
node/hadoop16 labeled

Change the value of topologyKey in pod-podantiaffinity-demo.yaml:
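The edited line is not shown; to make the three freshly labeled nodes count as a single location, topologyKey is presumably switched from kubernetes.io/hostname to zone, i.e.:

        topologyKey: zone    # nodes sharing the same zone value now count as one location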

[root@spark32 schedule]# kubectl apply -f pod-podantiaffinity-demo.yaml
pod/pod-podaffinity-first created
pod/pod-podaffinity-second created
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-first 1/1 Running 0 2s 10.244.2.81 ubuntu31 <none> <none>
pod-podaffinity-second 0/1 Pending 0 2s <none> <none> <none> <none>

The first Pod runs, while the second stays Pending: with topologyKey set to zone and all three worker nodes labeled zone=foo, every schedulable node counts as the same location as the first Pod, so anti-affinity rules them all out.

[root@spark32 schedule]# kubectl delete -f pod-podantiaffinity-demo.yaml
pod "pod-podaffinity-first" deleted
pod "pod-podaffinity-second" deleted

Example 4: taint-based scheduling

A taint is key/value data defined on a node. There are three kinds of key/value data: labels, annotations and taints. Taints exist only on nodes, unlike labels and annotations, which can be attached to every kind of resource.
This gives the node a say in scheduling: if a node carries taints and a Pod does not tolerate them, the Pod cannot run there. The Pod therefore needs tolerations defined on it; tolerations are the third kind of key/value data on a Pod object.

[root@spark32 schedule]# kubectl explain nodes
[root@spark32 schedule]# kubectl explain nodes.spec
[root@spark32 schedule]# kubectl explain nodes.spec.taints
KIND: Node
VERSION: v1
RESOURCE: taints <[]Object>
DESCRIPTION:
If specified, the node's taints.
The node this Taint is attached to has the "effect" on any pod that does
not tolerate the Taint.
FIELDS:
effect <string> -required-
Required. The effect of the taint on pods that do not tolerate the taint.
Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
key <string> -required-
Required. The taint key to be applied to a node.
timeAdded <string>
TimeAdded represents the time at which the taint was added. It is only
written for NoExecute taints.
value <string>
Required. The taint value corresponding to the taint key.

  • effect -required-
    What happens to a Pod that does not tolerate this taint. The taint's effect defines how it repels Pods:
  • NoSchedule: affects only the scheduling process; Pods already running on the node are left alone;
  • NoExecute: affects both scheduling and Pods that are already running; Pods that do not tolerate the taint are evicted;
  • PreferNoSchedule: a soft NoSchedule; Pods that do not tolerate the taint should preferably not be placed here, but may still run on this node if there is nowhere else for them to go.

When defining tolerations on a Pod object, two kinds of matching are supported: equality comparison and existence checking. Equality comparison requires key, value and effect to match exactly. Existence checking requires only the key and effect to match, while the value is left empty; it simply checks whether the taint key exists. A node can carry multiple taints and a Pod can carry multiple tolerations, and matching them follows the logic below.
Suppose a Pod defines three tolerations and a node carries two taints: is the Pod guaranteed to run on that node? Not necessarily; it may tolerate one taint but not the other. The node's taints are checked one by one, and every taint must be tolerated by the Pod. If a taint is matched by one of the Pod's tolerations, that taint passes and the next one is checked. If some taint is not tolerated, its effect decides the outcome: with PreferNoSchedule the Pod can in practice still run on the node, but with NoSchedule it will definitely not be scheduled there.
This is also why none of our earlier Pods was ever scheduled onto the master: the master carries a taint by default, and none of our Pods defined a toleration matching it. The master is meant to run the control-plane components, and those Pods certainly do define tolerations matching this taint.

The taint applied to the master has an empty value, i.e. it is matched purely by existence.
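This is not shown in the transcript, but it can be checked the same way the ubuntu31 taint is inspected further below; on a kubeadm cluster the master typically carries the key-only taint node-role.kubernetes.io/master:NoSchedule:

# hypothetical check, not part of the original session
kubectl describe node spark32 | grep -i taints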
Take a look at the tolerations defined in the kube-apiserver Pod:

[root@spark32 manifests]# kubectl get pod kube-apiserver-spark32 -n kube-system -o yaml
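The full output is omitted here. Purely as an illustration (not the literal content of that Pod), a toleration that matches such a taint by existence, with the value left empty, looks like this:

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists          # only the key and effect must match; the value stays empty
  effect: NoSchedule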

Managing the taints on a node:

[root@spark32 schedule]# kubectl taint --help
Update the taints on one or more nodes.
* A taint consists of a key, value, and effect. As an argument here, it is expressed as key=value:effect.
* The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up to
253 characters.
* Optionally, the key can begin with a DNS subdomain prefix and a single '/', like example.com/my-app
* The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up
to 63 characters.
* The effect must be NoSchedule, PreferNoSchedule or NoExecute.
* Currently taint can only apply to node.
Examples:
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoSchedule'.
# If a taint with that key and effect already exists, its value is replaced as specified.
kubectl taint nodes foo dedicated=special-user:NoSchedule
# Remove from node 'foo' the taint with key 'dedicated' and effect 'NoSchedule' if one exists.
kubectl taint nodes foo dedicated:NoSchedule-
# Remove from node 'foo' all the taints with key 'dedicated'
kubectl taint nodes foo dedicated-
# Add a taint with key 'dedicated' on nodes having label mylabel=X
kubectl taint node -l myLabel=X dedicated=foo:PreferNoSchedule
Options:
--all=false: Select all nodes in the cluster
--allow-missing-template-keys=true: If true, ignore any errors in templates when a field or map key is missing in
the template. Only applies to golang and jsonpath output formats.
-o, --output='': Output format. One of:
json|yaml|name|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-file.
--overwrite=false: If true, allow taints to be overwritten, otherwise reject taint updates that overwrite existing
taints.
-l, --selector='': Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)
--template='': Template string or path to template file to use when -o=go-template, -o=go-template-file. The
template format is golang templates [http://golang.org/pkg/text/template/#pkg-overview].
--validate=true: If true, use a schema to validate the input before sending it
Usage:
kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]
Use "kubectl options" for a list of global command-line options (applies to all commands).

The cluster currently has three worker nodes; taint two of them as follows:

[root@spark32 schedule]# kubectl taint node spark17 node-type=production:NoSchedule
node/spark17 tainted
[root@spark32 schedule]# kubectl taint node ubuntu31 node-type=production:NoSchedule
node/ubuntu31 tainted
[root@spark32 schedule]# kubectl get node ubuntu31 -o yaml
...
spec:
  podCIDR: 10.244.2.0/24
  taints:
  - effect: NoSchedule
    key: node-type
    value: production
...
[root@spark32 schedule]# kubectl describe node ubuntu31
...
Taints: node-type=production:NoSchedule
...

Define a Deployment whose Pods have no taint tolerations defined:

[root@spark32 schedule]# vim deploy-taint-demo.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: canary
  template:
    metadata:
      labels:
        app: myapp
        release: canary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 80

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml
deployment.apps/myapp-deploy created
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-675558bfc5-99xj9 1/1 Running 0 7s 10.244.3.18 hadoop16 <none> <none>
myapp-deploy-675558bfc5-scqtz 1/1 Running 0 7s 10.244.3.19 hadoop16 <none> <none>
myapp-deploy-675558bfc5-wsrdp 1/1 Running 0 7s 10.244.3.17 hadoop16 <none> <none>

As a result, all of the Deployment's Pods run on the one node that has no taint.

Now taint the previously untainted node hadoop16 as well, this time with effect NoExecute:

[root@spark32 schedule]# kubectl taint node hadoop16 node-type=production:NoExecute
node/hadoop16 tainted
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-675558bfc5-99xj9 1/1 Terminating 0 2m38s 10.244.3.18 hadoop16 <none> <none>
myapp-deploy-675558bfc5-bl5nk 0/1 Pending 0 2s <none> <none> <none> <none>
myapp-deploy-675558bfc5-fdmq2 0/1 Pending 0 2s <none> <none> <none> <none>
myapp-deploy-675558bfc5-scqtz 1/1 Terminating 0 2m38s 10.244.3.19 hadoop16 <none> <none>
myapp-deploy-675558bfc5-wsrdp 1/1 Terminating 0 2m38s 10.244.3.17 hadoop16 <none> <none>
myapp-deploy-675558bfc5-xd98m 0/1 Pending 0 2s <none> <none> <none> <none>
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-675558bfc5-bl5nk 0/1 Pending 0 18s <none> <none> <none> <none>
myapp-deploy-675558bfc5-fdmq2 0/1 Pending 0 18s <none> <none> <none> <none>
myapp-deploy-675558bfc5-xd98m 0/1 Pending 0 18s <none> <none> <none> <none>

The Pods that were running on hadoop16 have been evicted.

Now let's look at how to define tolerations on a Pod.

[root@spark32 schedule]# kubectl explain pods.spec.tolerations
KIND: Pod
VERSION: v1
RESOURCE: tolerations <[]Object>
DESCRIPTION:
If specified, the pod's tolerations.
The pod this Toleration is attached to tolerates any taint that matches the
triple <key,value,effect> using the matching operator <operator>.
FIELDS:
effect <string>
Effect indicates the taint effect to match. Empty means match all taint
effects. When specified, allowed values are NoSchedule, PreferNoSchedule
and NoExecute.
key <string>
Key is the taint key that the toleration applies to. Empty means match all
taint keys. If the key is empty, operator must be Exists; this combination
means to match all values and all keys.
operator <string>
Operator represents a key's relationship to the value. Valid operators are
Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for
value, so that a pod can tolerate all taints of a particular category.
tolerationSeconds <integer>
TolerationSeconds represents the period of time the toleration (which must
be of effect NoExecute, otherwise this field is ignored) tolerates the
taint. By default, it is not set, which means tolerate the taint forever
(do not evict). Zero and negative values will be treated as 0 (evict
immediately) by the system.
value <string>
Value is the taint value the toleration matches to. If the operator is
Exists, the value should be empty, otherwise just a regular string.

  • tolerationSeconds: how long a Pod may keep running on the node after a NoExecute taint it tolerates is added, before it is evicted. By default it is not set, which means the taint is tolerated forever (no eviction); zero or negative values mean evict immediately.

Edit deploy-taint-demo.yaml and define tolerations:

[root@spark32 schedule]# vim deploy-taint-demo.yaml
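The edited manifest is not reproduced in the transcript. A toleration consistent with the result below (Pods land on spark17 and ubuntu31, whose taint effect is NoSchedule, but not on hadoop16, whose effect is NoExecute) would be added to the Pod template spec roughly like this:

      tolerations:
      - key: "node-type"
        operator: "Equal"        # key, value and effect must all match the taint
        value: "production"
        effect: "NoSchedule"
        # tolerationSeconds only applies to effect NoExecute and is left unset here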

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml
deployment.apps/myapp-deploy configured
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-6bc4494c9b-695lm 1/1 Running 0 7s 10.244.1.43 spark17 <none> <none>
myapp-deploy-6bc4494c9b-qjg2h 1/1 Running 0 11s 10.244.2.82 ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-qsslj 1/1 Running 0 9s 10.244.1.42 spark17 <none> <none>

The Pods now run on spark17 and ubuntu31. Next, give the node-type=production taint on spark17 the effect NoExecute as well:

[root@spark32 schedule]# kubectl taint node spark17 node-type=production:NoExecute
node/spark17 tainted
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-6bc4494c9b-695lm 1/1 Terminating 0 96s 10.244.1.43 spark17 <none> <none>
myapp-deploy-6bc4494c9b-mtxp2 0/1 ContainerCreating 0 2s <none> ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-pl5j9 0/1 ContainerCreating 0 2s <none> ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-qjg2h 1/1 Running 0 100s 10.244.2.82 ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-qsslj 0/1 Terminating 0 98s 10.244.1.42 spark17 <none> <none>
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-6bc4494c9b-mtxp2 1/1 Running 0 22s 10.244.2.84 ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-pl5j9 1/1 Running 0 22s 10.244.2.83 ubuntu31 <none> <none>
myapp-deploy-6bc4494c9b-qjg2h 1/1 Running 0 2m 10.244.2.82 ubuntu31 <none> <none>

To let the Pods be scheduled onto spark17 and hadoop16 as well, the toleration must also cover those taints' effect (NoExecute); one possible adjustment is sketched below.
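The edited toleration is again not shown. In the result below, Pods land on all three nodes, including ubuntu31, whose taint still has effect NoSchedule, so a variant consistent with that output is to leave the effect empty, which tolerates every effect of the node-type=production taint (this is an assumption, not the author's exact file):

      tolerations:
      - key: "node-type"
        operator: "Equal"
        value: "production"
        effect: ""               # empty effect tolerates both NoSchedule and NoExecute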

[root@spark32 schedule]# kubectl apply -f deploy-taint-demo.yaml
deployment.apps/myapp-deploy configured
[root@spark32 schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-84dd787cff-nz899 1/1 Running 0 8s 10.244.1.44 spark17 <none> <none>
myapp-deploy-84dd787cff-rcgwq 1/1 Running 0 5s 10.244.3.20 hadoop16 <none> <none>
myapp-deploy-84dd787cff-zjsr9 1/1 Running 0 11s 10.244.2.85 ubuntu31 <none> <none>

Finally, remove the taints from all three nodes:

[root@spark32 manifests]# kubectl taint node spark17 node-type-
node/spark17 untainted
[root@spark32 manifests]# kubectl taint node hadoop16 node-type-
node/hadoop16 untainted
[root@spark32 manifests]# kubectl taint node ubuntu31 node-type-
node/ubuntu31 untainted