21 Advanced Scheduling


Taints and Tolerations

1. Previously, a node selector let us control which Node a Pod was scheduled to
2. Node taints and Pod tolerations can now be used to restrict which nodes a Pod may be scheduled to
3. A Pod can only be scheduled onto a node if it tolerates that node's taints
4. Node selectors and node affinity rules both work by adding explicit information to the Pod to decide whether it may be scheduled to a given node
5. Taints work the other way around: without modifying the Pod, taint information is added to the node itself to reject Pods from being deployed on that Node (see the sketch below)
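A taint is written as key=value:effect, and a Pod opts back in with a matching toleration. A minimal sketch of the two halves; the maintenance key and the node-name placeholder are illustrative only, not taken from the examples below:

# taint the node: Pods without a matching toleration can no longer be scheduled here
kubectl taint node <node-name> maintenance=true:NoSchedule

# the Pod opts back in via spec.tolerations
tolerations:
- key: maintenance
  operator: Equal
  value: "true"
  effect: NoSchedule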

Showing a node's taints

# By default the master node is expected to carry a taint
# Show the node's taint information
[root@k8s-master1 ~]# kubectl describe node k8s-master
CreationTimestamp: Mon, 09 Mar 2020 17:02:12 +0800
Taints: <none> # no taint for now (binary installation)

### Showing a Pod's tolerations

# coredns, flannel and nginx-ingress-controller all run as Pods inside the K8S cluster
# If these Pods are required to run on the master node, that works as long as the master carries no taints; once the master is tainted, they need matching tolerations
# A Pod's tolerations must match the master node's taints


[root@k8s-master1 ~]# kubectl describe pod coredns-6d8cfdd59d-w2xnx -n kube-system
...
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s

[root@k8s-master1 ~]# kubectl describe pod kube-flannel-ds-amd64-xq7gq -n kube-system
...
Tolerations: :NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule

[root@k8s-master1 ~]# kubectl describe pod nginx-ingress-controller-m8ljl -n ingress-nginx
...
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule

Understanding taint effects

...
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s

# 1. not-ready and unreachable describe the node's state; the value after "for" is how many seconds a Pod may keep running on a node in that state
# 2. NoSchedule and NoExecute are both taint effects
# 3. Every taint is associated with exactly one effect
# 4. NoSchedule: a Pod that does not tolerate these taints cannot be scheduled onto a node that carries them
# 5. NoExecute: also affects Pods already running on the node; when a NoExecute taint is added, Pods running there that do not tolerate it are evicted from the node (see the sketch below)
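A minimal sketch of the difference in practice; the maintenance key/value is a placeholder and these commands were not run against this cluster:

# NoSchedule: only blocks new Pods that lack a matching toleration
kubectl taint node k8s-node2 maintenance=true:NoSchedule

# NoExecute: additionally evicts already-running Pods without a matching toleration;
# Pods tolerating it with tolerationSeconds are evicted after that many seconds
kubectl taint node k8s-node2 maintenance=true:NoExecute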

Adding a custom taint to a Node

1. A common use case is distinguishing production and test environments within a single K8S cluster
2. Test-environment Pods must not be deployed onto production nodes

# Add the production-environment taint to node1
[root@k8s-master1 ~]# kubectl taint node k8s-node1 node-type=production:NoSchedule
node/k8s-node1 tainted

# Taint breakdown:
# key    = node-type
# value  = production
# effect = NoSchedule

[root@k8s-master1 ~]# kubectl describe node k8s-node1

...
Taints: node-type=production:NoSchedule
...

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 60m kube-proxy, k8s-node1 Starting kube-proxy.
Normal Starting 60m kubelet, k8s-node1 Starting kubelet.
Normal NodeHasSufficientMemory 60m kubelet, k8s-node1 Node k8s-node1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 60m kubelet, k8s-node1 Node k8s-node1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 60m kubelet, k8s-node1 Node k8s-node1 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 60m kubelet, k8s-node1 Updated Node Allocatable limit across pods
Warning Rebooted 60m kubelet, k8s-node1 Node k8s-node1 has been rebooted, boot id: f41c8744-b629-422b-8271-96aa873a7b73
# Deploy a test workload with no tolerations
# and check whether any Pod lands on node1, which carries the node-type=production:NoSchedule taint

[root@k8s-master1 ~]# kubectl run test --image=busybox --replicas=5 -- sleep 99999
deployment.apps/test created
[root@k8s-master1 ~]# kubectl get pods -o wide

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-55c88c5b54-5tts2 1/1 Running 0 20s 10.244.2.103 k8s-master1 <none> <none>
test-55c88c5b54-f9fgq 1/1 Running 0 20s 10.244.1.134 k8s-node2 <none> <none>
test-55c88c5b54-krppr 1/1 Running 0 20s 10.244.1.136 k8s-node2 <none> <none>
test-55c88c5b54-kwz8b 1/1 Running 0 20s 10.244.1.135 k8s-node2 <none> <none>
test-55c88c5b54-rhxg7 1/1 Running 0 20s 10.244.2.104 k8s-master1 <none> <none>

Adding a toleration to a Pod

# Export the existing deployment
[root@k8s-master1 Taints]# kubectl get deployment test -o yaml > proc-deployment.yaml

# Edit it and add the toleration

[root@k8s-master1 Taints]# vim proc-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    run: test
  name: test
  namespace: default
spec:
  replicas: 5
  selector:
    matchLabels:
      run: test
  template:
    metadata:
      labels:
        run: test
    spec:
      containers:
      - args:
        - sleep
        - "99999"
        image: busybox
        imagePullPolicy: Always
        name: test
      tolerations:          # add the toleration so the Pods may be scheduled onto the production node
      - key: node-type
        operator: Equal
        value: production
        effect: NoSchedule
# Deploy and check

[root@k8s-master1 Taints]# kubectl delete deployment test
deployment.apps "test" deleted

[root@k8s-master1 Taints]# kubectl create -f proc-deployment.yaml
deployment.apps/test created

[root@k8s-master1 Taints]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-5458cfd747-2p6wh 1/1 Running 0 24s 10.244.1.138 k8s-node2 <none> <none>
test-5458cfd747-4xjsx 1/1 Running 0 24s 10.244.2.105 k8s-master1 <none> <none>
test-5458cfd747-8tc8w 1/1 Running 0 24s 10.244.0.95 k8s-node1 <none> <none>
test-5458cfd747-drm2d 1/1 Running 0 24s 10.244.1.137 k8s-node2 <none> <none>
test-5458cfd747-psxtw 1/1 Running 0 24s 10.244.0.96 k8s-node1 <none> <none>

# Now the Pods can also be scheduled onto the production node, node1
# If node2 is the test environment, give it the taint node-type=test:NoSchedule
# Test-environment Pods then need a matching toleration for node-type=test:NoSchedule (see the sketch below)
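A sketch of the test-environment counterpart described above; these steps mirror the production example and were not run in the original notes:

# taint the test node
kubectl taint node k8s-node2 node-type=test:NoSchedule

# toleration for test-environment Pods (goes under spec.template.spec)
tolerations:
- key: node-type
  operator: Equal
  value: test
  effect: NoSchedule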

Use cases for taints and tolerations

1. Separating production and test environments
2. Separating projects deployed by different teams
3. Separating hardware resources, e.g. nodes with SSDs
4. Configuring the maximum wait before a Pod is rescheduled after its node fails; every Pod can set a per-taint tolerationSeconds
5. When K8S detects that a node is not-ready or unreachable, it waits 300 seconds; if the state persists, the Pod is rescheduled onto another node
6. If a Pod does not declare these two tolerations, they are added to it automatically; if 5 minutes is too long, the value can be changed (see the sketch after the output below)


[root@k8s-master1 Taints]# kubectl describe pod test-5458cfd747-2p6wh

Tolerations: node-type=production:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s

[root@k8s-master1 Taints]# kubectl edit pod test-5458cfd747-2p6wh
...
  tolerations:
  - effect: NoSchedule
    key: node-type
    operator: Equal
    value: production
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
...
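If the default 5 minutes is too long for a workload, the automatically-added tolerations can be overridden in the Pod template. A minimal sketch; the 60-second value is an arbitrary example:

# goes under spec.template.spec of the Deployment
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60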

Testing whether taints still apply after rebooting a binary-installed K8S node

1. Nodes in a binary installation carry no taints by default; after setting a taint and rebooting a node, test whether the component Pods that were running there are recreated
2. It looks fine, but this deserves more testing

[root@k8s-master1 ~]# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default nfs-client-provisioner-56f4b98d47-v4nf6 1/1 Running 16 10d 10.244.0.97 k8s-node1 <none> <none>
default test-5458cfd747-2p6wh 1/1 Running 1 19m 10.244.1.141 k8s-node2 <none> <none>
default test-5458cfd747-4xjsx 1/1 Running 1 19m 10.244.2.108 k8s-master1 <none> <none>
default test-5458cfd747-8tc8w 1/1 Running 1 19m 10.244.0.99 k8s-node1 <none> <none>
default test-5458cfd747-drm2d 1/1 Running 1 19m 10.244.1.140 k8s-node2 <none> <none>
default test-5458cfd747-psxtw 1/1 Running 1 19m 10.244.0.100 k8s-node1 <none> <none>
ingress-nginx nginx-ingress-controller-6txhq 1/1 Running 28 21d 172.31.228.70 k8s-node2 <none> <none>
ingress-nginx nginx-ingress-controller-jqxrm 1/1 Running 27 21d 172.31.228.69 k8s-node1 <none> <none>
ingress-nginx nginx-ingress-controller-m8ljl 1/1 Running 27 21d 172.31.228.67 k8s-master1 <none> <none>
kube-system coredns-6d8cfdd59d-w2xnx 1/1 Running 23 19d 10.244.2.107 k8s-master1 <none> <none>
kube-system kube-flannel-ds-amd64-q29r8 1/1 Running 28 21d 172.31.228.70 k8s-node2 <none> <none>
kube-system kube-flannel-ds-amd64-xh7c4 1/1 Running 28 21d 172.31.228.69 k8s-node1 <none> <none>
kube-system kube-flannel-ds-amd64-xq7gq 1/1 Running 29 21d 172.31.228.67 k8s-master1 <none> <none>
kube-system metrics-server-7dbbcf4c7-6qrsb 1/1 Running 2 2d7h 10.244.1.139 k8s-node2 <none> <none>
kubernetes-dashboard dashboard-metrics-scraper-566cddb686-t8fw4 1/1 Running 23 19d 10.244.2.106 k8s-master1 <none> <none>
kubernetes-dashboard kubernetes-dashboard-c4bc5bd44-4prrf 1/1 Running 27 21d 10.244.0.98 k8s-node1 <none> <none>

Removing a taint from a Node

[root@k8s-master1 Taints]# kubectl taint nodes k8s-node1 node-type=test:NoSchedule-
node/k8s-node1 untainted

[root@k8s-master1 Taints]# kubectl describe node k8s-node1
...
Taints: <none>
...

Scheduling Pods to specific nodes with node affinity

1. Node affinity is a newer mechanism than taints; it tells K8S to schedule a Pod only onto certain nodes
2. The earliest form of node affinity is the nodeSelector: the selector must match the Node's labels
3. Node affinity is more expressive, and the node selector may eventually be deprecated
4. Every Pod can define its own node affinity rules, either as hard requirements or as preferences
5. A preference tells K8S which kind of Node the Pod would rather run on; K8S tries to satisfy it, and if it cannot, the Pod is scheduled onto another node

Checking the default node labels

1. Like the node selector, node affinity selects nodes by their labels
2. Check the default labels on a Node
3. Custom labels such as disk=ssd work too, and more can be added, e.g. data-center location or server zone; the gpu label used in the next section was added the same way (see the sketch below)

[root@k8s-master1 Taints]# kubectl describe node k8s-node1
...
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/os=linux
        disk=ssd # custom label added earlier while studying labels
        kubernetes.io/arch=amd64
        kubernetes.io/hostname=k8s-node1
        kubernetes.io/os=linux
...
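A sketch of how such a custom label would be added; the exact command was not recorded in the original notes, but the kubectl get nodes -L gpu output further below confirms that k8s-node2 carries gpu=true:

kubectl label node k8s-node2 gpu=true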

Specifying a required node affinity rule


[root@k8s-master1 Taints]# vim kubia-gpu-nodeaffinity.yaml

apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
  containers:
  - image: 172.31.228.68/project/kubia
    name: kubia

[root@k8s-master1 Taints]# kubectl create -f kubia-gpu-nodeaffinity.yaml
pod/kubia-gpu created

Node affinity field definitions

1. affinity / nodeAffinity
2. requiredDuringScheduling... : rules under this field define the labels a node must have for the Pod to be scheduled onto it
3. ...IgnoredDuringExecution : rules under this field do not affect Pods already running on the node
4. Affinity rules only affect Pods being scheduled and carry no risk of evicting already-deployed Pods, which is why the IgnoredDuringExecution suffix is used
5. nodeSelectorTerms and matchExpressions define which expressions the node labels must satisfy
6. In this example: the node must carry a gpu label and its value must be true

[root@k8s-master1 Taints]# kubectl get nodes -L gpu
NAME STATUS ROLES AGE VERSION GPU
k8s-master1 Ready <none> 21d v1.16.0
k8s-node1 Ready <none> 21d v1.16.0
k8s-node2 Ready <none> 21d v1.16.0 true

# A Pod's node affinity specifies the labels a node must have for the Pod to be schedulable there (see the check below)
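To double-check the placement, the Pod can be listed with its node; given the labels above, it should land on k8s-node2, the only node labeled gpu=true. This check was not part of the original run:

kubectl get pod kubia-gpu -o wide   # the NODE column should show k8s-node2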

Preferring certain nodes when scheduling Pods

1. Large container platforms span multiple data centers or regions
2. Some servers have better resources reserved for them in advance
3. Two basic labels are used here: the zone, and whether the node is dedicated or shared

Adding labels to the nodes

Region    : zone / region
share     : sharing mode
dedicated : dedicated mode
shared    : shared mode
[root@k8s-master1 Taints]# kubectl label node k8s-node1 Region=zone1
node/k8s-node1 labeled
[root@k8s-master1 Taints]# kubectl label node k8s-node2 Region=zone2
node/k8s-node2 labeled
[root@k8s-master1 Taints]# kubectl label node k8s-node1 share=dedicated
node/k8s-node1 labeled
[root@k8s-master1 Taints]# kubectl label node k8s-node2 share=shared
node/k8s-node2 labeled

# Remove the uppercase Region label (the affinity example below uses the lowercase key region)
[root@k8s-master1 Taints]# kubectl label node k8s-node2 Region-
node/k8s-node2 labeled
[root@k8s-master1 Taints]# kubectl label node k8s-node1 Region-
node/k8s-node1 labeled

# Re-add them with the lowercase key
[root@k8s-master1 Taints]# kubectl label node k8s-node2 region=zone2
node/k8s-node2 labeled
[root@k8s-master1 Taints]# kubectl label node k8s-node1 region=zone1
node/k8s-node1 labeled

# Check
[root@k8s-master1 Taints]# kubectl get nodes -L region -L share
NAME STATUS ROLES AGE VERSION REGION SHARE
k8s-master1 Ready <none> 21d v1.16.0
k8s-node1 Ready <none> 21d v1.16.0 zone1 dedicated
k8s-node2 Ready <none> 21d v1.16.0 zone2 shared

Specifying preferred node affinity rules

[root@k8s-master1 Taints]# vim preferred-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pref
spec:
  replicas: 10
  selector:
    matchLabels:
      app: pref
  template:
    metadata:
      labels:
        app: pref
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: region
                operator: In
                values:
                - zone1
          - weight: 20
            preference:
              matchExpressions:
              - key: share
                operator: In
                values:
                - dedicated
      containers:
      - args:
        - sleep
        - "99999"
        image: busybox
        name: main
1. Nodes in zone1 are preferred, with weight 80
2. dedicated nodes are the secondary preference, with weight 20
3. With more nodes, say 4, the highest priority goes to zone1 + dedicated, then zone1 + shared, then dedicated nodes in other zones, and finally everything else

[root@k8s-master1 Taints]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pref-69f48f9b59-4wk8h 1/1 Running 0 2m18s 10.244.0.111 k8s-node1 <none> <none>
pref-69f48f9b59-59h24 1/1 Running 0 2m18s 10.244.2.114 k8s-master1 <none> <none>
pref-69f48f9b59-77cgj 1/1 Running 0 2m18s 10.244.1.150 k8s-node2 <none> <none>
pref-69f48f9b59-bxfqs 1/1 Running 0 2m18s 10.244.0.114 k8s-node1 <none> <none>
pref-69f48f9b59-gsqsc 1/1 Running 0 2m18s 10.244.0.112 k8s-node1 <none> <none>
pref-69f48f9b59-klblh 1/1 Running 0 2m18s 10.244.0.110 k8s-node1 <none> <none>
pref-69f48f9b59-kpbzz 1/1 Running 0 2m18s 10.244.2.115 k8s-master1 <none> <none>
pref-69f48f9b59-s5mg5 1/1 Running 0 2m18s 10.244.1.149 k8s-node2 <none> <none>
pref-69f48f9b59-w4c8j 1/1 Running 0 2m18s 10.244.0.115 k8s-node1 <none> <none>
pref-69f48f9b59-x4csj 1/1 Running 0 2m18s 10.244.0.113 k8s-node1 <none> <none>

# Why are Pods still assigned to k8s-master1 and k8s-node2?
1. Besides the node affinity priority function, the scheduler uses other priority functions to decide placement, including SelectorSpreadPriority
2. That function spreads Pods belonging to the same ReplicaSet or Service across different nodes to avoid a single point of failure
3. Without the node affinity preference, the Pods would be distributed evenly across the 3 worker nodes (a required alternative is sketched below)
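If strict zone1 placement were needed rather than a preference, the required form shown earlier could be reused with the region label. A sketch, not run in the original notes:

# replaces preferredDuringScheduling... under spec.template.spec.affinity.nodeAffinity
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: region
      operator: In
      values:
      - zone1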

Co-locating Pods with pod affinity and anti-affinity

1. Everything above used affinity between Pods and nodes to place Pods onto specific nodes
2. Pods can also declare affinity towards other Pods
3. For example, deploying a frontend Pod and a backend Pod close together reduces latency and improves application performance
4. Node affinity could place both Pods into the same zone or onto Node servers in the same rack
5. Pod affinity lets K8S place the two Pods wherever is suitable while making sure they end up close together

Using pod affinity to deploy multiple Pods on the same node

# First deploy the backend Pod
# with the label app=backend, which the frontend's pod affinity will reference

[root@k8s-master1 Taints]# kubectl run backend -l app=backend --image busybox -- sleep 99999
deployment.apps/backend created

[root@k8s-master1 Taints]# kubectl get pods -L app
NAME READY STATUS RESTARTS AGE APP
backend-75d54556d4-ns276 1/1 Running 0 25s backend
[root@k8s-master1 Taints]# vim frontend-podaffinity-host.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: backend
      containers:
      - name: main
        image: busybox
        args:
        - sleep
        - "99999"

# requiredDuringScheduling : a hard requirement
# podAffinity : pod affinity
# These Pods must be scheduled onto the node hosting Pods that match the selector app: backend
# Pod affinity allows a Pod to be scheduled onto nodes that already run Pods carrying the specified label
# Check: all the frontend Pods were scheduled onto the same node
[root@k8s-master1 Taints]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backend-75d54556d4-ns276 1/1 Running 0 8m22s 10.244.1.151 k8s-node2 <none> <none>
frontend-7bfc78d68d-46lf8 1/1 Running 0 16s 10.244.1.155 k8s-node2 <none> <none>
frontend-7bfc78d68d-6h82c 1/1 Running 0 16s 10.244.1.153 k8s-node2 <none> <none>
frontend-7bfc78d68d-m5fsz 1/1 Running 0 16s 10.244.1.154 k8s-node2 <none> <none>
frontend-7bfc78d68d-mrdxr 1/1 Running 0 16s 10.244.1.156 k8s-node2 <none> <none>
frontend-7bfc78d68d-pcd6k 1/1 Running 0 16s 10.244.1.152 k8s-node2 <none> <none>

# When scheduling the frontend Pods, the scheduler first finds all Pods that match the labelSelector configured in the frontend's podAffinity
# and then schedules the frontend Pods onto that same node

Deploying Pods in the same zone or the same rack


# The previous example put all the Pods onto a single Node, which risks a single point of failure
# Suppose we have 20 servers, 5 per rack, so the racks are numbered 1 to 4
# Each Node gets a label rack=<rack number 1-4>
# The scheduler then reads the Pod's podAffinity, whose topologyKey is set to rack
# If the frontend runs in rack=2, the backend Pods will prefer nodes in rack=2
# The topologyKey in podAffinity determines the scope within which the Pod is co-located (see the sketch below)
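A sketch of the rack-based variant described above; the rack label and its values are hypothetical and not present in this cluster:

# label each node with its rack
kubectl label node <node-name> rack=rack2

# in the frontend Deployment, widen the co-location scope from a single host to a rack
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchLabels:
          app: backend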

Preferred pod affinity

1. Tells K8S to prefer scheduling onto the same node
2. If that cannot be satisfied, other nodes are acceptable
3. Because of the scheduler's other priority functions, most Pods end up on the same node, but some are still placed elsewhere to avoid a single point of failure

[root@k8s-master1 logs]# vim frontend-podaffinity-preferred-host.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: backend
      containers:
      - name: main
        image: busybox
        args:
        - sleep
        - "99999"

[root@k8s-master1 Taints]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backend-75d54556d4-ns276 1/1 Running 0 35m 10.244.1.151 k8s-node2 <none> <none>
frontend-6747f764d7-7t6lx 1/1 Running 0 24s 10.244.1.157 k8s-node2 <none> <none>
frontend-6747f764d7-8kf4r 1/1 Running 0 24s 10.244.1.159 k8s-node2 <none> <none>
frontend-6747f764d7-dlkr7 1/1 Running 0 24s 10.244.1.158 k8s-node2 <none> <none>
frontend-6747f764d7-dmspl 1/1 Running 0 24s 10.244.2.116 k8s-master1 <none> <none>
frontend-6747f764d7-j2cl5 1/1 Running 0 24s 10.244.0.116 k8s-node1 <none> <none>

Using pod anti-affinity to spread Pods apart

1. When Pods should stay away from each other, that is pod anti-affinity
2. The scheduler will never schedule a Pod onto a node hosting a Pod it is anti-affine to
3. Useful when two resource-hungry workloads would interfere with each other on the same Node
4. Also useful for spreading a service across availability zones, so that if one zone fails the service stays up in another
5. Simply replace podAffinity with podAntiAffinity
[root@k8s-master1 Taints]# vim podAntiAffinity.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: frontend
      containers:
      - name: main
        image: busybox
        args:
        - sleep
        - "99999"
[root@k8s-master1 Taints]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backend-75d54556d4-ns276 1/1 Running 0 86m 10.244.1.151 k8s-node2 <none> <none>
frontend-b4667c656-hg26f 0/1 Pending 0 64s <none> <none> <none> <none>
frontend-b4667c656-hzl2b 1/1 Running 0 64s 10.244.1.170 k8s-node2 <none> <none>
frontend-b4667c656-mcd7j 1/1 Running 0 64s 10.244.2.120 k8s-master1 <none> <none>
frontend-b4667c656-tcmlk 0/1 Pending 0 64s <none> <none> <none> <none>
frontend-b4667c656-wmrb8 1/1 Running 0 64s 10.244.0.124 k8s-node1 <none> <none>

# Two Pods are Pending: the scheduler refuses to place a frontend Pod onto a node that already runs another frontend Pod, and with 5 replicas there are only 3 nodes (a softer alternative is sketched below)
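If it matters more that all replicas run than that they are strictly spread, the preferred form can be used instead, analogous to the preferred pod affinity above. A sketch, not run in the original notes:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: frontend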