15 k8s: All-Round Kubernetes Monitoring


K8S Monitoring Solutions

Two monitoring solutions

1. cAdvisor+Heapster+InfluxDB+Grafana	    
2. cAdvisor/exporter+Prometheus+Grafana
  1. For non-containerized workloads, tools such as Zabbix and Open-Falcon are already deeply embedded in enterprises.
  2. Prometheus is a newer monitoring system that emerged from the container ecosystem, so the focus here is on how to monitor containers. As containerization becomes the norm, traditional tooling that does not adapt will be phased out, and the underlying infrastructure will move to a new technology stack.

cAdvisor+Heapster+InfluxDB+Grafana

1. cAdvisor collects performance metrics from all containers and is integrated into the kubelet
2. Heapster aggregates the data
3. InfluxDB stores it as a time series database
4. Grafana visualizes the data

Drawbacks:
1. Heapster cannot monitor the application/business itself
2. Poor extensibility
3. Heapster has been deprecated; its replacement is Metrics Server
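
For completeness: Metrics Server only serves the resource-metrics API used by kubectl top and the HPA. A quick check, assuming Metrics Server is installed in the cluster, looks like:

kubectl top nodes                    # per-node CPU/memory usage from Metrics Server
kubectl top pods --all-namespaces   # per-pod usage across all namespaces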

cAdvisor/exporter+Prometheus+Grafana

1. cAdvisor collects container performance metrics
2. node_exporter monitors the nodes
3. Prometheus aggregates the data
4. Grafana visualizes the data
5. kube-state-metrics collects resource object state by querying the apiserver (which reads from etcd)

K8S Monitoring Metrics

Monitoring Kubernetes Itself

  1. Node resource utilization
  2. Number of Nodes
  3. Number of Pods (per Node)
  4. Resource object state

Pod Monitoring

  1. Number of Pods (per project)
  2. Container resource utilization
  3. The application itself

Implementation Approach

  1. Pods in k8s are created dynamically, so targets cannot be hard-coded in the Prometheus configuration file every time
  2. Therefore service discovery has to be used
  3. Kubernetes service discovery finds targets through the k8s API, obtains their current state, and collects data following each pod's lifecycle
Service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

Supported discovery roles (an example scrape job is sketched after this list):
1. node       automatically discovers the cluster's nodes
2. pod        automatically discovers running containers and their ports
3. service    automatically discovers created service IPs and ports
4. endpoints  automatically discovers the containers behind pods
5. ingress    automatically discovers created access entry points and rules
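
A minimal sketch of a scrape job using the pod role (the job name and the annotation convention here are illustrative, not taken from the bundled manifests):

scrape_configs:
- job_name: kubernetes-pods                  # hypothetical job name, for illustration only
  kubernetes_sd_configs:
  - role: pod                                # discover every pod and its declared container ports
  relabel_configs:
  # keep only pods that opt in via the annotation prometheus.io/scrape: "true"
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: "true"
  # copy the pod's labels onto the scraped series
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)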

Deploying Prometheus in K8S

Preparation

1. A k8s cluster
2. Storage: NFS dynamic provisioning
3. Deployment reference:
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
[root@k8s-master1 opt]# unzip prometheus-k8s\ .zip 

[root@k8s-master1 prometheus-k8s]# ls -l

-rw-r--r-- 1 root root 652 Feb 9 2019 alertmanager-configmap.yaml
-rw-r--r-- 1 root root 2183 Feb 7 2019 alertmanager-deployment.yaml
-rw-r--r-- 1 root root 331 Feb 9 2019 alertmanager-pvc.yaml
-rw-r--r-- 1 root root 372 Feb 7 2019 alertmanager-service.yaml
-rw-r--r-- 1 root root 1198 Feb 8 2019 grafana.yaml
-rw-r--r-- 1 root root 2377 Feb 8 2019 kube-state-metrics-deployment.yaml
-rw-r--r-- 1 root root 2240 Feb 7 2019 kube-state-metrics-rbac.yaml
-rw-r--r-- 1 root root 506 Feb 7 2019 kube-state-metrics-service.yaml
-rw-r--r-- 1 root root 1495 Feb 7 2019 node-exporter-ds.yml
-rw-r--r-- 1 root root 425 Feb 7 2019 node-exporter-service.yaml
-rw-r--r-- 1 root root 646 Feb 2 2019 node_exporter.sh
-rw-r--r-- 1 root root 99 Feb 7 2019 OWNERS
-rw-r--r-- 1 root root 5131 Feb 10 2019 prometheus-configmap.yaml
-rw-r--r-- 1 root root 1080 Feb 7 2019 prometheus-rbac.yaml
-rw-r--r-- 1 root root 1802 Feb 9 2019 prometheus-rules.yaml
-rw-r--r-- 1 root root 370 Feb 7 2019 prometheus-service.yaml
-rw-r--r-- 1 root root 3539 Feb 10 2019 prometheus-statefulset-static-pv.yaml
-rw-r--r-- 1 root root 3259 Feb 10 2019 prometheus-statefulset.yaml
-rw-r--r-- 1 root root 349 Feb 7 2019 README.md
# k8s core components
[root@k8s-master1 prometheus-k8s]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master1 Ready <none> 8d v1.16.0
k8s-node1 Ready <none> 8d v1.16.0
k8s-node2 Ready <none> 8d v1.16.0
# NFS PV dynamic provisioning
[root@k8s-master1 prometheus-k8s]# kubectl get pods
NAME READY STATUS RESTARTS AGE
jenkins-6459665769-2dzp7 1/1 Running 1 25h
nfs-client-provisioner-769f87c8f6-sdk9q 1/1 Running 1 25h
[root@k8s-master1 prometheus-k8s]# kubectl get sc
NAME PROVISIONER AGE
managed-nfs-storage fuseim.pri/ifs 43h

Deploying Prometheus

# rbac: authorization for accessing the kube API; no changes needed

# configmap: Prometheus configuration; change the node addresses to your own
[root@k8s-master1 prometheus-k8s]# vim prometheus-configmap.yaml
...
    - job_name: kubernetes-nodes
      scrape_interval: 30s
      static_configs:
      - targets:
        - 172.31.228.50:9100
        - 172.31.228.52:9100
        - 172.31.228.53:9100
...
# prometheus-statefulset.yaml: stateful deployment
# change storageClassName to the StorageClass created earlier
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage

# the initial install has no rules yet, so comment the rules mount out for now
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        #- name: prometheus-rules
        #  mountPath: /etc/config/rules

      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      #- name: prometheus-rules
      #  configMap:
      #    name: prometheus-rules
# Deploy:
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-rbac.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-statefulset.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-service.yaml

[root@k8s-master1 prometheus-k8s]# kubectl get pods,svc -n kube-system
NAME READY STATUS RESTARTS AGE
pod/coredns-6d8cfdd59d-k5wl9 1/1 Running 15 8d
pod/kube-flannel-ds-amd64-2k5kz 1/1 Running 8 8d
pod/kube-flannel-ds-amd64-gvs6b 1/1 Running 15 8d
pod/kube-flannel-ds-amd64-hwglz 1/1 Running 15 8d
pod/prometheus-0 2/2 Running 0 20m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.0.0.2 <none> 53/UDP,53/TCP 8d
service/prometheus NodePort 10.0.0.254 <none> 9090:32037/TCP 29m

# Web UI access
http://47.240.12.8:32037/graph

Configuration Walkthrough Based on K8S Service Discovery

# sidecar container that reloads Prometheus after the configmap is modified

containers:
  - name: prometheus-server-configmap-reload
    image: "jimmidyson/configmap-reload:v0.1"
    imagePullPolicy: "IfNotPresent"
    args:
      - --volume-dir=/etc/config
      - --webhook-url=http://localhost:9090/-/reload
    volumeMounts:
      - name: config-volume
        mountPath: /etc/config
        readOnly: true
    resources:
      limits:
        cpu: 10m
        memory: 10Mi
      requests:
        cpu: 10m
        memory: 10Mi
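
The sidecar watches /etc/config and, whenever the mounted configmap changes on disk, calls Prometheus's reload webhook. Conceptually it is equivalent to the following (a sketch; /-/reload only responds if the Prometheus server runs with --web.enable-lifecycle):

# ask the running Prometheus to re-read /etc/config/prometheus.yml without a restart
curl -X POST http://localhost:9090/-/reload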
# Enter the container and inspect the configuration file
# The key skill here is relabeling
[root@k8s-master1 prometheus-k8s]# kubectl exec -it prometheus-0 sh -c prometheus-server -n kube-system
/prometheus $ vi /etc/config/prometheus.yml

# scrape Prometheus itself
scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090

- job_name: kubernetes-apiservers
  # Kubernetes service discovery
  kubernetes_sd_configs:
  - role: endpoints
  # relabeling
  relabel_configs:
  # keep only targets whose joined source labels match the regex, i.e. the "kubernetes"
  # service in the "default" namespace on the "https" port (the source label values
  # are concatenated with ";" into "default;kubernetes;https")
  - action: keep
    regex: default;kubernetes;https
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_service_name
    - __meta_kubernetes_endpoint_port_name
  scheme: https
  tls_config:
    # the CA certificate is mounted by default via the service account
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # skip server certificate verification
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

# the apiserver could also be scraped directly via its IP and port 6443

# the kubelet serves metrics on port 10250
- job_name: kubernetes-nodes-kubelet
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  # copy all node labels onto the target
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # only scrape services annotated with prometheus.io/scrape: "true"
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  # honour an optional prometheus.io/scheme annotation (http or https)
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  # honour an optional prometheus.io/path annotation
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  # rewrite the target address to use the port given in prometheus.io/port
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  # copy service labels, namespace and service name onto the scraped series
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_name
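
With this job in place, a Service only needs the right annotations to be scraped automatically. A hypothetical example (names and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical service name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # picked up by the keep rule above
    prometheus.io/port: "8080"      # port to scrape
    prometheus.io/path: "/metrics"  # optional; /metrics is the default
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080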

Monitoring Pods in the K8S Cluster

  1. The kubelet on each node exposes cAdvisor's metrics endpoint, which provides performance metrics for every container on that node
  2. cAdvisor is already integrated into the kubelet
Exposed endpoints:
http://NodeIP:10255/metrics/cadvisor   # kubelet read-only port (plain HTTP, no auth)
https://NodeIP:10250/metrics/cadvisor  # kubelet secure port; recommended
# e.g. https://172.31.228.50:10250/metrics/cadvisor
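
One way to scrape these endpoints through service discovery, sketched with the same in-cluster credentials as above (the bundled prometheus-configmap.yaml may already contain an equivalent job):

- job_name: kubernetes-nodes-cadvisor
  metrics_path: /metrics/cadvisor        # cAdvisor metrics served by the kubelet
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)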

Deploying Grafana in K8S

  1. Grafana is an open-source metrics analytics and visualization system
  2. Grafana is also deployed inside the k8s cluster
Official site: https://grafana.com/grafana/download

Recommended dashboards:
1. Cluster resource monitoring: 3119
2. Resource object state monitoring: 6417
3. Node monitoring: 9276
[root@k8s-master1 prometheus-k8s]# kubectl apply -f grafana.yaml 

[root@k8s-master1 prometheus-k8s]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d8cfdd59d-k5wl9 1/1 Running 15 8d
grafana-0 1/1 Running 0 62s
kube-flannel-ds-amd64-2k5kz 1/1 Running 8 8d
kube-flannel-ds-amd64-gvs6b 1/1 Running 15 8d
kube-flannel-ds-amd64-hwglz 1/1 Running 15 8d
prometheus-0 2/2 Running 0 4h11m

[root@k8s-master1 prometheus-k8s]# kubectl get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.0.0.220 <none> 80:30007/TCP 5m41s
kube-dns ClusterIP 10.0.0.2 <none> 53/UDP,53/TCP 8d
prometheus NodePort 10.0.0.254 <none> 9090:32037/TCP 4h24m


# Web access: http://nodeip:30007
http://47.240.12.8:30007/login   admin/admin, then change the password
# Add a data source
# Prometheus url = http://prometheus:9090
# Import the resource monitoring dashboard 3119

# Disk panels: the dashboard queries the xvda virtual disk by default; check your own disks with df -h, change the device to vda (or whatever matches), save, and the data will show up
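
For reference, the kind of panel expression this affects looks roughly like the one below (a sketch; the exact query in dashboard 3119 may differ slightly, the point is only that the device regex must match your disks):

sum(container_fs_usage_bytes{device=~"^/dev/vda.*", id="/"})
  / sum(container_fs_limit_bytes{device=~"^/dev/vda.*", id="/"}) * 100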

Monitoring K8S Cluster Nodes

  1. node_exporter: a collector written in Go for monitoring *NIX systems
  2. node_exporter can run in k8s as a DaemonSet, starting one collector per node
  3. Here it is not deployed via YAML: the in-cluster approach is limited because some host data cannot be collected and the host mounts are cumbersome
  4. It therefore has to be installed on every node
    Documentation: https://prometheus.io/docs/guides/node-exporter/
    GitHub: https://github.com/prometheus/node_exporter
    Exporter list: https://prometheus.io/docs/instrumenting/exporters/
# Deployment script
[root@k8s-master1 prometheus-k8s]# vim node_exporter.sh

#!/bin/bash

# download and unpack node_exporter v0.17.0
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

# systemd unit; the systemd collector is enabled and limited to the listed k8s-related services
cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
# Test access from a browser or curl
http://47.240.12.8:9100/metrics
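
Because the systemd collector was enabled with a unit whitelist, the exporter also exposes the state of those services; a quick sanity check in the Prometheus query box could be (a sketch; metric and label names as exposed by node_exporter 0.17):

node_systemd_unit_state{name="kubelet.service", state="active"}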
# Hook the nodes into monitoring: edit the configuration file
[root@k8s-master1 prometheus-k8s]# vim prometheus-configmap.yaml
...
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    # nodes are added manually as static targets
    - job_name: kubernetes-nodes
      static_configs:
      - targets:
        - 172.31.228.50:9100
        - 172.31.228.52:9100
        - 172.31.228.53:9100
...

# Apply the change
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured

# Import dashboard: K8S Node monitoring: 9276


# Example: node CPU utilization (%) = 100 minus the average idle percentage
100 - (avg(irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[1m])) * 100)

Monitoring K8S Resource Objects and Visualizing Them in Grafana

kube-state-metrics collects state information for the various resource objects in k8s:
kube_daemonset_*
kube_deployment_*
kube_job_*
kube_namespace_*
kube_node_*
kube_persistentvolumeclaim_*
kube_pod_container_*
kube_pod_*
kube_replicaset_*
kube_service_*
kube_statefulset_*

# Official documentation
https://github.com/kubernetes/kube-state-metrics
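
A couple of hedged query examples built on these metrics (expressions are illustrative; exact label sets depend on the kube-state-metrics version):

# pods that are not in the Running phase, grouped by namespace
sum(kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) by (namespace)

# deployments whose available replicas do not match the desired count
kube_deployment_spec_replicas != kube_deployment_status_replicas_available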
[root@k8s-master1 prometheus-k8s]# kubectl apply -f kube-state-metrics-rbac.yaml 
[root@k8s-master1 prometheus-k8s]# kubectl apply -f kube-state-metrics-deployment.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f kube-state-metrics-service.yaml
# Import dashboard: k8s resource object state monitoring: 6417

Deploying Alertmanager in K8S

  1. Deploy Alertmanager
  2. Configure Prometheus to communicate with Alertmanager
  3. Configure alerting
1. Point Prometheus at a rules directory
2. Store the alerting rules in a configmap
3. Mount the configmap into the container's rules directory
4. Add the Alertmanager alerting configuration
[root@k8s-master1 prometheus-k8s]# kubectl apply -f  alertmanager-configmap.yaml 
[root@k8s-master1 prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
[root@k8s-master1 prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
[root@k8s-master1 prometheus-k8s]# kubectl get pods,svc -n kube-system
NAME READY STATUS RESTARTS AGE
pod/alertmanager-7866dbb64c-z2kcg 2/2 Running 0 112s
pod/coredns-6d8cfdd59d-k5wl9 1/1 Running 15 8d
pod/grafana-0 1/1 Running 0 87m
pod/kube-flannel-ds-amd64-2k5kz 1/1 Running 8 8d
pod/kube-flannel-ds-amd64-gvs6b 1/1 Running 15 8d
pod/kube-flannel-ds-amd64-hwglz 1/1 Running 15 8d
pod/kube-state-metrics-5c656f9944-cng55 2/2 Running 0 21m
pod/prometheus-0 2/2 Running 0 93m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager ClusterIP 10.0.0.110 <none> 80/TCP 56s
service/grafana NodePort 10.0.0.132 <none> 80:30007/TCP 87m
service/kube-dns ClusterIP 10.0.0.2 <none> 53/UDP,53/TCP 8d
service/kube-state-metrics ClusterIP 10.0.0.251 <none> 8080/TCP,8081/TCP 20m
service/prometheus NodePort 10.0.0.66 <none> 9090:31078/TCP 96m

Configuring Prometheus to Communicate with Alertmanager

[root@k8s-master1 prometheus-k8s]# vim prometheus-configmap.yaml

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:80"]   # resolved via the alertmanager Service in kube-system

Configuring Alerts

  1. Point Prometheus at the rules directory
[root@k8s-master1 prometheus-k8s]# vim prometheus-configmap.yaml 
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
...

[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml

Store the alerting rules in a configmap

# Test alerting rules are already included; no changes needed
[root@k8s-master1 prometheus-k8s]# vim prometheus-rules.yaml

Mount the configmap into the container's rules directory

# Uncomment the prometheus-rules volume and volumeMount in prometheus-statefulset.yaml (commented out earlier), then apply:
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-rules.yaml 
[root@k8s-master1 prometheus-k8s]# kubectl apply -f prometheus-statefulset.yaml

[root@k8s-master1 prometheus-k8s]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
alertmanager-7866dbb64c-z2kcg 2/2 Running 0 22m
coredns-6d8cfdd59d-k5wl9 1/1 Running 15 8d
grafana-0 1/1 Running 0 107m
kube-flannel-ds-amd64-2k5kz 1/1 Running 8 8d
kube-flannel-ds-amd64-gvs6b 1/1 Running 15 8d
kube-flannel-ds-amd64-hwglz 1/1 Running 15 8d
kube-state-metrics-5c656f9944-cng55 2/2 Running 0 41m
prometheus-0 2/2 Running 0 53s

Add the Alertmanager Alerting Configuration

[root@k8s-master1 prometheus-k8s]# vim alertmanager-configmap.yaml 

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '365042337@qq.com'
      smtp_auth_username: '365042337@qq.com'
      smtp_auth_password: 'jtebrxpgcuyjcafi'
      # smtp_require_tls: false

    receivers:
    - name: default-receiver
      email_configs:
      - to: "365042337@qq.com"

    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master1 prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml

[root@k8s-master1 prometheus-k8s]# kubectl exec -it prometheus-0 sh -c prometheus-server -n kube-system
/prometheus $ ls /etc/config/rules/
general.rules node.rules

/prometheus $ cat /etc/config/rules/general.rules
groups:
- name: general.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
/prometheus $


[root@k8s-master1 prometheus-k8s]# kubectl exec -it alertmanager-7866dbb64c-z2kcg sh -n kube-system
/alertmanager # cat /etc/config/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '365042337@qq.com'
  smtp_auth_username: '365042337@qq.com'
  smtp_auth_password: 'jtebrxpgcuyjcafi'
  # smtp_require_tls: false

receivers:
- name: default-receiver
  email_configs:
  - to: "365042337@qq.com"

route:
  group_interval: 1m
  group_wait: 10s
  receiver: default-receiver
  repeat_interval: 1m

# Stop one node_exporter and check whether an alert fires and an email arrives
[root@k8s-node1 ~]# systemctl stop node_exporter
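
To go beyond the bundled rules, extra rules can be appended to the rule groups in prometheus-rules.yaml and re-applied. A hedged sketch of one such rule (name, threshold and expression are illustrative; node.rules already ships with similar checks):

  - alert: NodeCPUHigh                 # hypothetical rule name
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage on {{ $labels.instance }} is above 80%"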

Summary

  1. Labels matter (environment, department, project, owner)
  2. Grafana is flexible
  3. PromQL
  4. Use service discovery to add targets dynamically

Next steps: Prometheus clustering, PromQL, Grafana, monitoring the applications themselves