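I. Introduction
Monitoring is critical to keeping business systems running normally. This article describes the architecture, features, and deployment procedure of a monitoring system built on Prometheus and Grafana. Prometheus is an open-source monitoring and alerting system that has been adopted by a growing number of companies; it joined the CNCF in 2016. Its official architecture is as follows:
The Prometheus ecosystem offers a rich set of collection plugins, usually called exporters. Prometheus actively pulls the metrics that exporters expose and stores them in its built-in TSDB. Grafana, the visualization component, queries the metrics stored in Prometheus with PromQL and displays them. Prometheus also evaluates the configured alerting rules against the collected metrics; when a rule's condition is met, it pushes the alert to the Alertmanager component, which then notifies the people responsible by email, DingTalk, SMS, and so on.
II. Monitoring architecture for the current scenario
A complete monitoring and alerting system usually covers metric collection, data storage, visualization, and alert notification. The kube-prometheus project on GitHub (https://github.com/coreos/kube-prometheus) integrates all of these. What follows is a practical application of kube-prometheus to the current scenario; the figure below shows the monitoring architecture for this scenario:
The relevant modules are: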
- prometheus
- grafana
- prometheus-adapter
- node-exporter
- mysqld-exporter
- kube-state-metrics
- blackbox
- kafka-exporter
- redis-exporter
- php-fpm-exporter
- prometheus-operator
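For TiDB, the official distribution deploys monitoring together with the TiDB application. It covers database performance, binlog, server performance, TiDB service liveness, and more. Because it is deployed from binaries, TiDB monitoring has no corresponding ServiceMonitor.
By contrast, every collection component in the figure above that has a ServiceMonitor is deployed and orchestrated on k8s.
1. Collection
ServiceMonitor service discovery
For Prometheus to scrape metrics collected by exporters in other namespaces, the metrics would otherwise have to be exposed outside the cluster, which complicates the configuration and adds security risk. ServiceMonitor addresses this. A ServiceMonitor is a Kubernetes custom resource that describes the target list of the Prometheus Server. It uses a selector to pick the endpoints of Services by their labels, and the Prometheus Server scrapes them through the Service, which enables dynamic service discovery across namespaces. One ServiceMonitor can cover a whole class of Services: deploy each exporter on k8s with its own Service, the ServiceMonitor finds the associated Services through its label selector, and Prometheus then pulls metrics from those Services. The following manifest creates the ServiceMonitor for the middleware exporters: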
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: middlewares-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: http
  selector:
    matchLabels:
      app: middlewares-exporter
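Any Service labeled app: middlewares-exporter can be discovered by this ServiceMonitor. As written, with no namespaceSelector, it only looks at Services in its own namespace (monitoring).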
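To make the selector concrete, here is a minimal sketch of a Service that the ServiceMonitor above would pick up. The Service name redis-exporter, the pod label, and port 9121 are illustrative assumptions, not values taken from this deployment. What matters is the label app: middlewares-exporter and a port named http, which is the name that the ServiceMonitor's endpoints port field refers to. The Service is placed in the monitoring namespace because the ServiceMonitor above sets no namespaceSelector.

```shell
# Write a hypothetical exporter Service manifest to a temporary file.
# "redis-exporter" and port 9121 are assumptions for illustration only.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: redis-exporter
  namespace: monitoring
  labels:
    app: middlewares-exporter   # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: redis-exporter         # pods backing this Service (assumed label)
  ports:
  - name: http                  # must equal the port name in the ServiceMonitor's endpoints
    port: 9121
    targetPort: 9121
EOF

# Print the manifest; on a real cluster it would be applied with:
#   kubectl apply -f "$manifest"
cat "$manifest"
```

To discover Services in other namespaces, the ServiceMonitor would additionally need a namespaceSelector, and Prometheus would need RBAC permissions in those namespaces.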
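File-based service discovery
Exporters that do not use a ServiceMonitor for service discovery must have their discovery configured in the Prometheus configuration file. For the configuration steps, see step 5 of the deployment section.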
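2. Visualization
This system uses Grafana as its visualization module. Grafana's official site provides a large collection of dashboard templates: configure a data source in Grafana (here, Prometheus), import an official template (see https://grafana.com/grafana/dashboards), and the corresponding metrics are displayed. A template is a JSON file that defines the panel styles, the metric queries (PromQL for a Prometheus data source), and the alert rules of each panel.
3. Alerting
Option 1: Alertmanager is a standalone alerting module. Prometheus sends alerts to Alertmanager according to its alerting rules, and after processing them Alertmanager delivers the alerts to the recipients by email, DingTalk, and so on. This option allows custom alert templates, supports more customized scenarios, and can silence and inhibit alerts, but it also makes the alerting rules more complex to configure and harder to maintain.
Option 2: Grafana, the visualization module, has its own alerting feature. Alert rules can be configured on dashboard panels, and alerts can be sent to the people responsible by email, DingTalk, and so on. Grafana's legacy dashboard alerting only supports Graph panels.
Grafana's built-in alerting already covers the current scenario and is easy to configure and manage, so this system uses Grafana's built-in alerting.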
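Deploying the monitoring system on a k8s cluster with kube-prometheus
III. prometheus-operator and its components
Prometheus is the de facto standard for Kubernetes monitoring, with powerful features and a healthy ecosystem. However, it does not support distributed deployment, data import and export, or changing scrape targets and alerting rules through an API, so using it usually involves writing scripts and code to simplify operations. Prometheus Operator provides simple definitions for monitoring Kubernetes Services and Deployments and for managing Prometheus instances, which simplifies deploying, managing, and running Prometheus and Alertmanager clusters on Kubernetes.
MetricServer: an aggregator of resource usage in the Kubernetes cluster. It collects data for consumers inside the cluster, such as kubectl, HPA, and the scheduler.
PrometheusOperator: deploys and manages Prometheus, the system monitoring and alerting toolkit that stores the monitoring data.
NodeExporter: exposes the key metrics that describe the state of each node.
KubeStateMetrics: collects data about the resource objects in the Kubernetes cluster, on which alerting rules can be defined.
Prometheus: pulls data from the apiserver, scheduler, controller-manager, and kubelet components over HTTP.
IV. Deployment
1. Prepare the installation files
Download the manifests and sort them into directories
Sort the official manifests into directories as needed. Components that are not needed in this article, such as alertmanager, are left out of the deployment.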
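git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus/
# List the remote branches, then check out the release used in this article
git branch -r
git checkout origin/release-0.12
cd manifests/
# One directory per component
mkdir adapter grafana blackbox alertmanager kube-state-metrics node-exporter serviceMonitor operator prometheus
mv grafana-* grafana/
# Match the file prefix explicitly so the glob does not also match the blackbox/ directory
mv blackboxExporter-* blackbox/
mv alertmanager-* alertmanager/
mv kubeStateMetrics-* kube-state-metrics/
mv nodeExporter-* node-exporter/
mv *Adapter* adapter/
# Remaining ServiceMonitor manifests; the leading "-" keeps the serviceMonitor/ directory out of the glob
mv *-serviceMonitor* serviceMonitor/
mv *Operator* operator/
mv prometheus-* prometheus/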
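2. Prometheus persistence
Add the following at the end of prometheus/prometheus-prometheus.yaml. The storage block goes under spec, after the existing version field: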
  version: 2.41.0
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: nfs-storage
        resources:
          requests:
            storage: 50Gi
3. Grafana persistence
3.1 Create the PVC
[root@master1 prometheus]# kubectl apply -f grafana/grafana-pvc.yaml
persistentvolumeclaim/grafana-pvc created
[root@master1 prometheus]# cat grafana/grafana-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
  labels:
    app: grafana-pvc
spec:
  accessModes: # access mode
  - ReadWriteOnce
  volumeMode: Filesystem # volume mode
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-storage # name of the StorageClass to use
3.2 Edit grafana/grafana-deployment.yaml to mount the PVC
      volumes:
      # find these two lines and comment them out
      #- emptyDir: {}
      #  name: grafana-storage
      # add the following three lines
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-pvc
4. Prepare the images that require external network access
4.1 Pull the images, tag them, and export them
docker pull registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
docker pull bitnami/kube-state-metrics:v2.7.0
docker pull bitnami/kube-state-metrics:2.7.0
docker tag bitnami/kube-state-metrics:2.7.0 registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
docker save -o kube-state-metrics-2.7.0.tar registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
docker pull v5cn/prometheus-adapter:v0.10.0
docker tag v5cn/prometheus-adapter:v0.10.0 registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.10.0
docker save -o adapter-0.10.0.tar registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.10.0
4.2 Copy the images to the other nodes
[root@master1 prometheus]# scp adapter-0.10.0.tar master2.k8s.test:/tmp
[root@master1 prometheus]# scp adapter-0.10.0.tar master3.k8s.test:/tmp
[root@master1 prometheus]# scp adapter-0.10.0.tar node1.k8s.test:/tmp
[root@master1 prometheus]# scp kube-state-metrics-2.7.0.tar master2.k8s.test:/tmp
[root@master1 prometheus]# scp kube-state-metrics-2.7.0.tar master3.k8s.test:/tmp
[root@master1 prometheus]# scp kube-state-metrics-2.7.0.tar node1.k8s.test:/tmp
4.3 Log in to each of the other nodes and load the images
[root@master2 ~]# docker load -i /tmp/kube-state-metrics-2.7.0.tar
[root@master2 ~]# docker load -i /tmp/adapter-0.10.0.tar
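The copy and load steps above repeat for every node, so they could be wrapped in a loop. The sketch below uses the node names and tarball names from the commands above and assumes passwordless ssh from master1 to the other nodes. The DRY_RUN switch and the helper names run and distribute_images are hypothetical. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
# Nodes and image tarballs taken from the commands above.
NODES="master2.k8s.test master3.k8s.test node1.k8s.test"
TARBALLS="adapter-0.10.0.tar kube-state-metrics-2.7.0.tar"

# Hypothetical helper: print the command when DRY_RUN=1 (default), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy every tarball to every node, then load it into that node's local docker.
distribute_images() {
  for node in $NODES; do
    for tarball in $TARBALLS; do
      run scp "$tarball" "$node:/tmp" || return 1
      run ssh "$node" docker load -i "/tmp/$tarball" || return 1
    done
  done
}

distribute_images
```

Setting DRY_RUN=0 before calling distribute_images executes the commands for real.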
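5. Configure dynamic service discovery for Prometheus
After Prometheus is deployed, its configuration often needs additions. Here, the extra configuration lives in a file (prometheus-additional.yaml); a Secret is created from that file and mounted into the Prometheus pod, which is how configuration changes are applied. For exporters that do not use a ServiceMonitor for service discovery, add the discovery configuration to prometheus-additional.yaml and then regenerate the Secret.
5.1 Contents of prometheus-additional.yaml: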
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
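The relabel rules above only keep Services that opt in through the prometheus.io/scrape: "true" annotation, and the __address__ rule swaps in the port given by the prometheus.io/port annotation. As a rough illustration of that one rule (this is not Prometheus itself), the sketch below joins the two source labels with ";", which is the default separator, and applies an equivalent sed expression anchored at both ends, since Prometheus anchors relabel regexes. The function name rewrite_address and the sample addresses are hypothetical. Because sed -E has no non-capturing groups, the port ends up in group 3 here rather than in $2.

```shell
# Emulate the __address__ rewrite rule from the relabel_configs above.
# $1 = __address__ (host or host:port)
# $2 = value of the prometheus.io/port annotation
rewrite_address() {
  # Join the source labels with ';' and replace the port with the annotated one.
  printf '%s;%s\n' "$1" "$2" |
    sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}

# An endpoint discovered on port 8080 whose Service is annotated with port 9121.
rewrite_address "10.244.1.7:8080" "9121"
# The same rule when the discovered address carries no port.
rewrite_address "10.244.1.7" "9121"
```

One difference from Prometheus: when the pattern does not match, sed prints the joined string unchanged, whereas Prometheus leaves __address__ as it was.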
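5.2 Configure prometheus/prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    ...
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  additionalScrapeConfigs: # enables the additional scrape (service discovery) configuration
    name: additional-configs # name of the Secret object
    key: prometheus-additional.yaml # key within the Secret
5.3 Create the Secret and verify:
If this file changes later, delete the Secret and rerun the create command below. By default the change takes effect after about 30s.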
[root@master1 prometheus]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
[root@master1 prometheus]# kubectl get secret -n monitoring
NAME                             TYPE                                  DATA   AGE
additional-configs               Opaque                                1      42s
default-token-6nnl7              kubernetes.io/service-account-token   3      148m
grafana-config                   Opaque                                1      137m
grafana-datasources              Opaque                                1      137m
grafana-token-rhsf9              kubernetes.io/service-account-token   3      137m
kube-state-metrics-token-x7ns4   kubernetes.io/service-account-token   3      136m
node-exporter-token-6dhnm        kubernetes.io/service-account-token   3      137m
prometheus-adapter-token-k8zrn   kubernetes.io/service-account-token   3      135m
prometheus-k8s-token-nxmg2       kubernetes.io/service-account-token   3      136m
[root@master1 prometheus]# kubectl get secret -n monitoring additional-configs
NAME                 TYPE     DATA   AGE
additional-configs   Opaque   1      52s
[root@master1 prometheus]# kubectl get secret -n monitoring additional-configs -oyaml
apiVersion: v1
data:
prometheus-additional.yaml: LSBqb2JfbmFtZTogImh0dHBfMnh4LWFwaSIKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiA1cwogIG1ldHJpY3NfcGF0aDogL3Byb2JlCiAgcGFyYW1zOgogICAgbW9kdWxlOiBbaHR0cF8yeHhdCiAgc3RhdGljX2NvbmZpZ3M6CiAgLSB0YXJnZXRzOgogICAgLSBodHRwczovL3d3dy5iYWlkdS5jb20KICByZWxhYmVsX2NvbmZpZ3M6CiAgLSBzb3VyY2VfbGFiZWxzOiBbX19hZGRyZXNzX19dCiAgICB0YXJnZXRfbGFiZWw6IF9fcGFyYW1fdGFyZ2V0CiAgLSBzb3VyY2VfbGFiZWxzOiBbX19wYXJhbV90YXJnZXRdCiAgICB0YXJnZXRfbGFiZWw6IGluc3RhbmNlCiAgLSB0YXJnZXRfbGFiZWw6IF9fYWRkcmVzc19fCiAgICByZXBsYWNlbWVudDogYmxhY2tib3gtZXhwb3J0ZXI6OTExNQo=
kind: Secret
metadata:
  creationTimestamp: "2023-05-25T05:06:43Z"
  name: additional-configs
  namespace: monitoring
  resourceVersion: "2953034"
  uid: 42593297-e962-4352-82f4-ed07388b190c
type: Opaque
[root@master1 prometheus]# echo "LSBqb2JfbmFtZTogImh0dHBfMnh4LWFwaSIKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiA1cwogIG1ldHJpY3NfcGF0aDogL3Byb2JlCiAgcGFyYW1zOgogICAgbW9kdWxlOiBbaHR0cF8yeHhdCiAgc3RhdGljX2NvbmZpZ3M6CiAgLSB0YXJnZXRzOgogICAgLSBodHRwczovL3d3dy5iYWlkdS5jb20KICByZWxhYmVsX2NvbmZpZ3M6CiAgLSBzb3VyY2VfbGFiZWxzOiBbX19hZGRyZXNzX19dCiAgICB0YXJnZXRfbGFiZWw6IF9fcGFyYW1fdGFyZ2V0CiAgLSBzb3VyY2VfbGFiZWxzOiBbX19wYXJhbV90YXJnZXRdCiAgICB0YXJnZXRfbGFiZWw6IGluc3RhbmNlCiAgLSB0YXJnZXRfbGFiZWw6IF9fYWRkcmVzc19fCiAgICByZXBsYWNlbWVudDogYmxhY2tib3gtZXhwb3J0ZXI6OTExNQo=" |base64 -d
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
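The data field of a Secret holds the base64 encoding of the file's bytes, which is why piping the value through base64 -d recovers the YAML. A minimal local round trip is sketched below. It uses a throwaway one-line sample file as a stand-in, not the real prometheus-additional.yaml.

```shell
# Create a small sample file (a stand-in for prometheus-additional.yaml).
sample="$(mktemp)"
printf '%s\n' "- job_name: 'demo'" > "$sample"

# Encode the file the way it appears under .data in the Secret (one line, no wrapping).
encoded="$(base64 < "$sample" | tr -d '\n')"
echo "$encoded"

# Decode it again and show that the original content comes back.
decoded="$(printf '%s' "$encoded" | base64 --decode)"
echo "$decoded"
```

On a cluster, the same check can be done without copying the string by hand, for example with kubectl get secret additional-configs -n monitoring -o jsonpath='{.data.prometheus-additional\.yaml}' piped into base64 -d.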
# Verify inside the prometheus container
[root@master1 prometheus]# kubectl exec -n monitoring prometheus-k8s-0 -it -- sh
/prometheus $ cat /etc/prometheus/config_out/prometheus.env.yaml |grep kubernetes-service -A 3
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
/prometheus $
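Instead of deleting and recreating the Secret after each edit, a common kubectl pattern is to render the Secret with --dry-run=client and pipe it into kubectl apply, which updates the object in place. The sketch below wraps that in a hypothetical function, update_additional_configs. The KUBECTL variable is also an assumption, added only so the command can be swapped out when trying the function without a cluster.

```shell
# Hypothetical helper: re-render the additional-configs Secret from the local
# prometheus-additional.yaml and apply it, avoiding a delete/create cycle.
update_additional_configs() {
  kubectl_bin="${KUBECTL:-kubectl}"
  # Render the Secret locally as YAML, then apply it to the cluster.
  "$kubectl_bin" create secret generic additional-configs \
    --from-file=prometheus-additional.yaml \
    -n monitoring --dry-run=client -o yaml |
    "$kubectl_bin" apply -n monitoring -f -
}
```

Run update_additional_configs from the directory that contains prometheus-additional.yaml. Because the Secret was originally created with kubectl create, the first apply may print a warning about the missing last-applied-configuration annotation, which kubectl then patches automatically.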
6. Install
[root@master1 prometheus]# kubectl create -f setup/ -f operator/ -f adapter/ -f grafana/ -f serviceMonitor/ -f blackbox/ -f kube-state-metrics/ -f node-exporter/ -f prometheus/
7. Verify
[root@master1 prometheus]# kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
blackbox-exporter-58d99cfb6d-bxvsb     3/3     Running   0          56m
grafana-69b474cbc8-8zqh9               1/1     Running   0          56m
kube-state-metrics-c9f8b947b-fgzqx     3/3     Running   0          56m
node-exporter-6sb9v                    2/2     Running   0          56m
node-exporter-9v7bq                    2/2     Running   0          56m
node-exporter-cv9xt                    2/2     Running   0          56m
node-exporter-nm86z                    2/2     Running   0          56m
prometheus-adapter-5bf8d6f7c6-nspkp    1/1     Running   0          52m
prometheus-adapter-5bf8d6f7c6-xn9fw    1/1     Running   0          52m
prometheus-k8s-0                       2/2     Running   0          65s
prometheus-k8s-1                       2/2     Running   0          65s
prometheus-operator-6958d799cd-nqscz   2/2     Running   0          56m
# Verify the persistent volumes
[root@master1 prometheus]# ls /data/nfs_provisioner/monitoring-prometheus-k8s-db-prometheus-k8s-0-pvc-4464e6f6-8966-4b25-994d-8cd498961404/prometheus-db/
chunks_head lock queries.active wal
[root@master1 prometheus]# ls /data/nfs_provisioner/monitoring-prometheus-k8s-db-prometheus-k8s-1-pvc-717f082f-4259-4718-81db-89d11f9c2bd9/prometheus-db/
chunks_head lock queries.active wal
[root@master1 prometheus]# ls /data/nfs_provisioner/monitoring-grafana-pvc-pvc-388b94be-22fe-4139-b582-8f238860cd3f/
alerting csv file-collections grafana.db plugins png
8. Configure access from outside the cluster
8.1 Delete the network policies
[root@master1 prometheus]# kubectl -n monitoring delete networkpolicies.networking.k8s.io --all
8.2 External access through ingress
8.2.1 Configure grafana-ing
# Check the service name: grafana
[root@master1 prometheus]# kubectl get svc -n monitoring|grep grafana
grafana NodePort 10.96.131.142 <none> 3000:30030/TCP 130m
# Write the ingress manifest
[root@master1 prometheus]# cat grafana/grafana-ing.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ing
  namespace: monitoring
spec:
  rules:
  - host: grafana.example.com
    http:
      paths:
      - backend:
          service:
            name: grafana
            port:
              number: 3000
        path: /
        pathType: Prefix
# Create the ingress
[root@master1 prometheus]# kubectl create -f grafana/grafana-ing.yaml
ingress.networking.k8s.io/grafana-ing created
8.2.2 Configure prometheus-ing
# Check the service name: prometheus-k8s
[root@master1 prometheus]# kubectl get svc -n monitoring|grep prometheus-k8s
prometheus-k8s NodePort 10.96.192.126 <none> 9090:30090/TCP,8080:32245/TCP 74m
# Write the ingress manifest
[root@master1 prometheus]# cat prometheus/prometheus-ing.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-k8s-ing
  namespace: monitoring
spec:
  rules:
  - host: prometheus-k8s.example.com
    http:
      paths:
      - backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
        path: /
        pathType: Prefix
# Create the ingress
[root@master1 prometheus]# kubectl create -f prometheus/prometheus-ing.yaml
ingress.networking.k8s.io/prometheus-k8s-ing created
8.2.3 Verify
[root@master1 prometheus]# kubectl get ing -n monitoring
NAME                 CLASS   HOSTS                        ADDRESS         PORTS   AGE
grafana-ing          nginx   grafana.example.com          10.96.117.202   80      2m9s
prometheus-k8s-ing   nginx   prometheus-k8s.example.com   10.96.117.202   80      2m17s
8.3 External access through NodePort
8.3.1 Configure the grafana NodePort
# Edit the manifest and add:
type: NodePort
nodePort: 30030
[root@master1 prometheus]# cat grafana/grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 9.3.2
  name: grafana
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 3000
    nodePort: 30030 # added: the NodePort port
    targetPort: http
  type: NodePort # added: Service type set to NodePort
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
# Apply the change
[root@master1 prometheus]# kubectl apply -f grafana/grafana-service.yaml
service/grafana configured
# Verify
[root@master1 prometheus]# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.96.131.142 <none> 3000:30030/TCP 119m
8.3.2 Configure the prometheus NodePort
# Edit the manifest and add:
type: NodePort
nodePort: 30090
[root@master1 prometheus]# cat prometheus/prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.41.0
  name: prometheus-k8s
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    nodePort: 30090 # added: the NodePort port
    targetPort: web
  - name: reloader-web
    port: 8080
    targetPort: reloader-web
  type: NodePort # added: Service type set to NodePort
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
# Apply the change
[root@master1 prometheus]# kubectl apply -f prometheus/prometheus-service.yaml
Warning: resource services/prometheus-k8s is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
service/prometheus-k8s configured
# Verify
[root@master1 prometheus]# kubectl get svc -n monitoring|grep prometheus
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-k8s NodePort 10.96.192.126 <none> 9090:30090/TCP,8080:32245/TCP 69m
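V. Verification
1. Add hosts entries on the local machine
Edit the file C:\Windows\System32\drivers\etc\hosts
2. Verify in a browser
2.1 Verify grafana
- The default grafana username and password are admin/admin
2.2 Verify prometheus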
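The hosts file needs one line per ingress hostname, mapping it to an address where the ingress controller is reachable. That address depends on the environment, so 192.0.2.10 below is only a placeholder from the documentation address range, and the helper hosts_entries is hypothetical. It just prints the lines to paste into the hosts file.

```shell
# Print hosts-file lines for the two ingress hostnames configured earlier.
# $1 = IP address where the ingress controller is reachable (environment-specific).
hosts_entries() {
  for host in grafana.example.com prometheus-k8s.example.com; do
    printf '%s %s\n' "$1" "$host"
  done
}

# 192.0.2.10 is a placeholder address, not a value from this cluster.
hosts_entries 192.0.2.10
```

When the NodePort approach is used instead of ingress, no hosts entry is needed: the services are reached at a node IP on ports 30030 (grafana) and 30090 (prometheus).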