Prometheus Monitoring
Introduction to Prometheus
Prometheus is an open-source systems monitoring and alerting framework. As a new-generation cloud-native monitoring system, Prometheus has the following advantages over traditional monitoring systems such as Nagios or Zabbix:
- Easy to manage: the Prometheus core is a single binary that runs locally and does not depend on distributed storage. Nagios, by contrast, needs specialists to install, configure, and manage it, and the process is complex.
- Correlation with business data: Prometheus monitors the actual running state of services; with its rich client libraries, users can easily add Prometheus support inside their applications and expose the true internal state of services and applications. Nagios mostly covers peripheral concerns: the state of system services and resources, and application availability.
- Efficient: a single Prometheus server can handle millions of metrics and ingest hundreds of thousands of data points per second.
- Easy to scale: functional sharding plus federation lets Prometheus be extended into a logical cluster, and client SDKs in many languages make it quick to bring applications under Prometheus monitoring.
- Good visualization: besides the built-in Prometheus UI, Prometheus provides Promdash, a standalone Ruby on Rails dashboard solution; recent versions of Grafana also offer full Prometheus support, and the Prometheus API makes it possible to build a custom monitoring UI.
Prometheus Architecture and Workflow
Components of the Prometheus framework
Component overview
- Prometheus Server: the core component, responsible for scraping, storing, and querying monitoring data. Targets can be managed through static configuration or discovered dynamically via Service Discovery, and Prometheus Server scrapes data from them. It also stores what it collects: Prometheus Server is itself a time-series database, persisting scraped samples on local disk as time series. For querying and analysis it exposes the custom PromQL language, and its federation capability lets one server pull data from other Prometheus Server instances.
- Exporters: an Exporter exposes a metrics endpoint over HTTP; Prometheus Server scrapes that endpoint to collect the data it needs. Exporters fall into two categories:
  - Direct instrumentation: software with built-in Prometheus support, such as cAdvisor, Kubernetes, etcd, and Go kit, which expose Prometheus metrics endpoints natively.
  - Indirect instrumentation: targets that do not support Prometheus natively, for which a collector is written using a Prometheus client library, e.g. MySQL Exporter, JMX Exporter, Consul Exporter.
- AlertManager: Prometheus Server supports alerting rules written in PromQL; when a rule's expression is satisfied, an alert fires. On receiving alerts from the Prometheus server, Alertmanager deduplicates and groups them and routes them to the matching receivers. Common receivers include email, PagerDuty, and webhooks.
- PushGateway: Prometheus collects data by having Prometheus Server pull from Exporters, so when the network does not allow Prometheus Server and an Exporter to communicate, PushGateway can act as a relay. Jobs inside the restricted network push their metrics to the gateway, and Prometheus Server pulls from PushGateway exactly as it would from any Exporter.
- PromQL: Prometheus's built-in query DSL. It works with three data types (see the query sketch after this list):
  - Instant vector: a set of time series with a single sample per series, e.g. http_requests_total
  - Range vector: a set of time series with a range of samples per series, e.g. http_requests_total[5m]
  - Scalar: a single numeric value with no time series, e.g. count(http_requests_total)
- Grafana: the web UI that ships with Prometheus is fairly bare-bones, so Grafana's web UI is commonly used instead to display metrics and alert data.
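These types can be seen directly through Prometheus's HTTP query API once a server is running; a quick sketch against a server at localhost:9090 (the [5m] range selector is URL-encoded as %5B5m%5D):
# instant vector: one sample per matching series
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total'
# range vector: the samples from the last 5 minutes of each series
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total%5B5m%5D'
# aggregation over the instant vector: the number of matching series
curl -s 'http://localhost:9090/api/v1/query?query=count(http_requests_total)'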
The Prometheus workflow:
1. The Prometheus server periodically pulls metrics from the configured jobs or exporters, receives metrics pushed through Pushgateway, or pulls metrics from other Prometheus servers.
2. The Prometheus server stores the collected metrics locally and evaluates the configured alerting rules (alerts.rules), either recording new time series or pushing alerts to Alertmanager.
3. Alertmanager processes the alerts it receives according to its configuration and sends out notifications.
4. The collected data is visualized in a graphical interface.
Installing Prometheus
This tutorial installs Prometheus, Alertmanager, and the Grafana web UI on Kubernetes; tutorials for installing Kubernetes itself are widely available online.
- Write the Prometheus configuration file
cat > Prometheus-configMap.yml <<'EOF'
---
apiVersion: v1
kind: Namespace
metadata:
name: prom
## The block above creates the prom namespace
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prom-cm
namespace: prom
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_timeout: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'mysql'
static_configs:
- targets: ['mysql.prom:9104']
EOF
## The ConfigMap above stores the prometheus.yml configuration file (static scrape targets are used here; auto-discovery rules can be added later, see the official docs for details)
kubectl apply -f Prometheus-configMap.yml
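The embedded prometheus.yml body can also be sanity-checked with promtool from the official image; a sketch, assuming the scrape configuration is additionally saved as a local prometheus.yml:
docker run --rm -v "$PWD":/work \
  --entrypoint /bin/promtool prom/prometheus:latest \
  check config /work/prometheus.yml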
- Write the Prometheus deployment manifests
These manifests use the PV/PVC storage model and create a PV, a PVC, a Deployment, a Service, and the RBAC objects (ServiceAccount, ClusterRole, ClusterRoleBinding) for authorization.
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: prom-pv
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Recycle
storageClassName: local-prom
local:
path: /data/prome
  nodeAffinity: ## node affinity pins this local PV to one node; see the official docs for details
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prom-pvc
namespace: prom
spec:
accessModes:
- ReadWriteOnce
storageClassName: local-prom
resources:
requests:
storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prome
labels:
app: prometheus
namespace: prom
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
name: prome
labels:
app: prometheus
namespace: prom
spec:
volumes:
- name: prom-data
persistentVolumeClaim:
claimName: prom-pvc
- name: prom-cm
configMap:
name: prom-cm
      initContainers: ## init container: fix data-directory ownership before Prometheus starts
- name: init
image: busybox
imagePullPolicy: IfNotPresent
        command: ["chown", "-R", "nobody:nobody", "/prometheus"]
volumeMounts:
- name: prom-data
mountPath: /prometheus
serviceAccountName: k8s-prom
containers:
- name: prome
image: prom/prometheus:latest
imagePullPolicy: IfNotPresent
ports:
- name: prome
containerPort: 9090
args:
- "--config.file=/etc/prometheus/prometheus.yml" #读取上面Prometheus-configmap 的配置文件
- "--storage.tsdb.path=/prometheus" #数据存放路径
- "--storage.tsdb.retention.time=30d" #存放天数
- "--web.enable-admin-api" #下面这是 支持热更新
- "--web.enable-lifecycle"
        #livenessProbe: ## liveness/readiness probes, left commented out; uncomment them to experiment
# httpGet:
# port: 9090
# path: /
#initialDelaySeconds: 80
#failureThreshold: 3
#periodSeconds: 10
#successThreshold: 1
#readinessProbe:
# httpGet:
# port: 9090
# path: /
#initialDelaySeconds: 80
#failureThreshold: 3
# periodSeconds: 10
# successThreshold: 1
        resources: ## container resource requests and limits
          requests:
            cpu: 500m
            memory: 500Mi
          limits:
            cpu: 1
            memory: 1024Mi
        volumeMounts: ## mount the config and data volumes
- name: prom-cm
mountPath: /etc/prometheus/
- name: prom-data
mountPath: /prometheus
---
apiVersion: v1
kind: Service
metadata:
name: prom-svc
namespace: prom
spec:
selector:
app: prometheus
type: NodePort
ports:
- name: prometheus
port: 9090
targetPort: 9090
nodePort: 30005
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: k8s-prom
namespace: prom
labels:
app: kube-state-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: k8s-prom
labels:
app: kube-state-metrics
rules:
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
- ingresses
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- list
- watch
- apiGroups:
- authentication.k8s.io
resources:
- tokenreviews
verbs:
- create
- apiGroups:
- authorization.k8s.io
resources:
- subjectaccessreviews
verbs:
- create
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- list
- watch
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- list
- watch
- apiGroups:
- storage.k8s.io
resources:
- storageclasses
- volumeattachments
verbs:
- list
- watch
- apiGroups:
- admissionregistration.k8s.io
resources:
- mutatingwebhookconfigurations
- validatingwebhookconfigurations
verbs:
- list
- watch
- apiGroups:
- networking.k8s.io
resources:
- networkpolicies
verbs:
- list
- watch
- apiGroups:
- coordination.k8s.io
resources:
- leases
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: k8s-prom
labels:
app: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: k8s-prom
subjects:
- kind: ServiceAccount
name: k8s-prom
namespace: prom
---
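After applying these manifests, it is worth verifying that everything came up; and since --web.enable-lifecycle is set, the configuration can be hot-reloaded after a ConfigMap change. A sketch, with <node-ip> standing in for a real node address:
kubectl -n prom get pv,pvc,deployment,svc,pods
# reload the configuration without restarting the pod
curl -X POST http://<node-ip>:30005/-/reload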
- Grafana (web UI) manifests
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
labels:
app: grafana
namespace: prom
spec:
selector:
matchLabels:
app: grafana
template:
metadata:
name: grafana
labels:
app: grafana
namespace: prom
spec:
nodeName: node-1
volumes:
- name: grafana-data
hostPath:
path: /data/prome/grafana
containers:
- name: grafana
image: grafana/grafana:latest
imagePullPolicy: IfNotPresent
ports:
- name: grafana
containerPort: 3000
protocol: TCP
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 200m
            memory: 200Mi
volumeMounts:
- name: grafana-data
          mountPath: /var/lib/grafana
---
apiVersion: v1
kind: Service
metadata:
name: grafana
labels:
app: grafana
namespace: prom
spec:
selector:
app: grafana
type: NodePort
ports:
- name: grafana
port: 3000
targetPort: 3000
nodePort: 30011
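Once Grafana is reachable on NodePort 30011 (default login admin/admin), Prometheus can be added as a data source in the UI, or scripted against Grafana's HTTP API. A sketch, assuming the in-cluster DNS name prom-svc.prom for the Prometheus Service and <node-ip> for a real node address:
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -d '{"name":"prometheus","type":"prometheus","url":"http://prom-svc.prom:9090","access":"proxy"}' \
  http://<node-ip>:30011/api/datasources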
Deploying the node-exporter collector on the nodes
A DaemonSet is used to run node-exporter on every node; the official docs explain DaemonSets for anyone unfamiliar with them.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
labels:
k8s-app: node-exporter
spec:
selector:
matchLabels:
k8s-app: node-exporter
template:
metadata:
labels:
k8s-app: node-exporter
spec:
      hostPID: true # use the host's PID namespace
      hostIPC: true # use the host's IPC namespace
      hostNetwork: true # use the host's network namespace
containers:
- image: prom/node-exporter
name: node-exporter
args:
- "--collector.systemd"
- "--collector.systemd.unit-whitelist=(docker|sshd|nginx).service"
#- "--collector.vmstat.fields=^(oom_kill|pgpg|pswp|nr|pg.*fault).*" # 需要centos8等高内核版本才支持
ports:
- containerPort: 9100
protocol: TCP
name: http
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 500m
memory: 500Mi
securityContext:
runAsUser: 0
privileged: true
volumeMounts:
- mountPath: /run/systemd/private
name: systemd-socket
readOnly: true
- name: dev
mountPath: /host/dev
- name: proc
mountPath: /host/proc
- name: sys
mountPath: /host/sys
- name: rootfs
mountPath: /host/root
      tolerations: # tolerate the master taint so pods also run on NoSchedule-tainted masters
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
volumes:
- name: systemd-socket
hostPath:
path: /run/systemd/private
- name: proc
hostPath:
path: /proc
- name: dev
hostPath:
path: /dev
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
The volumes above mount the host's directories (/proc, /dev, /sys, and /) into the container.
---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: node-exporter
name: node-exporter
spec:
ports:
- name: http
port: 9100
nodePort: 31672
protocol: TCP
type: NodePort
selector:
k8s-app: node-exporter
Once all the node-exporter pods are up, add them as scrape targets in the Prometheus ConfigMap, as sketched below.
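A sketch of the extra scrape job for the prometheus.yml in the ConfigMap (node-exporter runs with hostNetwork, so the targets are node addresses on port 9100; <node-1-ip> and <node-2-ip> are placeholders, and kubernetes_sd_configs would be the dynamic alternative):
    - job_name: 'node-exporter'
      static_configs:
        - targets: ['<node-1-ip>:9100', '<node-2-ip>:9100']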
- The Alertmanager alerting component
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
name: alertmanager-deployment
name: alertmanager
namespace: prom
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- image: prom/alertmanager:latest
name: alertmanager
ports:
- containerPort: 9093
protocol: TCP
volumeMounts:
- mountPath: "/alertmanager"
name: data
- mountPath: "/etc/alertmanager"
name: config-volume
resources:
requests:
cpu: 50m
memory: 50Mi
limits:
cpu: 200m
memory: 200Mi
volumes:
- name: data
        emptyDir: {} # the data directory could instead be mounted out, as with Prometheus
- name: config-volume
configMap:
          name: alertmanager # the ConfigMap below holds the email alerting configuration
+ Email notification configuration
A QQ mailbox is used here; other providers work as well, with the corresponding settings changed.
kind: ConfigMap
apiVersion: v1
metadata:
name: alertmanager
namespace: prom
data:
alertmanager.yml: |-
    global:
      smtp_smarthost: "smtp.qq.com:25" ### SMTP server
      smtp_from: "*************@qq.com" ### sender address
      smtp_auth_username: "********@qq.com" ### sender account used for authentication
      smtp_auth_password: "********" # authorization code
      smtp_require_tls: false
    route:
      group_by: [] ### labels used to group alerts
      group_wait: 30s ## how long to wait before sending a group's first notification
      group_interval: 1m ## how long to wait before notifying about new alerts added to a group
      repeat_interval: 1m # how long to wait before repeating a notification
      receiver: default-receiver ### name of the receiver to route to
    receivers: ## receiver definitions
      - name: default-receiver ## must match the receiver value above
        email_configs:
          - to: '*******@qq.com' ### recipient
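As with the Prometheus configuration, this file can be validated before applying; a sketch using amtool from the official image, assuming the alertmanager.yml body is also saved locally:
docker run --rm -v "$PWD":/work \
  --entrypoint /bin/amtool prom/alertmanager:latest \
  check-config /work/alertmanager.yml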
+ Writing alerting rules
A recommended site with many ready-made alerting rules: https://awesome-prometheus-alerts.grep.to/
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: prom
data:
  # general rules
general.rules: |
groups:
- name: general.rules
rules:
- alert: InstanceDown
expr: up == 0
        for: 1m # the condition must hold for 1 minute before the alert fires
labels:
severity: error
annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
After everything above is running, the Prometheus ConfigMap still needs the Alertmanager address and the path to the rule files added so that Prometheus can actually use them; a sketch of those additions follows.
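A minimal sketch of those prometheus.yml additions. The target assumes a Service named alertmanager in the prom namespace (no such Service is defined above, so it would still need to be created), and it assumes the prometheus-rules ConfigMap is mounted into the Prometheus pod at /etc/prometheus/rules:
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager.prom:9093']
    rule_files:
      - /etc/prometheus/rules/*.rules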
That concludes the walkthrough for deploying Prometheus and its alerting components on Kubernetes. Corrections of any shortcomings are welcome, in the spirit of shared learning.