Global configuration
Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
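To make the schema above concrete, here is a minimal sketch of a filled-in prometheus.yml. The intervals, the external label cluster: demo, the rules/*.yml glob, and the localhost targets are all illustrative assumptions, not values from this article; 9090 and 9093 are the default Prometheus and Alertmanager ports.

```yaml
global:
  scrape_interval: 30s        # overrides the 1m default
  scrape_timeout: 10s         # must not exceed scrape_interval
  evaluation_interval: 30s
  external_labels:
    cluster: demo             # hypothetical label value

rule_files:
  - "rules/*.yml"             # hypothetical path glob

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]   # Prometheus scraping itself

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"] # assumed local Alertmanager
```

A file like this can be validated before use with promtool check config prometheus.yml.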
Changing metric labels
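Labels can be rewritten at three points: before a scrape, after a scrape, and when an alert is sent.
- Before the scrape, with relabel_configs.
- After the scrape but before the samples are written to storage, with metric_relabel_configs.
- Before an alert is sent, with alert_relabel_configs.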
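The three hooks sit in different parts of the configuration file. The following sketch shows where each one goes; the job name, the target, the node_address label, the go_gc_.* pattern, and the replica label are hypothetical and chosen only for illustration.

```yaml
scrape_configs:
  - job_name: example                 # hypothetical job
    static_configs:
      - targets: ["localhost:9100"]   # hypothetical target
    relabel_configs:
      # Runs before the scrape: copy the target address into a label.
      - source_labels: [__address__]
        target_label: node_address
    metric_relabel_configs:
      # Runs after the scrape, before storage: drop unwanted series.
      - source_labels: [__name__]
        regex: go_gc_.*
        action: drop

alerting:
  alert_relabel_configs:
    # Runs before alerts are sent to Alertmanager: remove a label.
    - regex: replica
      action: labeldrop
```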
Job configuration
- job_name: prometheus
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_prometheus
    regex: k8s
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
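- kubernetes_sd_configs: lets Prometheus automatically discover node, service, pod, endpoints, and ingress objects in k8s and monitor them; see the official documentation for details. The meaning of every __meta_kubernetes_xxxxx label is also documented there.
- endpoints: targets are discovered through the endpoints role. Every time a service is created, a matching endpoint is created with it, so scraping through the endpoint reaches all of the pods behind the service.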
- The rule below keeps only targets whose service carries the label prometheus=k8s; Prometheus scrapes the endpoints of those services and drops everything else. For that reason, the service created for Prometheus later needs the label prometheus: k8s.
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_prometheus
    regex: k8s
- The rule below works as follows. Suppose __meta_kubernetes_endpoint_address_target_kind is Pod and __meta_kubernetes_endpoint_address_target_name is prometheus-0. Joined with the separator ;, they become Pod;prometheus-0. The regular expression Pod;(.*) matches this string, ${1} refers to the first capture group, which is prometheus-0, and that value is written to the pod label. The rule therefore adds the label pod=prometheus-0 to every metric scraped from that target. If __meta_kubernetes_endpoint_address_target_kind is not Pod, the regex does not match and no label is added.
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
- No URL path is specified, so Prometheus scrapes the default path /metrics.
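The keep rule in this job only matches a service labeled prometheus: k8s in the monitoring namespace. A minimal sketch of such a Service might look like the following; the name prometheus, the selector app: prometheus, and the port name web are assumptions for illustration, while 9090 is the default Prometheus port.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus            # hypothetical service name
  namespace: monitoring       # namespace watched by the job
  labels:
    prometheus: k8s           # surfaces as __meta_kubernetes_service_label_prometheus="k8s"
spec:
  selector:
    app: prometheus           # hypothetical pod label
  ports:
    - name: web               # hypothetical port name
      port: 9090
      targetPort: 9090
```

With this in place, each pod behind the service appears as an endpoints target, and the second relabel rule fills in its pod label.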
Defining alerting rules
groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
Official reference: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
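for: the alert fires only once the condition has stayed active for 10 minutes; Prometheus checks on each evaluation that it is still active before sending the alert.
labels: specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.
annotations: specifies a set of informational labels that can store longer additional information, such as an alert description or a runbook link. Annotation values can be templated.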
Templating
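Label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance. The configured external labels can be accessed through the $externalLabels variable. The $value variable holds the evaluated value of an alert instance.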
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
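As a sketch of all three variables used together, the HighRequestLatency rule from earlier could be annotated as follows. It assumes an external label named cluster has been set under global.external_labels (hypothetical); humanizeDuration is one of the built-in template functions and formats a number of seconds as a readable duration.

```yaml
groups:
  - name: templating-example          # hypothetical group name
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          # $labels: labels of the firing series; $externalLabels: global external_labels.
          summary: "High request latency for job {{ $labels.job }} in cluster {{ $externalLabels.cluster }}"
          # $value: result of the expression for this series, in seconds here.
          description: "Mean latency over 5m is {{ $value | humanizeDuration }}"
```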
Example:
groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
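One way to check that a rule such as InstanceDown fires as expected is a promtool rule unit test. The sketch below assumes the rules above are saved as alerts.yml (a hypothetical filename) and uses a made-up target localhost:9100 in a job called node.

```yaml
rule_files:
  - alerts.yml                        # hypothetical file holding the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports up == 0 for the whole test window.
      - series: 'up{job="node", instance="localhost:9100"}'
        values: '0x10'
    alert_rule_test:
      # At 6m the condition has held longer than the 5m "for" period.
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              instance: localhost:9100
              job: node
            exp_annotations:
              summary: "Instance localhost:9100 down"
              description: "localhost:9100 of job node has been down for more than 5 minutes."
```

Saved next to the rules file, the test can be run with promtool test rules followed by the test file name.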