Prometheus configuration file

global, rule_files, remote_read, remote_write

1. global (global configuration)

global:
  # How frequently to scrape targets. Default: 1m.
  scrape_interval: 15s

  # How long before a scrape request times out. Default: 10s.
  scrape_timeout: 10s

  # How frequently Prometheus evaluates rules (recording rules and alerting rules). Default: 1m.
  # In other words, the interval at which rules are executed.
  evaluation_interval: 30s

  # External labels used to distinguish this Prometheus instance from others
  external_labels:
    prometheus: test

  # File to which PromQL queries are logged. Reloading the configuration reopens the file.
  query_log_file: /tmp/query.log
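
To validate the file as a whole (including the global block above), promtool can check a complete Prometheus configuration. A minimal sketch, assuming the file is saved as prometheus.yml in the current directory and promtool is on the PATH:

# Validate the whole configuration, including any rule files it references
./promtool check config prometheus.yml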

2. rule_files (rule configuration)

Prometheus supports two kinds of rules:

  • Recording rules: precompute frequently used or expensive expressions and save the result as a new time series, so that later queries are faster and consume fewer resources.
  • Alerting rules: define custom alerting conditions.

# Load rules from the specified files
rule_files:
  - "first.rules"
  - "my/*.rules"

Prometheus supports two types of rules: recording rules and alerting rules. To include rules in Prometheus, create a file containing the necessary rule statements and have Prometheus load it via the rule_files field in the Prometheus configuration.
Rule files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The changes are applied only if all rule files are well formed.
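
For example, a reload can be triggered like this (a minimal sketch; it assumes the process name is prometheus and, for the HTTP variant, that Prometheus was started with --web.enable-lifecycle):

# Reload configuration and rule files by sending SIGHUP to the running process
kill -HUP $(pidof prometheus)

# Alternative: use the lifecycle HTTP endpoint (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload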

Checking rule syntax
To quickly check whether a rule file is syntactically correct without starting a Prometheus process, install and run Prometheus's promtool command-line tool:

go get github.com/prometheus/prometheus/cmd/promtool
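
Alternatively, promtool ships with the official release tarball alongside the prometheus binary, as the directory listing below shows. A sketch, assuming the 2.2.1 linux-amd64 release used in this example:

# Download and unpack an official release; promtool is included next to the prometheus binary
wget https://github.com/prometheus/prometheus/releases/download/v2.2.1/prometheus-2.2.1.linux-amd64.tar.gz
tar -xzf prometheus-2.2.1.linux-amd64.tar.gz
cd prometheus-2.2.1.linux-amd64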

Usage example

[root@fabric-cli prometheus-2.2.1.linux-amd64]# ls -l
total 108104
drwxrwxr-x 2 1000 1000       38 Mar 14 22:14 console_libraries
drwxrwxr-x 2 1000 1000      173 Mar 14 22:14 consoles
drwxr-xr-x 5 root root       85 May 12 00:05 data
-rw-rw-r-- 1 1000 1000    11357 Mar 14 22:14 LICENSE
-rw-rw-r-- 1 1000 1000     2769 Mar 14 22:14 NOTICE
-rwxr-xr-x 1 1000 1000 66176282 Mar 14 22:17 prometheus
-rw-r--r-- 1 root root      167 May  4 10:47 prometheus.rules.yml
-rw-rw-r-- 1 1000 1000      879 May  4 10:49 prometheus.yml
-rwxr-xr-x 1 1000 1000 44492910 Mar 14 22:18 promtool

[root@fabric-cli prometheus-2.2.1.linux-amd64]# ./promtool check rules prometheus.rules.yml 
Checking prometheus.rules.yml
  SUCCESS: 1 rules found

Rule syntax:

groups:
  [ - <rule_group> ]

Syntax of <rule_group>
# The name of the group. Must be unique within a file.
name: <string>

# How often rules in the group are evaluated.
[ interval: <duration> | default = global.evaluation_interval ]

rules:
  [ - <rule> ... ]

Syntax of <rule>
# The name of the time series to output to. Must be a valid metric name.
record: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string>

# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]

Examples

groups:
  - name: example
    rules:
    - record: job:http_inprogress_requests:sum
      expr: sum(http_inprogress_requests) by (job)

groups:
  - name: instance
    rules:
    - record: test:go_memstats_alloc_bytes:sum
      expr: sum(go_memstats_alloc_bytes) by (instance)

    - record: instance:node_cpu_load:1m
      expr: node_load1

    - record: instance:node_cpu_load:5m
      expr: node_load5

    - record: instance:node_cpu_load:15m
      expr: node_load15

    - record: instance:node_cpu_util:1m
      expr: (1 - (sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance))) * 100
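
Recording rules like these are typically consumed by dashboards or by alerting rules. A minimal sketch of an alert built on the recorded CPU utilization series above (the 80% threshold, the alert name, and the labels are illustrative assumptions, not part of the original configuration):

groups:
  - name: instance-alerts
    rules:
      # Fires when the precomputed CPU utilization stays above 80% for 5 minutes (threshold is an assumption)
      - alert: HighCpuUtilization
        expr: instance:node_cpu_util:1m > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization on {{ $labels.instance }}"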

The alerting rule syntax is as follows:

# The name of the alert. Must be a valid metric name.
alert: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>

# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]

# Labels to add or overwrite for each alert.
labels:
  [ <labelname>: <tmpl_string> ]

# Annotations to add to each alert.
annotations:
  [ <labelname>: <tmpl_string> ]

Alerting rule example

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
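
Newer promtool releases (later than the 2.2.1 build shown earlier) can also unit-test rules with promtool test rules. A minimal sketch for the HighErrorRate alert above, assuming the rule group is saved in alerts.yml and the test file is named alerts_test.yml:

# alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Feed a constant latency of 0.7s for the tested job, above the 0.5 threshold
    input_series:
      - series: 'job:request_latency_seconds:mean5m{job="myjob"}'
        values: '0.7+0x20'
    alert_rule_test:
      # After 15 minutes the 10m "for" duration has elapsed, so the alert should be firing
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: page
              job: myjob
            exp_annotations:
              summary: High request latency

Run it with: ./promtool test rules alerts_test.yml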

3. remote_read and remote_write (remote read/write configuration)

remote_write:
  # URL of the endpoint to send samples to
  - url: http://remote1/push
    # Name of the remote write config. If specified, it must be unique across remote write
    # configs. The name is used in metrics and logging in place of a generated value, to help
    # users distinguish between remote write configs.
    name: drop_expensive
    # Relabeling applied to samples before they are sent to the remote endpoint
    write_relabel_configs:
    - source_labels: [__name__]
      regex:         expensive.*
      action:        drop
  # Second remote write endpoint
  - url: http://remote2/push
    name: rw_tls
    # TLS settings for this connection
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
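
Remote write endpoints often require authentication as well. The same entries accept the standard HTTP client options, for example basic_auth (the credentials below are placeholders, not part of the original configuration):

remote_write:
  - url: http://remote1/push
    # Placeholder credentials for illustration only
    basic_auth:
      username: prom_writer
      password: secret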

remote_read:
  # URL of the endpoint to read samples from
  - url: http://remote1/read
    # Whether recent data should also be read from remote storage. Prometheus always reads
    # recent data from local storage; when this is true it merges local and remote results.
    # Defaults to false, i.e. recent data is served from the local cache only.
    read_recent: true
    name: default
  # Second remote read endpoint
  - url: http://remote3/read
    # Serve recent data from the local cache only
    read_recent: false
    name: read_special
    # Optional list of equality matchers that must be present in a selector in order to
    # query this remote read endpoint.
    required_matchers:
      job: special
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
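
Once the configuration is in place, Prometheus is started against it. A minimal sketch, assuming the binary and prometheus.yml from the directory listing above:

# Start Prometheus with the configuration discussed in this section
./prometheus --config.file=prometheus.yml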