Notes on Prometheus Basic Concepts and Usage

Felix2531 2019-01-04 17:44:45 Category: k8s

© Copyright belongs to the author. This is an original work by Felix2531 on the 51CTO blog; contact the author for permission before republishing.

Prometheus

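Basic concepts

Prometheus is an open-source system monitoring and alerting framework.

All monitoring data that Prometheus collects is stored as metrics in its built-in time series database (TSDB). A time series is a stream of timestamped values that share the same metric name and the same set of labels. Besides the stored time series, Prometheus can also produce temporary, derived time series as the result of a query.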

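Features:
- A powerful multi-dimensional data model
- A flexible query language
- Easy to operate
- Efficient
- Collects time series data using a pull model
- Multiple options for visualization
- Easy to scale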
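Prometheus components and architecture:
- Prometheus server: responsible for collecting and storing data and for supporting the PromQL query language. Prometheus is a time series database; it stores the collected monitoring data as time series on local disk.
- Pushgateway: an intermediary gateway that lets short-lived jobs push their metrics.
- PromDash: a dashboard built with Rails for visualizing metric data (now deprecated in favor of Grafana).
- Exporters: a program that exposes information about a monitored component over an HTTP endpoint is called an exporter; for example, one that reports the running state of a machine.
  - Direct collection: the component has built-in Prometheus support and exposes a metrics endpoint to Prometheus directly.
  - Indirect collection: the component does not support Prometheus natively, so a separate collector program is written for it using the client libraries that Prometheus provides.
- Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the matching receiver, and sends the notifications. Common receivers include email, PagerDuty, OpsGenie, and webhooks.
- Web UI: provides a graphical interface on port 9090.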
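How it works
- The Prometheus server periodically pulls metrics from the configured jobs or exporters, from the Pushgateway (which holds metrics pushed by short-lived jobs), or from other Prometheus servers.
- The Prometheus server stores the collected metrics locally and evaluates the defined alert rules, recording new time series or pushing alerts to Alertmanager.
- Alertmanager processes the alerts it receives according to its configuration file and sends out notifications.
- The collected data is visualized in a graphical interface.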
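Core concepts:

Data model: Prometheus stores data as time series. Each time series is uniquely identified by its metric name and a set of labels (key-value pairs); a different label set means a different time series.

Samples: the actual data points of a time series. Each sample consists of a float64 value and a millisecond-precision timestamp (metric + timestamp + sample value).

Metric name: carries meaning and describes what is measured. For example, http_requests_total is the total number of HTTP requests. A metric name consists of ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

Labels: give a time series its distinguishing dimensions. For example, http_requests_total{method="Get"} covers only the Get requests among all HTTP requests; with method="post" it is a new time series. A label name consists of ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*.

Format: <metric name>{<label name>=<label value>, …}, for example http_requests_total{method="POST",endpoint="/api/tracks"}.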
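
The identity rule above (metric name plus label set) can be sketched in a few lines of Python. This is only an illustration of the data model, not how Prometheus implements its TSDB; the `store` dictionary and the `series_key`, `format_series`, and `append` helpers are hypothetical names introduced here.

```python
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
Sample = Tuple[int, float]  # (timestamp in milliseconds, float64 value)


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a series: the metric name plus the complete label set."""
    return (name, frozenset(labels.items()))


def format_series(name: str, labels: Dict[str, str]) -> str:
    """Render a series as <metric name>{<label name>="<label value>",...}."""
    if not labels:
        return name
    body = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{body}}}"


# Hypothetical in-memory store, for illustration only.
store: Dict[SeriesKey, List[Sample]] = {}


def append(name: str, labels: Dict[str, str], timestamp_ms: int, value: float) -> None:
    """Add one sample to the series identified by name and labels."""
    store.setdefault(series_key(name, labels), []).append((timestamp_ms, float(value)))


# Same name and same labels (in any order) land in the same series.
append("http_requests_total", {"method": "POST", "endpoint": "/api/tracks"}, 1546595085000, 3)
append("http_requests_total", {"endpoint": "/api/tracks", "method": "POST"}, 1546595100000, 5)
# A different label value creates a separate series.
append("http_requests_total", {"method": "GET", "endpoint": "/api/tracks"}, 1546595100000, 7)

for (name, labels), samples in store.items():
    print(format_series(name, dict(labels)), samples)
```

Using a `frozenset` of label pairs as the key makes the label order irrelevant, which mirrors the fact that a label set is unordered.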

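Metric types

Counter: a cumulative metric whose value only increases (or resets to zero on restart).

Gauge: a metric whose value can go up and down.

Histogram: counts observations in configurable buckets and also tracks their sum and count.

Summary: tracks the count and sum of observations and reports configurable quantiles.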
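
To make the difference between the types concrete, here is a minimal sketch of a counter, a gauge, and a histogram with cumulative buckets. The classes are hypothetical and written for illustration; they are not the API of any Prometheus client library, and the summary type is left out.

```python
import bisect
from typing import List, Tuple


class Counter:
    """Cumulative metric: the value can only increase."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Metric whose value can go up and down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = float(value)

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations per bucket and tracks their sum and count."""

    def __init__(self, upper_bounds: List[float]) -> None:
        # The last bucket always has an infinite upper bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self) -> List[Tuple[float, int]]:
        """Buckets as exposed to a scraper: each one includes all smaller ones."""
        total = 0
        result = []
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            total += bucket_count
            result.append((bound, total))
        return result


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.5, 0.7, 2.0):
    latency.observe(seconds)
print(latency.cumulative_buckets())
```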

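PromQL queries

Data types

Instant vector: a set of time series, each containing a single sample.
Range vector: a set of time series, each containing the samples within a time range.
Scalar: a single floating-point value.
String: a simple string value.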
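
The difference between an instant vector and a range vector can be sketched over a small set of stored samples. This is a simplified model: the `SERIES` data is made up, staleness handling is ignored, and treating the range window as closed on both ends is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in milliseconds, value)

# Hypothetical stored samples, sorted by timestamp.
SERIES: Dict[str, List[Sample]] = {
    'up{job="node"}': [(0, 1.0), (15_000, 1.0), (30_000, 0.0), (45_000, 1.0)],
    'up{job="prometheus"}': [(0, 1.0), (15_000, 1.0), (30_000, 1.0), (45_000, 1.0)],
}


def instant_vector(series: Dict[str, List[Sample]], at_ms: int) -> Dict[str, Sample]:
    """One sample per series: the most recent one at or before at_ms."""
    result = {}
    for name, samples in series.items():
        earlier = [sample for sample in samples if sample[0] <= at_ms]
        if earlier:
            result[name] = earlier[-1]
    return result


def range_vector(series: Dict[str, List[Sample]], at_ms: int, window_ms: int) -> Dict[str, List[Sample]]:
    """Several samples per series: everything inside the window ending at at_ms."""
    result = {}
    for name, samples in series.items():
        window = [sample for sample in samples if at_ms - window_ms <= sample[0] <= at_ms]
        if window:
            result[name] = window
    return result


print(instant_vector(SERIES, 40_000))
print(range_vector(SERIES, 45_000, 30_000))
```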

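Time series selectors

Instant vector selectors:
    e.g. http_requests_total; append a set of label matchers inside {} to filter the time series.
    Label matching operators:
    = : select labels that are exactly equal to the provided string.
    != : select labels that are not equal to the provided string.
    =~ : select labels that match the provided regular expression (the match is fully anchored).
    !~ : select labels that do not match the provided regular expression.
Range vector selectors:
    e.g. http_requests_total{job="prometheus"}[5m]; the range in [] selects the samples within that period.
    Time units:
    s - seconds
    m - minutes
    h - hours
    d - days
    w - weeks
    y - years
Time offset:
    Instant and range vector expressions are both evaluated relative to the current time by default.
    e.g. http_requests_total offset 5m (the offset keyword must come immediately after the selector ({}))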
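
The four matching operators and the duration units can be sketched as follows. This is an approximation for illustration: Python's `re` module is used in place of the RE2 syntax that Prometheus uses, and `matches`, `select`, and `parse_duration` are hypothetical helper names. The sketch only handles single-unit durations such as 5m.

```python
import re
from typing import Dict, List, Tuple

Matcher = Tuple[str, str, str]  # (label name, operator, operand)


def matches(label_value: str, operator: str, operand: str) -> bool:
    """Apply one label matcher; regular expressions must match the whole value."""
    if operator == "=":
        return label_value == operand
    if operator == "!=":
        return label_value != operand
    if operator == "=~":
        return re.fullmatch(operand, label_value) is not None
    if operator == "!~":
        return re.fullmatch(operand, label_value) is None
    raise ValueError(f"unknown operator: {operator}")


def select(series_labels: List[Dict[str, str]], matchers: List[Matcher]) -> List[Dict[str, str]]:
    """Keep the label sets that satisfy every matcher; a missing label counts as empty."""
    return [
        labels
        for labels in series_labels
        if all(matches(labels.get(name, ""), op, operand) for name, op, operand in matchers)
    ]


UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def parse_duration(text: str) -> int:
    """Convert a duration such as 5m into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"invalid duration: {text}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


series = [
    {"job": "prometheus", "method": "GET"},
    {"job": "node", "method": "POST"},
    {"job": "node-canary", "method": "GET"},
]
print(select(series, [("job", "=~", "node.*"), ("method", "!=", "POST")]))
print(parse_duration("5m"))
```

An offset of 5m then simply moves the evaluation time back by `parse_duration("5m")` seconds before the selector is applied.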

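Operators

Arithmetic binary operators:
    e.g. addition, subtraction, multiplication, division (+, -, *, /)
Comparison (boolean) operators:
    e.g. ==, !=, <, >, <=, >=
Set operators:
    and, or, unless
Vector matching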

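Aggregation

Syntax: <aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]. Only count_values, quantile, topk, and bottomk take a parameter.
sum (sum); min (minimum); max (maximum); avg (average); stddev (standard deviation); stdvar (standard variance); count (number of elements); count_values (number of elements with the same value); bottomk (the k elements with the smallest sample values); topk (the k elements with the largest sample values); quantile (quantile over the values)
without removes the listed labels from the result and keeps all others. by does the opposite: only the listed labels are kept in the result vector and the rest are removed. With without and by, data can be aggregated along chosen dimensions of the samples.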
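
The effect of by and without on the result labels can be sketched for the sum operator. This is a simplified model: samples are plain (labels, value) pairs without a metric name, and `aggregate_sum` is a hypothetical helper, not part of any Prometheus library.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Labels = Dict[str, str]
GroupKey = FrozenSet[Tuple[str, str]]


def aggregate_sum(
    samples: List[Tuple[Labels, float]],
    by: Optional[Iterable[str]] = None,
    without: Optional[Iterable[str]] = None,
) -> Dict[GroupKey, float]:
    """Sum sample values per group of output labels."""
    if by is not None and without is not None:
        raise ValueError("use either by or without, not both")
    groups: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in samples:
        if without is not None:
            dropped = set(without)
            kept = {k: v for k, v in labels.items() if k not in dropped}
        else:
            # With neither clause, every label is dropped and one total remains.
            wanted = set(by or ())
            kept = {k: v for k, v in labels.items() if k in wanted}
        groups[frozenset(kept.items())] += value
    return dict(groups)


samples = [
    ({"job": "api", "instance": "a", "method": "GET"}, 10.0),
    ({"job": "api", "instance": "b", "method": "GET"}, 20.0),
    ({"job": "api", "instance": "a", "method": "POST"}, 5.0),
]

print(aggregate_sum(samples, by=["method"]))         # like: sum(...) by (method)
print(aggregate_sum(samples, without=["instance"]))  # like: sum(...) without (instance)
print(aggregate_sum(samples))                        # like: sum(...)
```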

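Jobs and instances

To collect different monitoring metrics, we run the corresponding collector programs and tell the Prometheus server the addresses of these exporter instances. Each HTTP endpoint that exposes monitoring samples is called an instance; a running node exporter, for example, is one instance.

A group of instances that serve the same collection purpose, or multiple replicas of one collector process, is managed as a single job.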
```
* job: node
    * instance 2: 1.2.3.4:9100
    * instance 4: 5.6.7.8:9100
```

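HTTP API query format

Instant queries:
    URL request parameters:
    e.g. 'http://localhost:9090/api/v1/query?query=up&time=2015-07-01T20:10:51.781Z'
    query=: the PromQL expression.
    time=<rfc3339 | unix_timestamp>: the evaluation timestamp for the PromQL expression. Optional; defaults to the current system time.
    timeout=: the evaluation timeout. Optional; defaults to the global --query.timeout setting.
Range queries:
    URL request parameters:
    e.g. 'http://localhost:9090/api/v1/query_range?query=up&start=2015-07-01T20:10:30.781Z&end=2015-07-01T20:11:00.781Z&step=15s'
    query=: the PromQL expression.
    start=<rfc3339 | unix_timestamp>: the start time.
    end=<rfc3339 | unix_timestamp>: the end time.
    step=: the query resolution step.
    timeout=: the evaluation timeout. Optional; defaults to the global --query.timeout setting.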
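
The request URLs above can be built with the Python standard library, which also takes care of URL-encoding the PromQL expression. This is a minimal sketch: `BASE_URL` assumes a server on localhost:9090 as in the examples above, and the helper names are hypothetical. `run_query` only works when such a server is reachable, so it is defined here but not called.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:9090"  # assumption: a local Prometheus server


def instant_query_url(query: str, time: Optional[str] = None, timeout: Optional[str] = None) -> str:
    """Build the URL for /api/v1/query; time and timeout are optional."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    if timeout is not None:
        params["timeout"] = timeout
    return f"{BASE_URL}/api/v1/query?{urlencode(params)}"


def range_query_url(query: str, start: str, end: str, step: str, timeout: Optional[str] = None) -> str:
    """Build the URL for /api/v1/query_range."""
    params = {"query": query, "start": start, "end": end, "step": step}
    if timeout is not None:
        params["timeout"] = timeout
    return f"{BASE_URL}/api/v1/query_range?{urlencode(params)}"


def run_query(url: str) -> dict:
    """Send the request and decode the JSON body (needs a reachable server)."""
    with urlopen(url) as response:
        return json.load(response)


print(instant_query_url("up", time="2015-07-01T20:10:51.781Z"))
print(range_query_url("up", "2015-07-01T20:10:30.781Z", "2015-07-01T20:11:00.781Z", "15s"))
```

Encoding matters as soon as the expression contains braces or quotes, for example `http_requests_total{job="prometheus"}`.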

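Prometheus alerting

Alert rule definition

Alert name: a user-defined name.

Alert rule: the trigger condition, written as a PromQL expression and defined in a configuration file.

   groups:
   - name: example
     rules:
     - alert: HighErrorRate
       expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
       for: 10m
       labels:
         severity: page
       annotations:
         summary: High request latency
         description: description info
    # groups: defines a group of related rules
    # alert: name of the alert rule
    # expr: trigger condition, a PromQL expression
    # for: how long the condition must hold before the alert fires
    # labels: custom labels attached to the alert
    # annotations: a set of additional descriptive information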
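
The role of the for clause can be sketched as a small state machine: an alert whose expression is true stays pending until the condition has held for the configured duration, and only then fires. This is a simplified model of the behavior, with the hypothetical helper `alert_states` and fixed evaluation intervals as assumptions.

```python
from typing import List


def alert_states(condition_by_eval: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """Return the alert state (inactive, pending, firing) after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not condition:
            # The condition no longer holds: the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluated every 60s with a 120s hold time (the rule above uses 10m).
print(alert_states([True, True, False, True, True, True], eval_interval_s=60, for_s=120))
```

Note how the single false evaluation in the middle resets the waiting period, which is what keeps short spikes from paging anyone.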

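Alertmanager features

   Grouping: merges related alerts into a single notification.
   Inhibition: once an alert is firing, suppresses notifications for other alerts that it causes.
   Silences: mutes matching alerts.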

Install and start Alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.15.3/alertmanager-0.15.3.linux-amd64.tar.gz
tar -zxvf alertmanager-0.15.3.linux-amd64.tar.gz
cd alertmanager-0.15.3.linux-amd64/
./alertmanager

The alertmanager.yml configuration file

   global:
     resolve_timeout: 5m
   
   route:
     group_by: ['alertname']
     group_wait: 10s
     group_interval: 10s
     repeat_interval: 1h
     receiver: 'web.hook'
   receivers:
   - name: 'web.hook'
     webhook_configs:
     - url: 'http://127.0.0.1:5001/'
   inhibit_rules:
     - source_match:
         severity: 'critical'
       target_match:
         severity: 'warning'
       equal: ['alertname', 'dev', 'instance']
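
The two core parts are the route and the receivers. Every alert enters the routing tree at the top-level route in the configuration and is sent to the matching receiver according to the routing rules.
Global configuration (global): defines shared global parameters, such as the global SMTP and Slack settings.
Templates (templates): defines the templates used for alert notifications, such as HTML and email templates.
Alert routing (route): decides how an alert is handled based on label matching.
Receivers (receivers): a receiver is an abstract concept; it can be an email address, WeChat, Slack, a webhook, and so on. Receivers are normally used together with alert routing.
Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the number of noisy alerts.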
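
How group_by and an inhibition rule act on a set of alerts can be sketched as follows, using the values from the configuration above. This is a simplified model, not Alertmanager's implementation: alerts are plain label dictionaries, timing (group_wait, repeat_interval) is ignored, and `group_alerts` and `is_inhibited` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert reduced to its labels


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Alerts with the same values for the group_by labels share one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_inhibited(target: Alert, alerts: List[Alert], source_match: Dict[str, str],
                 target_match: Dict[str, str], equal: List[str]) -> bool:
    """A target alert is muted while a matching source alert with equal labels exists."""
    if any(target.get(name) != value for name, value in target_match.items()):
        return False
    for source in alerts:
        if source is target:
            continue
        is_source = all(source.get(name) == value for name, value in source_match.items())
        # A missing label is treated like an empty one when comparing.
        same_labels = all(source.get(name, "") == target.get(name, "") for name in equal)
        if is_source and same_labels:
            return True
    return False


alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "a"},
    {"alertname": "HighErrorRate", "severity": "warning", "instance": "a"},
    {"alertname": "HighErrorRate", "severity": "warning", "instance": "b"},
    {"alertname": "DiskFull", "severity": "warning", "instance": "a"},
]

print({key: len(group) for key, group in group_alerts(alerts, ["alertname"]).items()})
for alert in alerts:
    muted = is_inhibited(alert, alerts, {"severity": "critical"}, {"severity": "warning"},
                         ["alertname", "dev", "instance"])
    print(alert, "inhibited" if muted else "sent")
```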

Restart Prometheus

killall -9 prometheus
nohup prometheus &

Installing Prometheus

Install the Prometheus server

wget https://github.com/prometheus/prometheus/releases/download/v2.6.0/prometheus-2.6.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.6.0.linux-amd64.tar.gz
cd prometheus-2.6.0.linux-amd64
./prometheus &
ln -s "$(pwd)/prometheus" /usr/local/bin/prometheus
Start on boot
cat > /usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/prometheus --config.file=/root/prometheus-2.6.0.linux-amd64/prometheus.yml --storage.tsdb.path=/root/prometheus-2.6.0.linux-amd64/data
PrivateTmp=true

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus

Install Node Exporter to collect host metrics (such as CPU, memory, and disk usage)

wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar -zxvf node_exporter-0.17.0.linux-amd64.tar.gz 
cd node_exporter-0.17.0.linux-amd64/
mv node_exporter  /usr/local/bin/
nohup  node_exporter &

Send a test message to the DingTalk robot webhook (a successful call returns errcode 0):

curl -H "Content-Type: application/json" -X POST -d '{"msgtype": "markdown","markdown": {"title":"Prometheus alert","text": "#### Metric\n> Alert description\n\n> ###### Alert time \n"},"at": {"isAtAll": false}}' 'https://oapi.dingtalk.com/robot/send?access_token=51345145d106753486bd71614bf881283f91e2124535276b257f99327e41dc87'
{"errcode":0,"errmsg":"ok"}

To have Prometheus collect this monitoring data, edit prometheus.yml and add the following under scrape_configs.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # scrape the node exporter metrics
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
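
How this configuration maps to jobs and instances can be sketched in Python: every address under targets becomes one instance of its job, and samples scraped from it carry the job and instance labels. The sketch mirrors the YAML above as a Python structure, `scrape_targets` is a hypothetical helper, and the http scheme and /metrics path are the defaults when a job does not override them.

```python
from typing import Dict, List

# The scrape_configs section above, written as a Python structure.
SCRAPE_CONFIGS = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "node", "static_configs": [{"targets": ["localhost:9100"]}]},
]


def scrape_targets(scrape_configs: List[dict]) -> List[Dict[str, object]]:
    """List the URL to scrape and the labels attached for every configured target."""
    targets = []
    for job in scrape_configs:
        scheme = job.get("scheme", "http")
        metrics_path = job.get("metrics_path", "/metrics")
        for static_config in job.get("static_configs", []):
            for address in static_config["targets"]:
                targets.append({
                    "url": f"{scheme}://{address}{metrics_path}",
                    "labels": {"job": job["job_name"], "instance": address},
                })
    return targets


for target in scrape_targets(SCRAPE_CONFIGS):
    print(target["url"], target["labels"])
```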

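Create a visualization dashboard with Grafana

docker run -d -p 3000:3000 grafana/grafana
# Open http://localhost:3000; the default username is admin and the default password is admin

These are initial notes; more will be added later.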
