Before walking through the source code, some Prometheus configuration and usage is worth explaining. First, the API: Prometheus exposes a set of HTTP endpoints.
curl 'http://localhost:9090/api/v1/query?query=go_goroutines' | python -m json.tool
{
"data": {
"result": [
{
"metric": {
"__name__": "go_goroutines",
"instance": "localhost:9090",
"job": "prometheus"
},
"value": [
1493347106.901,
"119"
]
},
{
"metric": {
"__name__": "go_goroutines",
"instance": "10.39.0.45:9100",
"job": "node"
},
"value": [
1493347106.901,
"13"
]
},
{
"metric": {
"__name__": "go_goroutines",
"instance": "10.39.0.53:9100",
"job": "node"
},
"value": [
1493347106.901,
"11"
]
}
],
"resultType": "vector"
},
"status": "success"
}
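The JSON above can be consumed programmatically. A minimal sketch, using a trimmed copy of the response shown above as sample input (note that sample values come back as strings and need converting):

```python
import json

# A trimmed instant-query response in the same shape as the output above.
response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "go_goroutines", "instance": "localhost:9090", "job": "prometheus"},
       "value": [1493347106.901, "119"]},
      {"metric": {"__name__": "go_goroutines", "instance": "10.39.0.45:9100", "job": "node"},
       "value": [1493347106.901, "13"]}
    ]
  }
}
""")

# value is a [timestamp, string-value] pair; index 1 holds the sample value.
goroutines = {r["metric"]["instance"]: float(r["value"][1])
              for r in response["data"]["result"]}
print(goroutines["localhost:9090"])  # 119.0
```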
The above queries the go_goroutines metric at a single instant. Queries can of course also be bounded by a start and end time, and a more powerful feature is that the series endpoint accepts multiple match[] selectors, effectively an OR query:
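A time-bounded query goes to the /api/v1/query_range endpoint, which additionally takes start, end, and step parameters. A sketch of building such a request URL (the time window and step here are invented for illustration):

```python
from urllib.parse import urlencode

# Hypothetical time window; query_range requires start, end, and step.
params = urlencode({
    "query": "go_goroutines",
    "start": "2017-04-28T00:00:00Z",
    "end": "2017-04-28T01:00:00Z",
    "step": "15s",
})
url = "http://localhost:9090/api/v1/query_range?" + params
print(url)
```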
[root@slave3 ~]# curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'|python -m json.tool
{
"data": [
{
"__name__": "up",
"instance": "10.39.0.53:9100",
"job": "node"
},
{
"__name__": "up",
"instance": "localhost:9090",
"job": "prometheus"
},
{
"__name__": "up",
"instance": "10.39.0.45:9100",
"job": "node"
},
{
"__name__": "process_start_time_seconds",
"instance": "localhost:9090",
"job": "prometheus"
}
],
"status": "success"
}
This queries a set of series; series can also be removed with a DELETE request to the same endpoint. Remember the jobs and targets set up in the previous post? Those are exposed through the API as well:
curl http://localhost:9090/api/v1/label/job/values
{"status":"success","data":["node","prometheus"]}
The monitored targets can of course also be queried:
curl http://localhost:9090/api/v1/targets|python -m json.tool
{
"data": {
"activeTargets": [
{
"discoveredLabels": {
"__address__": "10.39.0.53:9100",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "node"
},
"health": "up",
"labels": {
"instance": "10.39.0.53:9100",
"job": "node"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:40.871586825Z",
"scrapeUrl": "http://10.39.0.53:9100/metrics"
},
{
"discoveredLabels": {
"__address__": "10.39.0.45:9100",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "node"
},
"health": "up",
"labels": {
"instance": "10.39.0.45:9100",
"job": "node"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:45.144032466Z",
"scrapeUrl": "http://10.39.0.45:9100/metrics"
},
{
"discoveredLabels": {
"__address__": "localhost:9090",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "prometheus"
},
"health": "up",
"labels": {
"instance": "localhost:9090",
"job": "prometheus"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:44.079111193Z",
"scrapeUrl": "http://localhost:9090/metrics"
}
]
},
"status": "success"
}
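The targets response above lends itself to health checks, e.g. listing every target that is not up. A minimal sketch over a trimmed response in the same shape (the "down" entry is invented for illustration; all targets in the real output above are up):

```python
import json

# Trimmed /api/v1/targets response; the "down" target is a made-up example.
targets = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"health": "up",   "labels": {"instance": "10.39.0.53:9100", "job": "node"},
       "lastError": ""},
      {"health": "down", "labels": {"instance": "10.39.0.45:9100", "job": "node"},
       "lastError": "connection refused"}
    ]
  }
}
""")

# Collect the instance label of every target whose health is not "up".
down = [t["labels"]["instance"]
        for t in targets["data"]["activeTargets"] if t["health"] != "up"]
print(down)  # ['10.39.0.45:9100']
```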
That covers querying targets; alertmanagers can likewise be queried via /api/v1/alertmanagers. For Prometheus's local storage, there are some key metrics worth watching:
prometheus_local_storage_memory_series: the number of series currently held in memory.
prometheus_local_storage_open_head_chunks: the number of open head chunks.
prometheus_local_storage_chunks_to_persist: the number of memory chunks still waiting to be persisted to disk.
prometheus_local_storage_memory_chunks: the number of chunks currently in memory. Subtracting the previous two gives the number of persisted chunks (which are evictable if no query is currently using them).
prometheus_local_storage_series_chunks_persisted: a histogram of the number of chunks persisted per batch.
prometheus_local_storage_rushed_mode: 1 if Prometheus is in "rushed mode", 0 otherwise. Can be used to calculate the percentage of time spent in rushed mode.
prometheus_local_storage_checkpoint_last_duration_seconds: how long the last checkpoint took.
prometheus_local_storage_checkpoint_last_size_bytes: the size of the last checkpoint in bytes.
prometheus_local_storage_checkpointing: 1 while Prometheus is checkpointing, 0 otherwise. Can be used to calculate the percentage of time spent checkpointing.
prometheus_local_storage_inconsistencies_total: a counter of storage inconsistencies found. If it is greater than 0, restart the server to recover.
prometheus_local_storage_persist_errors_total: a counter of persist errors.
prometheus_local_storage_memory_dirty_series: the current number of dirty series.
process_resident_memory_bytes: broadly speaking, the physical memory occupied by the Prometheus process.
go_memstats_alloc_bytes: the Go heap size (allocated objects in use, plus allocated objects no longer in use but not yet garbage-collected).
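The relationship between the chunk metrics above can be made concrete with some arithmetic; the sample values below are invented, not real measurements:

```python
# Hypothetical values scraped from /metrics (invented for illustration).
memory_chunks = 100000     # prometheus_local_storage_memory_chunks
open_head_chunks = 2000    # prometheus_local_storage_open_head_chunks
chunks_to_persist = 3000   # prometheus_local_storage_chunks_to_persist

# Persisted chunks = chunks in memory minus open head chunks and chunks
# still waiting to be persisted; these are evictable if unused by queries.
persisted = memory_chunks - open_head_chunks - chunks_to_persist
print(persisted)  # 95000
```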
Another advanced use of Prometheus is federation: by designating slaves, you can deploy one Prometheus per data center and then aggregate them through federation.
scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^job:.*"}' # Request all job-level time series
    static_configs:
      - targets:
          - dc1-prometheus:9090
          - dc2-prometheus:9090
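The /federate endpoint accepts repeated match[] parameters, one per selector, just like the series endpoint earlier. A sketch of how such a URL is encoded (the second selector, up, is added here purely as an example):

```python
from urllib.parse import urlencode

# doseq=True encodes the list as repeated match[] parameters,
# which is how /federate expects multiple selectors.
params = urlencode({"match[]": ['{__name__=~"^job:.*"}', 'up']}, doseq=True)
url = "http://dc1-prometheus:9090/federate?" + params
print(url)
```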
And if a single server cannot hold the full load, scraping can also be sharded:
global:
  external_labels:
    slave: 1  # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs.
    relabel_configs:
      - source_labels: [__address__]
        modulus: 4        # 4 slaves
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^1$        # This is the 2nd slave
        action: keep
The config above uses hashmod relabeling to decide which targets each Prometheus slave is responsible for.
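The partitioning idea can be sketched as follows. This is illustrative only: Prometheus's hashmod action uses its own md5-based hash internally, so the exact shard numbers here may differ from what Prometheus computes, but the keep-if-hash-matches logic is the same:

```python
import hashlib

# Illustrative stand-in for Prometheus's hashmod: hash the target address
# and take it modulo the number of slaves.
def shard_of(address: str, modulus: int = 4) -> int:
    digest = hashlib.md5(address.encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

targets = ["10.39.0.45:9100", "10.39.0.53:9100", "localhost:9090"]

# The slave configured with `regex: ^1$` keeps only targets hashing to 1.
mine = [t for t in targets if shard_of(t) == 1]
```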
scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^slave:.*"}' # Request all slave-level time series
    static_configs:
      - targets:
          - slave0:9090
          - slave1:9090
          - slave3:9090
          - slave4:9090
This defines multiple slaves for the master to federate from, so the data is stored in shards across them.