Before walking through the source code, some Prometheus configuration and usage is worth explaining. First, the API: Prometheus exposes a set of HTTP endpoints.
curl 'http://localhost:9090/api/v1/query?query=go_goroutines' | python -m json.tool
{
"data": {
"result": [
{
"metric": {
"__name__": "go_goroutines",
"instance": "localhost:9090",
"job": "prometheus"
},
"value": [
1493347106.901,
"119"
]
},
{
"metric": {
"__name__": "go_goroutines",
"instance": "10.39.0.45:9100",
"job": "node"
},
"value": [
1493347106.901,
"13"
]
},
{
"metric": {
"__name__": "go_goroutines",
"instance": "10.39.0.53:9100",
"job": "node"
},
"value": [
1493347106.901,
"11"
]
}
],
"resultType": "vector"
},
"status": "success"
}
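The JSON above can be consumed programmatically. A minimal sketch, using a trimmed copy of the response shown above as sample input (note that sample values come back as strings and need converting):

```python
import json

# A trimmed instant-query response in the same shape as the output above.
response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "go_goroutines", "instance": "localhost:9090", "job": "prometheus"},
       "value": [1493347106.901, "119"]},
      {"metric": {"__name__": "go_goroutines", "instance": "10.39.0.45:9100", "job": "node"},
       "value": [1493347106.901, "13"]}
    ]
  }
}
""")

# value is a [timestamp, string-value] pair; index 1 holds the sample value.
goroutines = {r["metric"]["instance"]: float(r["value"][1])
              for r in response["data"]["result"]}
print(goroutines["localhost:9090"])  # 119.0
```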
The above queries the go_goroutines metric at a single instant. Queries can of course also be bounded by a start and end time, and a more powerful feature is that the series endpoint accepts multiple match[] selectors, effectively an OR query:
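A time-bounded query goes to the /api/v1/query_range endpoint, which additionally takes start, end, and step parameters. A sketch of building such a request URL (the time window and step here are invented for illustration):

```python
from urllib.parse import urlencode

# Hypothetical time window; query_range requires start, end, and step.
params = urlencode({
    "query": "go_goroutines",
    "start": "2017-04-28T00:00:00Z",
    "end": "2017-04-28T01:00:00Z",
    "step": "15s",
})
url = "http://localhost:9090/api/v1/query_range?" + params
print(url)
```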
[root@slave3 ~]# curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'|python -m json.tool
{
"data": [
{
"__name__": "up",
"instance": "10.39.0.53:9100",
"job": "node"
},
{
"__name__": "up",
"instance": "localhost:9090",
"job": "prometheus"
},
{
"__name__": "up",
"instance": "10.39.0.45:9100",
"job": "node"
},
{
"__name__": "process_start_time_seconds",
"instance": "localhost:9090",
"job": "prometheus"
}
],
"status": "success"
}
This queries a set of series; series can also be removed with a DELETE request to the same endpoint. Remember the jobs and targets set up in the previous post? Those are exposed through the API as well:
curl http://localhost:9090/api/v1/label/job/values
{"status":"success","data":["node","prometheus"]}
The monitored targets can of course also be queried:
curl http://localhost:9090/api/v1/targets|python -m json.tool
{
"data": {
"activeTargets": [
{
"discoveredLabels": {
"__address__": "10.39.0.53:9100",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "node"
},
"health": "up",
"labels": {
"instance": "10.39.0.53:9100",
"job": "node"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:40.871586825Z",
"scrapeUrl": "http://10.39.0.53:9100/metrics"
},
{
"discoveredLabels": {
"__address__": "10.39.0.45:9100",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "node"
},
"health": "up",
"labels": {
"instance": "10.39.0.45:9100",
"job": "node"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:45.144032466Z",
"scrapeUrl": "http://10.39.0.45:9100/metrics"
},
{
"discoveredLabels": {
"__address__": "localhost:9090",
"__metrics_path__": "/metrics",
"__scheme__": "http",
"job": "prometheus"
},
"health": "up",
"labels": {
"instance": "localhost:9090",
"job": "prometheus"
},
"lastError": "",
"lastScrape": "2017-04-28T02:47:44.079111193Z",
"scrapeUrl": "http://localhost:9090/metrics"
}
]
},
"status": "success"
}
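The targets response above lends itself to health checks, e.g. listing every target that is not up. A minimal sketch over a trimmed response in the same shape (the "down" entry is invented for illustration; all targets in the real output above are up):

```python
import json

# Trimmed /api/v1/targets response; the "down" target is a made-up example.
targets = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"health": "up",   "labels": {"instance": "10.39.0.53:9100", "job": "node"},
       "lastError": ""},
      {"health": "down", "labels": {"instance": "10.39.0.45:9100", "job": "node"},
       "lastError": "connection refused"}
    ]
  }
}
""")

# Collect the instance label of every target whose health is not "up".
down = [t["labels"]["instance"]
        for t in targets["data"]["activeTargets"] if t["health"] != "up"]
print(down)  # ['10.39.0.45:9100']
```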
That covers querying targets; alertmanagers can likewise be queried via /api/v1/alertmanagers. For Prometheus's local storage, there are some key metrics worth watching:
prometheus_local_storage_memory_series: the number of series currently held in memory.
prometheus_local_storage_open_head_chunks: the number of open head chunks.
prometheus_local_storage_chunks_to_persist: the number of memory chunks still waiting to be persisted to disk.
prometheus_local_storage_memory_chunks: the number of chunks currently in memory. Subtracting the previous two gives the number of persisted chunks (which are evictable if no query is currently using them).
prometheus_local_storage_series_chunks_persisted: a histogram of the number of chunks persisted per batch.
prometheus_local_storage_rushed_mode: 1 if Prometheus is in "rushed mode", 0 otherwise. Can be used to calculate the percentage of time spent in rushed mode.
prometheus_local_storage_checkpoint_last_duration_seconds: how long the last checkpoint took.
prometheus_local_storage_checkpoint_last_size_bytes: the size of the last checkpoint in bytes.
prometheus_local_storage_checkpointing: 1 while Prometheus is checkpointing, 0 otherwise. Can be used to calculate the percentage of time spent checkpointing.
prometheus_local_storage_inconsistencies_total: a counter of storage inconsistencies found. If it is greater than 0, restart the server to recover.
prometheus_local_storage_persist_errors_total: a counter of persist errors.
prometheus_local_storage_memory_dirty_series: the current number of dirty series.
process_resident_memory_bytes: broadly speaking, the physical memory occupied by the Prometheus process.
go_memstats_alloc_bytes: the Go heap size (allocated objects in use, plus allocated objects no longer in use but not yet garbage-collected).
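The relationship between the chunk metrics above can be made concrete with some arithmetic; the sample values below are invented, not real measurements:

```python
# Hypothetical values scraped from /metrics (invented for illustration).
memory_chunks = 100000     # prometheus_local_storage_memory_chunks
open_head_chunks = 2000    # prometheus_local_storage_open_head_chunks
chunks_to_persist = 3000   # prometheus_local_storage_chunks_to_persist

# Persisted chunks = chunks in memory minus open head chunks and chunks
# still waiting to be persisted; these are evictable if unused by queries.
persisted = memory_chunks - open_head_chunks - chunks_to_persist
print(persisted)  # 95000
```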
Another advanced use of Prometheus is federation: by designating slaves, you can deploy one Prometheus per data center and then aggregate them through federation.
scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^job:.*"}' # Request all job-level time series
    static_configs:
      - targets:
          - dc1-prometheus:9090
          - dc2-prometheus:9090
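The /federate endpoint accepts repeated match[] parameters, one per selector, just like the series endpoint earlier. A sketch of how such a URL is encoded (the second selector, up, is added here purely as an example):

```python
from urllib.parse import urlencode

# doseq=True encodes the list as repeated match[] parameters,
# which is how /federate expects multiple selectors.
params = urlencode({"match[]": ['{__name__=~"^job:.*"}', 'up']}, doseq=True)
url = "http://dc1-prometheus:9090/federate?" + params
print(url)
```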
And if a single server cannot hold the full load, scraping can also be sharded:
global:
  external_labels:
    slave: 1  # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs.
    relabel_configs:
      - source_labels: [__address__]
        modulus: 4        # 4 slaves
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^1$        # This is the 2nd slave
        action: keep
The config above uses hashmod relabeling to decide which targets each Prometheus slave is responsible for.
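The partitioning idea can be sketched as follows. This is illustrative only: Prometheus's hashmod action uses its own md5-based hash internally, so the exact shard numbers here may differ from what Prometheus computes, but the keep-if-hash-matches logic is the same:

```python
import hashlib

# Illustrative stand-in for Prometheus's hashmod: hash the target address
# and take it modulo the number of slaves.
def shard_of(address: str, modulus: int = 4) -> int:
    digest = hashlib.md5(address.encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

targets = ["10.39.0.45:9100", "10.39.0.53:9100", "localhost:9090"]

# The slave configured with `regex: ^1$` keeps only targets hashing to 1.
mine = [t for t in targets if shard_of(t) == 1]
```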
scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^slave:.*"}' # Request all slave-level time series
    static_configs:
      - targets:
          - slave0:9090
          - slave1:9090
          - slave3:9090
          - slave4:9090
This defines multiple slaves for the master to federate from, so the data is stored in shards across them.