Monitoring Docker containers with Prometheus

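cAdvisor exposes container statistics as Prometheus metrics. By default, these metrics are served under the /metrics HTTP endpoint. The endpoint can be customized by setting the -prometheus_endpoint command-line flag. To monitor cAdvisor with Prometheus, configure one or more jobs in Prometheus that scrape the relevant cAdvisor processes at that metrics endpoint.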

1 Start the cAdvisor container

docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest

Verify the collected metrics

curl http://172.16.0.8:8080/metrics
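
The curl output is in the Prometheus text exposition format, which can be long. As a minimal sketch of how to summarize it, the Python snippet below counts samples per metric name for names starting with container_. The function name count_metric_families and the SAMPLE text are hypothetical and only illustrate the format; they are not real output from the host above.

```python
from collections import Counter


def count_metric_families(exposition_text: str, prefix: str = "container_") -> Counter:
    """Count samples per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            counts[name] += 1
    return counts


# Hypothetical sample, shaped like cAdvisor output but not captured from a real host.
SAMPLE = """\
# HELP container_last_seen Last time a container was seen by the exporter
# TYPE container_last_seen gauge
container_last_seen{id="/",image="",name=""} 1.6e+09
container_last_seen{id="/docker/abc",image="nginx:latest",name="web"} 1.6e+09
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{id="/docker/abc",image="nginx:latest",name="web"} 1048576
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.5
"""

counts = count_metric_families(SAMPLE)
print(dict(counts))
```

To run this against a live cAdvisor, the text could be fetched with urllib.request.urlopen("http://172.16.0.8:8080/metrics").read().decode() and passed to the same function.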

Some commonly used container monitoring queries (PromQL computed from the metrics collected by cAdvisor):

Container CPU usage (in CPU cores):
sum(irate(container_cpu_usage_seconds_total{image!=""}[1m])) without (cpu)

Container memory usage (unit: bytes):
container_memory_usage_bytes{image!=""}

Container network receive rate (unit: bytes/second):
sum(rate(container_network_receive_bytes_total{image!=""}[1m])) without (interface)

Container network transmit rate (unit: bytes/second):
sum(rate(container_network_transmit_bytes_total{image!=""}[1m])) without (interface)

Container filesystem read rate (unit: bytes/second):
sum(rate(container_fs_reads_bytes_total{image!=""}[1m])) without (device)

Container filesystem write rate (unit: bytes/second):
sum(rate(container_fs_writes_bytes_total{image!=""}[1m])) without (device)
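
To make the CPU query concrete, here is a rough Python sketch of what irate(...) followed by sum(...) without (cpu) computes: the per-second increase between the last two samples of each counter series, summed over the per-CPU series of each container. The sample values and the names irate and per_cpu are made up for illustration, and the sketch ignores details of the real PromQL engine such as staleness handling and range selection.

```python
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A counter that went down is treated as having been reset to zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


# Hypothetical samples: cumulative CPU seconds per (container name, cpu) series.
per_cpu = {
    ("web", "cpu00"): [(0, 100.0), (15, 103.0), (30, 109.0)],
    ("web", "cpu01"): [(0, 50.0), (15, 51.5), (30, 53.0)],
}

# Equivalent of "sum(...) without (cpu)": drop the cpu label and add the rates.
cpu_cores_by_container = defaultdict(float)
for (name, _cpu), samples in per_cpu.items():
    value = irate(samples)
    if value is not None:
        cpu_cores_by_container[name] += value

print(dict(cpu_cores_by_container))
```

With these numbers, cpu00 advanced by 6 CPU seconds and cpu01 by 1.5 CPU seconds over the last 15 seconds, so the container used roughly half a core in total.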

2 Add Docker monitoring to Prometheus

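Add the following job under scrape_configs in the Prometheus configuration file (typically prometheus.yml):

  - job_name: 'docker'
    static_configs:
      - targets: ['172.16.0.8:8080']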

Restart the Prometheus service
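
After the restart, it is worth confirming that the new job is being scraped. One way is the targets endpoint of the Prometheus HTTP API (/api/v1/targets). The Python sketch below parses a response of that shape and reports the health of each instance in a job. The helper name job_target_health and SAMPLE_RESPONSE are hypothetical, and the sample is hand-written rather than captured from a real server.

```python
import json


def job_target_health(targets_response: dict, job: str) -> dict:
    """Map each instance of `job` to its reported health ("up", "down" or "unknown")."""
    health = {}
    for target in targets_response.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") == job:
            health[labels.get("instance", "")] = target.get("health", "unknown")
    return health


# Hand-written sample in the shape returned by GET /api/v1/targets.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "172.16.0.8:8080", "job": "docker"},
       "scrapeUrl": "http://172.16.0.8:8080/metrics",
       "health": "up"},
      {"labels": {"instance": "localhost:9090", "job": "prometheus"},
       "scrapeUrl": "http://localhost:9090/metrics",
       "health": "up"}
    ],
    "droppedTargets": []
  }
}
""")

docker_health = job_target_health(SAMPLE_RESPONSE, "docker")
print(docker_health)
```

Against a live server, the JSON would come from the Prometheus address (assumed here to be the default port 9090 on the Prometheus host) instead of the embedded sample. An instance reported as "down" usually points to a wrong target address or a cAdvisor container that is not running.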

3 Grafana dashboard

Import the Docker monitoring template, dashboard ID: 193

4 Add alerting rules

  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Container killed (instance {{ $labels.instance }})
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # cAdvisor itself can sometimes consume a lot of CPU, which can make this alert fire constantly.
  # To exclude it from this alert, filter out the series with an empty name: container_cpu_usage_seconds_total{name!=""}
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container CPU usage (instance {{ $labels.instance }})
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # See https:///faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume usage (instance {{ $labels.instance }})
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerVolumeIoUsage
    expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume IO usage (instance {{ $labels.instance }})
      description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
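
The threshold arithmetic in these rules can be checked by hand. As a rough illustration of the ContainerMemoryUsage expression, the Python sketch below divides the working set by the memory limit per (instance, name) pair, skips containers whose limit is 0 (no limit configured, mirroring the "> 0" filter), and keeps the pairs above 80%. All values and names are hypothetical, and each pair is assumed to map to a single series, so the sum(...) BY step is already applied.

```python
MIB = 1024 * 1024


def memory_usage_percent(working_set: dict, limits: dict) -> dict:
    """Working set as a percentage of the limit, per (instance, name) key."""
    result = {}
    for key, used_bytes in working_set.items():
        limit_bytes = limits.get(key, 0)
        # Containers without a limit report 0 and are dropped, as with "> 0".
        if limit_bytes > 0:
            result[key] = used_bytes / limit_bytes * 100
    return result


# Hypothetical values for three containers on one host.
working_set = {
    ("172.16.0.8:8080", "web"): 900 * MIB,
    ("172.16.0.8:8080", "db"): 200 * MIB,
    ("172.16.0.8:8080", "cache"): 500 * MIB,
}
limits = {
    ("172.16.0.8:8080", "web"): 1024 * MIB,
    ("172.16.0.8:8080", "db"): 1024 * MIB,
    ("172.16.0.8:8080", "cache"): 0,  # no memory limit set
}

usage = memory_usage_percent(working_set, limits)
firing = sorted(key for key, percent in usage.items() if percent > 80)
print(firing)
```

In this example only the web container crosses the threshold. In Prometheus the alert would still have to stay above 80% for the 2m "for" duration before it fires.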