Monitoring Docker Swarm with Telegraf; Monitoring Process Status with Telegraf

Reposted

mob6454cc7796a7 2024-01-14 21:13:52

This is part of a series. The previous article:

Telegraf Monitoring Client Research Notes (1): Introduction, Installation, Initial Testing

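By now you have a basic understanding of Telegraf, but using it well is another matter. Today I focused on how Telegraf collects CPU, memory, and disk metrics. Most of them are easy to understand; the disk IO ones are a bit more troublesome. Let's get started.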

CPU

The CPU metrics are fairly simple, and so is the configuration. It lives in the inputs.cpu section:

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false

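percpu defaults to true, which collects metrics for every core; I suggest keeping the default. totalcpu also defaults to true and collects the overall picture. Alert rules usually depend on it, so keep it at true. collect_cpu_time collects the raw time the CPU has spent in each state. Since the percent metrics are already collected, the time metrics seem unnecessary, so leave it at false. report_active effectively collects CPU utilization. It defaults to false, but I suggest changing it to true, because some people are still used to looking at CPU usage from the util perspective.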
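
To make the report_active idea concrete, here is a minimal sketch of how an "active" percentage can be derived from two snapshots of cumulative CPU time. It assumes, following the config comment, that active means every state other than idle; the exact list of states Telegraf sums is not shown in this article, and the function name and sample numbers are hypothetical.

```python
# Sketch only: "active" is treated as everything except idle (an assumption
# based on the config comment). The snapshot values are made up.

def active_percent(prev: dict, curr: dict) -> float:
    """Share of elapsed CPU time spent in any state other than idle."""
    total_delta = sum(curr.values()) - sum(prev.values())
    if total_delta <= 0:
        return 0.0  # no time elapsed between snapshots
    idle_delta = curr["idle"] - prev["idle"]
    return 100.0 * (total_delta - idle_delta) / total_delta


# Cumulative seconds per CPU state at two points in time (hypothetical).
prev = {"user": 100.0, "system": 50.0, "iowait": 10.0, "idle": 840.0}
curr = {"user": 106.0, "system": 52.0, "iowait": 12.0, "idle": 930.0}

usage_active = active_percent(prev, curr)
print(f"usage_active = {usage_active:.1f}%")
```

Between the two snapshots 100 seconds of CPU time elapsed and 90 of them were idle, so the active share is 10%.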

MEM

The memory metrics are even simpler. The default configuration contains nothing:

[[inputs.mem]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration
  fielddrop = ["inactive"]

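The second block above adds fielddrop, which drops the listed fields (here, inactive). Its opposite is fieldpass, which keeps only the listed fields. Both support wildcards. Suppose I only want the percent and total metrics; fieldpass plus wildcards does the job: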

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration
  fieldpass = ["*percent", "*total"]

This example may not be meaningful in production; it is mainly for demonstration.
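
To see which fields the two patterns would keep, the filtering can be sketched in Python. This approximates Telegraf's glob matching with the standard library's fnmatch, which is an assumption rather than Telegraf's actual implementation. The helper names are hypothetical and the field values are made up; the field names are typical of the mem plugin.

```python
from fnmatch import fnmatchcase


def apply_fieldpass(fields: dict, patterns: list) -> dict:
    """Keep only fields whose name matches at least one glob pattern."""
    return {k: v for k, v in fields.items()
            if any(fnmatchcase(k, p) for p in patterns)}


def apply_fielddrop(fields: dict, patterns: list) -> dict:
    """Drop fields whose name matches at least one glob pattern."""
    return {k: v for k, v in fields.items()
            if not any(fnmatchcase(k, p) for p in patterns)}


# Hypothetical memory fields with made-up values.
mem_fields = {
    "total": 8,
    "used": 3,
    "used_percent": 37.5,
    "available_percent": 62.5,
    "inactive": 1,
    "swap_total": 2,
}

kept = apply_fieldpass(mem_fields, ["*percent", "*total"])
dropped = apply_fielddrop(mem_fields, ["inactive"])
print(sorted(kept))
print(sorted(dropped))
```

Note that "*total" also matches swap_total, so a wildcard can keep more fields than you might expect at first glance.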

DISK

This input plugin mainly collects disk usage metrics. The configuration:

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

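ignore_fs filters out filesystem types you do not want to collect, which is very useful. mount_points will probably be used less often, since in most cases you want all partitions to be discovered automatically. The comments explain what mount_points does, so I won't repeat it here.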

Let's run a test to see which metrics it produces:

[root@10-255-0-34 telegraf-tmp]# ./usr/bin/telegraf --config etc/telegraf/telegraf.conf --test --input-filter disk
2021-11-06T10:29:05Z I! Starting Telegraf 1.20.2
> disk,device=vda1,fstype=xfs,host=10-255-0-34,mode=rw,path=/ free=34200436736i,inodes_free=20858317i,inodes_total=20970992i,inodes_used=112675i,total=42938118144i,used=8737681408i,used_percent=20.34947451282508 1636194546000000000

For disk usage, the metrics to watch are used_percent and inodes_free: the former is the disk usage percentage, the latter the number of inodes remaining.
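
As a sanity check, the percentage can be recomputed from the raw fields in the sample output. The formula used / (used + free) is an assumption on my part; it does reproduce the used_percent value printed above. The inode percentage is not emitted by the plugin in this sample and is derived here only for illustration.

```python
# Field values copied from the sample line protocol output above.
fields = {
    "free": 34200436736,
    "total": 42938118144,
    "used": 8737681408,
    "inodes_free": 20858317,
    "inodes_total": 20970992,
    "inodes_used": 112675,
}

# Assumed formula: percentage of used space out of used + free.
used_percent = 100.0 * fields["used"] / (fields["used"] + fields["free"])

# Derived for illustration: percentage of inodes in use.
inodes_used_percent = 100.0 * fields["inodes_used"] / fields["inodes_total"]

print(f"used_percent        = {used_percent:.4f}")
print(f"inodes_used_percent = {inodes_used_percent:.4f}")
```

In PromQL terms, an inode alert could be built the same way from inodes_used and inodes_total if a percentage is easier to threshold than an absolute count.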

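The example above is also a good place to introduce another trick. The output carries a mode=rw tag. If I consider that tag useless and want to drop it, how do I configure that? With tagexclude: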

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
  tagexclude = ["mode"]

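The effect of this configuration is that the mode tag is dropped. The opposite is taginclude, which keeps only the listed tags. taginclude probably has fewer use cases; tagexclude should be the more common one.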
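
The effect on the tag set can be sketched with a plain dictionary filter. This is only an illustration of the semantics, using exact name matching; the helper names are hypothetical and this is not how Telegraf implements it.

```python
def tagexclude(tags: dict, excluded: list) -> dict:
    """Return the tags with every listed key removed."""
    return {k: v for k, v in tags.items() if k not in excluded}


def taginclude(tags: dict, included: list) -> dict:
    """Return only the tags whose key is listed."""
    return {k: v for k, v in tags.items() if k in included}


# Tags copied from the sample disk output above.
tags = {
    "device": "vda1",
    "fstype": "xfs",
    "host": "10-255-0-34",
    "mode": "rw",
    "path": "/",
}

without_mode = tagexclude(tags, ["mode"])
only_path = taginclude(tags, ["path"])
print(without_mode)
print(only_path)
```

With tagexclude you name the few tags you dislike and keep everything else, which is why it tends to be the more convenient of the two.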

DISKIO

An excerpt of the configuration:

# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]

The devices field may come in handy later, so keep it in mind.

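The disk IO metrics mainly come from the kernel file /proc/diskstats. Every field is collected, but in practice that is not enough: compared with iostat output, quite a lot is missing, such as io.util, queue length, and read/write await. How do we compute these?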

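Below we use PromQL to derive them. Starting with Nightingale v5, later versions have all embraced the Prometheus ecosystem, so both dashboards and alert rules support PromQL. The queries are as follows (I only looked into this today, so I may not have it exactly right; please point out any mistakes):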

IO utilization (iostat %util)

rate(diskio_io_time[1m])/10

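diskio_io_time is the time spent on IO, in milliseconds. After rate, it is the number of milliseconds per second spent on IO. Converting milliseconds to seconds: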

rate(diskio_io_time[1m])/1000

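This query gives the fraction of each second spent on IO, that is, the share of time spent on IO. Ratios are usually shown as percentages, so multiply the value by 100: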

rate(diskio_io_time[1m])/1000*100

This is equivalent to simply dividing the rate by 10.
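
The same arithmetic can be worked through in Python with two samples of the counter. The counter values and the 60-second window are hypothetical, the function name is mine, and the rate is a plain difference quotient rather than Prometheus's exact rate() extrapolation.

```python
def io_util_percent(io_time_prev_ms: float, io_time_curr_ms: float,
                    interval_s: float) -> float:
    """Approximate iostat %util from two samples of the io_time counter."""
    # Milliseconds spent on IO per second of wall-clock time.
    rate_ms_per_s = (io_time_curr_ms - io_time_prev_ms) / interval_s
    # ms/s -> fraction of a second (/1000) -> percent (*100), i.e. /10.
    return rate_ms_per_s / 10.0


# Hypothetical counter values taken 60 seconds apart.
util = io_util_percent(120_000, 135_000, 60)
print(f"%util = {util:.1f}")
```

Here the device was busy for 15,000 ms out of a 60,000 ms window, which is 25%.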

Queue length (iostat avgqu-sz)

rate(diskio_weighted_io_time[1m])/1000

diskio_weighted_io_time includes the time requests spend in the backlog, so its rate can be used to express the IO queue length.
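
A worked example may help show why this yields a queue length. The weighted counter grows by the number of in-flight requests for every millisecond that passes, so its growth per second divided by 1000 is the average number of requests in flight. The sample values and function name below are hypothetical.

```python
def avg_queue_size(weighted_prev_ms: float, weighted_curr_ms: float,
                   interval_s: float) -> float:
    """Approximate iostat avgqu-sz from two weighted_io_time samples."""
    # Weighted milliseconds accumulated per second of wall-clock time.
    rate_ms_per_s = (weighted_curr_ms - weighted_prev_ms) / interval_s
    # Dividing by 1000 ms per second leaves the average requests in flight.
    return rate_ms_per_s / 1000.0


# Hypothetical counter values taken 60 seconds apart.
queue = avg_queue_size(500_000, 590_000, 60)
print(f"avgqu-sz = {queue:.2f}")
```

The counter grew by 90,000 weighted ms in 60,000 ms of wall-clock time, so on average 1.5 requests were outstanding.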

Average time per write request (iostat w_await)

rate(diskio_write_time[1m])/rate(diskio_writes[1m])

The unit is ms.

Average time per read request (iostat r_await)

rate(diskio_read_time[1m])/rate(diskio_reads[1m])

The unit is ms.
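
Both await queries divide the growth of a time counter by the growth of a request counter, so the window length cancels out. A minimal sketch with hypothetical sample values follows; the function name and the choice to return None when no requests completed are mine (in PromQL the 0/0 division yields NaN instead).

```python
from typing import Optional


def await_ms(time_prev_ms: float, time_curr_ms: float,
             ops_prev: float, ops_curr: float) -> Optional[float]:
    """Average milliseconds per completed request between two samples.

    Works for reads (read_time / reads) and writes (write_time / writes).
    Returns None when no request completed in the window.
    """
    ops_delta = ops_curr - ops_prev
    if ops_delta <= 0:
        return None
    return (time_curr_ms - time_prev_ms) / ops_delta


# Hypothetical write counters sampled at the start and end of a window.
w_await = await_ms(40_000, 46_000, 10_000, 11_200)
# Hypothetical idle device: no reads completed in the window.
r_await = await_ms(5_000, 5_000, 700, 700)
print(w_await, r_await)
```

In this example 1,200 writes took 6,000 ms in total, so each write waited 5 ms on average.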

The IO metrics above make the point: if we cannot control the collection side (which is often the case), the server side must support a query language!

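If you use Telegraf with Prometheus, please verify these queries in production and compare them with the iostat output to see whether they are correct. The values will not match iostat exactly; for sampled data, being close is good enough. If you find problems, feel free to leave a comment.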

