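I'm just starting out with Grafana performance monitoring, so here are some brief notes.
A monitoring solution for Windows: every tool in the stack (Grafana, InfluxDB, and Telegraf) can run on a Windows instance.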
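Monitoring architecture
A monitoring system generally consists of three parts: a metric collector, a data store, and a visualization tool (UI):
Metric collector: gathers monitoring data from the system or from agents. It usually consists of a monitoring agent plus some data-collection scripts. Common collectors include Zabbix Agent, Telegraf, CollectD, StatsD, Datadog, Pushgateway, and other metric-gathering tools.
Data store: the database that holds the monitoring data, often a time-series database, for example MySQL, RRDtool, Elasticsearch, and InfluxDB, which this article uses.
Visualization tools: the Zabbix PHP frontend, Nagios, Grafana, Chronograf, and so on.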
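The monitoring architecture used in this article is shown in the figure below:
Telegraf periodically queries the Windows Performance Counters API for monitoring data and sends the results to the InfluxDB database. Grafana queries the data through InfluxDB's data interface, presents it on dashboards, and raises alerts based on alert thresholds.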
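To make the data flow a bit more concrete, here is a minimal sketch of the text format (InfluxDB line protocol) in which a collector such as Telegraf hands points to InfluxDB. The function name, the sample host name WIN-01 and the timestamp are my own illustrative assumptions, and the sketch only handles float fields; real Telegraf output covers more types.

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as a line-protocol string (float fields only).

    Simplified sketch: string, integer and boolean field values are not handled.
    """
    # Measurement names escape commas and spaces.
    head = _escape(measurement, ", ")
    # Tag keys and values escape commas, equals signs and spaces.
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    # Field keys follow the same escaping rules as tag keys.
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical sample point; the host name and timestamp are made up.
    print(to_line_protocol(
        "win_cpu",
        {"host": "WIN-01", "instance": "_Total"},
        {"Percent_Processor_Time": 12.5},
        1600000000000000000,
    ))
```

Each line carries a measurement name, optional tags, one or more fields and a timestamp, which is roughly what ends up as one row when you later query the measurement.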
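System deployment
From the architecture we know there are three main components to deploy: Telegraf, InfluxDB, and Grafana. The steps below walk through how each one is deployed and configured. All three can be downloaded for free from their official websites; get the 64-bit Windows builds.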
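Installing InfluxDB
Download: https://dl.influxdata.com/influxdb/releases/influxdb-1.7.3_windows_amd64.zip (I chose version 1.7.3.)
Save the zip package to a directory (this article uses D:\monitor) and extract it. The package consists mainly of the following files: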
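influx.exe: the InfluxDB command-line client, used to access the database and run tests.
influxd.exe: the InfluxDB server program. This is the main executable; it starts an InfluxDB instance.
influx_stress.exe: the InfluxDB stress-testing tool.
influx_inspect.exe: the InfluxDB disk and shard inspection tool.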
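After extracting the package, open a Windows PowerShell terminal as administrator and change to the directory D:\grafana\monitor\influxdb-1.7.3-1.
Then run influxd.exe directly to start an InfluxDB instance, as shown below:
As shown above, we have to allow the program through Windows Defender Firewall (WDF). For convenience we simply start the InfluxDB instance directly here. For production use, it should be registered as a Windows service and started and managed as a service.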
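Once influxd.exe is running, a quick way to confirm that the instance is reachable is the /ping endpoint of the InfluxDB 1.x HTTP API, which answers with HTTP 204. Below is a small sketch of such a check; the function names are my own, and the address http://127.0.0.1:8086 assumes a local instance on the default port.

```python
import urllib.request


def ping_url(base_url):
    """Build the /ping URL for an InfluxDB 1.x instance."""
    return base_url.rstrip("/") + "/ping"


def influxdb_is_up(base_url="http://127.0.0.1:8086", timeout=3.0):
    """Return True if GET <base_url>/ping answers with HTTP 204."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            return resp.status == 204
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


if __name__ == "__main__":
    print("InfluxDB reachable:", influxdb_is_up())
```

If this reports False even though influxd.exe is running, the firewall rule mentioned above is the first thing I would check.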
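Installing Telegraf
Telegraf is the companion data-collection agent for InfluxDB. Its package is also on the InfluxDB download page; pick the appropriate Windows build. After downloading, create a Telegraf directory under the system's Program Files folder and copy the telegraf.conf file from the Telegraf package to C:\Program Files\Telegraf.
(https://dl.influxdata.com/telegraf/releases/telegraf-1.16.2_windows_amd64.zip)
Start PowerShell as administrator, cd to D:\grafana\monitor\telegraf, run mkdir Telegraf, and then run: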
telegraf.exe --service install
Edit telegraf.conf in the C:\Program Files\Telegraf directory:
###############################################################################
# INPUTS #
###############################################################################
# Windows Performance Counters plugin.
# These are the recommended method of monitoring system metrics on windows,
# as the regular system plugins (inputs.cpu, inputs.mem, etc.) rely on WMI,
# which utilize more system resources.
#
# See more configuration examples at:
# https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_perf_counters
[[inputs.win_perf_counters]]
[[inputs.win_perf_counters.object]]
# Processor usage, alternative to native, reports on a per core.
ObjectName = "Processor"
Instances = ["*"]
Counters = [
"% Idle Time",
"% Interrupt Time",
"% Privileged Time",
"% User Time",
"% Processor Time",
"% DPC Time",
]
Measurement = "win_cpu"
# Set to true to include _Total instance when querying for all (*).
IncludeTotal=true
[[inputs.win_perf_counters.object]]
# Disk times and queues
ObjectName = "LogicalDisk"
Instances = ["*"]
Counters = [
"% Idle Time",
"% Disk Time",
"% Disk Read Time",
"% Disk Write Time",
"Current Disk Queue Length",
"% Free Space",
"Free Megabytes",
]
Measurement = "win_disk"
# Set to true to include _Total instance when querying for all (*).
#IncludeTotal=false
[[inputs.win_perf_counters.object]]
ObjectName = "PhysicalDisk"
Instances = ["*"]
Counters = [
"Disk Read Bytes/sec",
"Disk Write Bytes/sec",
"Current Disk Queue Length",
"Disk Reads/sec",
"Disk Writes/sec",
"% Disk Time",
"% Disk Read Time",
"% Disk Write Time",
]
Measurement = "win_diskio"
[[inputs.win_perf_counters.object]]
ObjectName = "Network Interface"
Instances = ["*"]
Counters = [
"Bytes Received/sec",
"Bytes Sent/sec",
"Packets Received/sec",
"Packets Sent/sec",
"Packets Received Discarded",
"Packets Outbound Discarded",
"Packets Received Errors",
"Packets Outbound Errors",
]
Measurement = "win_net"
[[inputs.win_perf_counters.object]]
ObjectName = "System"
Counters = [
"Context Switches/sec",
"System Calls/sec",
"Processor Queue Length",
"System Up Time",
]
Instances = ["------"]
Measurement = "win_system"
# Set to true to include _Total instance when querying for all (*).
#IncludeTotal=false
[[inputs.win_perf_counters.object]]
# Example query where the Instance portion must be removed to get data back,
# such as from the Memory object.
ObjectName = "Memory"
Counters = [
"Available Bytes",
"Cache Faults/sec",
"Demand Zero Faults/sec",
"Page Faults/sec",
"Pages/sec",
"Transition Faults/sec",
"Pool Nonpaged Bytes",
"Pool Paged Bytes",
"Standby Cache Reserve Bytes",
"Standby Cache Normal Priority Bytes",
"Standby Cache Core Bytes",
]
# Use 6 x - to remove the Instance bit from the query.
Instances = ["------"]
Measurement = "win_mem"
# Set to true to include _Total instance when querying for all (*).
#IncludeTotal=false
[[inputs.win_perf_counters.object]]
# Example query where the Instance portion must be removed to get data back,
# such as from the Paging File object.
ObjectName = "Paging File"
Counters = [
"% Usage",
]
Instances = ["_Total"]
Measurement = "win_swap"
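One thing that is easy to trip over later in Grafana: the counter names above do not appear verbatim as field names in InfluxDB. As far as I understand the win_perf_counters plugin, it rewrites them so that, for example, "% Processor Time" becomes Percent_Processor_Time. The sketch below mirrors that renaming as I understand it; treat the exact replacement rules as an assumption and check them against the real field names in your database.

```python
def sanitize_counter_name(counter):
    """Approximate how a Windows counter name turns into an InfluxDB field key.

    Assumed rules: "/sec" -> "_persec", space -> "_", "%" -> "Percent",
    and backslashes are dropped.
    """
    replacements = (
        ("/sec", "_persec"),
        ("/Sec", "_persec"),
        (" ", "_"),
        ("%", "Percent"),
        ("\\", ""),
    )
    for old, new in replacements:
        counter = counter.replace(old, new)
    return counter


if __name__ == "__main__":
    for name in ("% Processor Time", "Disk Read Bytes/sec", "Free Megabytes"):
        print(name, "->", sanitize_counter_name(name))
```

Knowing the mapping helps when writing queries against win_cpu, win_disk and the other measurements defined in the config.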
To verify that it works, run:
C:\Program Files\Telegraf>telegraf.exe --config telegraf.conf --test
Then start the service from the command line:
net start telegraf
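Other Telegraf operations
Telegraf can manage its own service through the --service flag:
telegraf.exe --service install # install the service
telegraf.exe --service uninstall # remove the service
telegraf.exe --service start # start the service
telegraf.exe --service stop # stop the service
OK, collection and storage are now set up. Switch to the window where the InfluxDB instance is running and you can see the collected data being written.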
Integrating InfluxDB
Edit telegraf.conf in the C:\Program Files\Telegraf directory and find the OUTPUTS section:
###############################################################################
# OUTPUTS #
###############################################################################
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
urls = ["http://127.0.0.1:8086"]
## The target database for metrics; will be created as needed.
database = "telegraf"
## If true, no CREATE DATABASE queries will be sent. Set to true when using
## Telegraf with a user without permissions to create databases or when the
## database already exists.
# skip_database_creation = false
## Name of existing retention policy to write to. Empty string writes to
## the default retention policy. Only takes effect when using HTTP.
# retention_policy = ""
## Write consistency (clusters only), can be: "any", "one", "quorum", "all".
## Only takes effect when using HTTP.
# write_consistency = "any"
## Timeout for HTTP messages.
timeout = "5s"
## HTTP Basic Auth
username = "telegraf"
password = "telegraf"
## HTTP User-Agent
# user_agent = "telegraf"
## UDP payload size is the maximum packet size to send.
# udp_payload = "512B"
## Optional TLS Config for use on HTTP connections.
# tls_ca = "/etc/telegraf/ca.pem"
# tls_cert = "/etc/telegraf/cert.pem"
# tls_key = "/etc/telegraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = false
## HTTP Proxy override, if unset values the standard proxy environment
## variables are consulted to determine which proxy, if any, should be used.
# http_proxy = "http://corporate.proxy:3128"
## Additional HTTP headers
# http_headers = {"X-Special-Header" = "Special-Value"}
## HTTP Content-Encoding for write request body, can be set to "gzip" to
## compress body or "identity" to apply no encoding.
# content_encoding = "identity"
## When true, Telegraf will output unsigned integers as unsigned values,
## i.e.: "42u". You will need a version of InfluxDB supporting unsigned
## integer values. Enabling this option will result in field type errors if
## existing data has been written.
# influx_uint_support = false
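To illustrate what this output section boils down to, here is a sketch of the kind of HTTP request that goes to InfluxDB's /write endpoint: a POST with the database name in the query string, basic auth, and line-protocol text in the body. The function name and the sample point are my own assumptions; the URL, database and credentials are the ones from the config above. Telegraf's actual request has more details (batching, user agent, optional gzip).

```python
import base64
import urllib.parse
import urllib.request


def build_write_request(base_url, database, username, password, lines):
    """Build (but do not send) a POST to the InfluxDB 1.x /write endpoint."""
    query = urllib.parse.urlencode({"db": database, "precision": "ns"})
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/write?{query}",
        data="\n".join(lines).encode("utf-8"),  # one point per line
        method="POST",
    )
    # HTTP basic auth, matching the username/password settings in the config.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


if __name__ == "__main__":
    # Hypothetical sample point; host name and timestamp are made up.
    req = build_write_request(
        "http://127.0.0.1:8086", "telegraf", "telegraf", "telegraf",
        ["win_cpu,host=WIN-01 Percent_Processor_Time=12.5 1600000000000000000"],
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually send it.
```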
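Accessing InfluxDB with influx
We can also check in InfluxDB itself whether the data has been written, which doubles as a short demonstration of influx.exe. Working with InfluxDB is much like working with MySQL.
Run influx.exe in a terminal. You should see a CLI prompt where you can run queries.
For example, to list all databases in the instance: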
show databases;
To switch to a database, use the use command, for example:
use telegraf
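To display the data in a measurement, use an SQL-like statement. For example, to query the win_cpu measurement:
SELECT * FROM win_cpu
Appending LIMIT 1 (SELECT * FROM win_cpu LIMIT 1) returns only one row.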
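The same query can also be run without the CLI, through the /query endpoint of the InfluxDB 1.x HTTP API, which returns JSON with a results list containing series, columns and values. Here is a rough sketch of building the URL and flattening such a response into rows; the function names and the sample response are illustrative assumptions, not output captured from my instance.

```python
import json
import urllib.parse


def build_query_url(base_url, database, influxql):
    """Build a GET URL for the InfluxDB 1.x /query endpoint."""
    params = urllib.parse.urlencode({"db": database, "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"


def rows_from_response(payload):
    """Flatten a /query JSON response into a list of {column: value} dicts."""
    rows = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            for values in series["values"]:
                rows.append(dict(zip(series["columns"], values)))
    return rows


if __name__ == "__main__":
    print(build_query_url("http://127.0.0.1:8086", "telegraf",
                          "SELECT * FROM win_cpu LIMIT 1"))
    # Made-up response in the shape the 1.x API uses.
    sample = json.loads(
        '{"results": [{"statement_id": 0, "series": [{"name": "win_cpu",'
        ' "columns": ["time", "Percent_Processor_Time"],'
        ' "values": [["2020-01-01T00:00:00Z", 12.5]]}]}]}'
    )
    print(rows_from_response(sample))
```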
Verify the configuration made earlier:
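Installing Grafana
Installing Grafana is similar to installing the other two components. Download the zip package from the download page on the Grafana website and extract it to the installation directory. In an administrator terminal, change to the bin directory under the Grafana folder and run grafana-server.exe.
By default Grafana listens on port 3000, and the default login is admin/admin (after the first login you are prompted to change the password). After logging in, you are prompted to set up a data source. By default the InfluxDB instance runs on port 8086. We configure it as follows: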
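For reference, the data source can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. The sketch below only builds the request. The data source name, the local addresses and the exact payload fields are assumptions on my part (newer Grafana versions may expect some settings elsewhere in the payload), and authentication is left out entirely, so a real call would additionally need admin credentials or an API key.

```python
import json
import urllib.request


def build_datasource_request(grafana_url="http://127.0.0.1:3000",
                             influx_url="http://127.0.0.1:8086",
                             database="telegraf"):
    """Build (but do not send) a request that registers InfluxDB in Grafana."""
    payload = {
        "name": "InfluxDB",      # display name, chosen arbitrarily
        "type": "influxdb",
        "access": "proxy",       # Grafana's backend talks to InfluxDB
        "url": influx_url,
        "database": database,
    }
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_datasource_request()
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```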
Integrating a Grafana dashboard
Visit https://grafana.com/grafana/dashboards/1555 to download a suitable dashboard template.
You can either copy the ID or download the JSON.
Monitoring result
The final Grafana dashboard looks like this: