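I am new to performance monitoring with Grafana, so here are some brief notes.

Every tool in this Windows monitoring stack (Grafana, InfluxDB, and Telegraf) can run on a Windows instance.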

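Monitoring architecture

A monitoring system generally consists of three parts: a metrics collector, data storage, and a visualization tool (UI):

Metrics collector: gathers monitoring data from the system or from agents. It usually consists of a monitoring agent plus some data collection scripts. Common collectors include Zabbix Agent, Telegraf, CollectD, StatsD, Datadog, Pushgateway, and other tools that collect metrics.

Data storage: the database that holds the monitoring data, often a time series database. Examples are MySQL, RRDtool, Elasticsearch, and InfluxDB, which this article uses.

Visualization tools: the Zabbix PHP frontend, Nagios, Grafana, Chronograf, and so on.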

The monitoring architecture used in this article is shown in the figure below:

[Figure: monitoring architecture]

 

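Telegraf periodically queries the Windows Performance Counters API for monitoring data and sends the results to the InfluxDB database. Grafana queries that data through the InfluxDB data interface, displays it in dashboards, and raises alerts based on alert thresholds.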
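To make the "sends the results to InfluxDB" step more concrete, here is a minimal sketch of the InfluxDB line protocol, the text format in which points are written to InfluxDB 1.x. The helper names (`to_line_protocol`, `format_field`) and the sample tag and field names are illustrative assumptions, not Telegraf's actual code, and backslash handling in keys is deliberately left out.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` (simplified escaping)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value) -> str:
    """Render a field value the way line protocol expects it."""
    if isinstance(value, bool):  # bool must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns=None) -> str:
    """Build one line: measurement,tag=value field=value [timestamp]."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write speed
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # Hypothetical sample point; real tag and field names depend on the Telegraf config.
    print(to_line_protocol(
        "win_cpu",
        {"host": "WIN-01", "instance": "_Total"},
        {"Percent_Processor_Time": 12.5},
        1600000000000000000,
    ))
```

Each measurement configured in telegraf.conf (win_cpu, win_disk, and so on) ends up as the first token of lines like this one.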

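System deployment

From the architecture we know that three components need to be deployed: Telegraf, InfluxDB, and Grafana. The steps below describe how to deploy and configure each of them. All three can be downloaded for free from their official websites; download the Windows 64-bit versions.

InfluxDB installation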


Download link: https://dl.influxdata.com/influxdb/releases/influxdb-1.7.3_windows_amd64.zip (I chose version 1.7.3.)

Save the zip package to a directory (this article uses D:\monitor). Extract the archive; the program consists mainly of the following files:

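influx.exe: the InfluxDB client CLI, used to access the database and for testing.

influxd.exe: the InfluxDB server program. This is the main executable; we use it to start an InfluxDB instance.

influx_stress.exe: the InfluxDB stress testing tool.

influx_inspect.exe: the InfluxDB disk and shard inspection tool.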

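After extracting the package, open a Windows PowerShell terminal as administrator and change to the directory D:\grafana\monitor\influxdb-1.7.3-1

Then run influxd.exe directly to start an InfluxDB instance, as shown in the figure below:

[Figure: InfluxDB instance starting up]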

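As the figure above shows, we must allow this program through Windows Defender Firewall (WDF). For convenience, we start the InfluxDB instance directly here. For production use, it should be registered as a Windows service and started and managed as a service.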
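Once influxd.exe is running, one quick way to confirm that the instance is reachable is the `/ping` endpoint of the InfluxDB 1.x HTTP API, which answers with status 204 when the server is up. The sketch below assumes the default address 127.0.0.1:8086; the function names are made up for illustration.

```python
import urllib.error
import urllib.request


def influxdb_ping_url(host: str = "127.0.0.1", port: int = 8086) -> str:
    """Build the URL of the InfluxDB 1.x health-check endpoint."""
    return f"http://{host}:{port}/ping"


def influxdb_is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers /ping with 204 No Content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 204
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, blocked by the firewall, etc.
        return False


if __name__ == "__main__":
    url = influxdb_ping_url()
    print(url, "reachable" if influxdb_is_up(url) else "not reachable")
```

If this reports "not reachable" from another machine but works locally, the firewall rule mentioned above is a likely cause.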

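Telegraf installation

Telegraf is the companion data collection agent for InfluxDB. Its package is also available on the InfluxDB downloads page; choose the appropriate Windows version. After downloading, create a Telegraf directory under the Program Files folder of the current system and copy the telegraf.conf file from the Telegraf package into C:\Program Files\Telegraf.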


Download link: https://dl.influxdata.com/telegraf/releases/telegraf-1.16.2_windows_amd64.zip

Start PowerShell as administrator. cd to D:\grafana\monitor\telegraf, mkdir Telegraf, then run:

telegraf.exe --service install

Edit telegraf.conf in the C:\Program Files\Telegraf directory:

###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Windows Performance Counters plugin.
# These are the recommended method of monitoring system metrics on windows,
# as the regular system plugins (inputs.cpu, inputs.mem, etc.) rely on WMI,
# which utilize more system resources.
#
# See more configuration examples at:
#   https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_perf_counters

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Processor usage, alternative to native, reports on a per core.
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Interrupt Time",
      "% Privileged Time",
      "% User Time",
      "% Processor Time",
      "% DPC Time",
    ]
    Measurement = "win_cpu"
    # Set to true to include _Total instance when querying for all (*).
    IncludeTotal=true

  [[inputs.win_perf_counters.object]]
    # Disk times and queues
    ObjectName = "LogicalDisk"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
      "Current Disk Queue Length",
      "% Free Space",
      "Free Megabytes",
    ]
    Measurement = "win_disk"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    ObjectName = "PhysicalDisk"
    Instances = ["*"]
    Counters = [
      "Disk Read Bytes/sec",
      "Disk Write Bytes/sec",
      "Current Disk Queue Length",
      "Disk Reads/sec",
      "Disk Writes/sec",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
    ]
    Measurement = "win_diskio"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Network Interface"
    Instances = ["*"]
    Counters = [
      "Bytes Received/sec",
      "Bytes Sent/sec",
      "Packets Received/sec",
      "Packets Sent/sec",
      "Packets Received Discarded",
      "Packets Outbound Discarded",
      "Packets Received Errors",
      "Packets Outbound Errors",
    ]
    Measurement = "win_net"

  [[inputs.win_perf_counters.object]]
    ObjectName = "System"
    Counters = [
      "Context Switches/sec",
      "System Calls/sec",
      "Processor Queue Length",
      "System Up Time",
    ]
    Instances = ["------"]
    Measurement = "win_system"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Memory object.
    ObjectName = "Memory"
    Counters = [
      "Available Bytes",
      "Cache Faults/sec",
      "Demand Zero Faults/sec",
      "Page Faults/sec",
      "Pages/sec",
      "Transition Faults/sec",
      "Pool Nonpaged Bytes",
      "Pool Paged Bytes",
      "Standby Cache Reserve Bytes",
      "Standby Cache Normal Priority Bytes",
      "Standby Cache Core Bytes",

    ]
    # Use 6 x - to remove the Instance bit from the query.
    Instances = ["------"]
    Measurement = "win_mem"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Paging File object.
    ObjectName = "Paging File"
    Counters = [
      "% Usage",
    ]
    Instances = ["_Total"]
    Measurement = "win_swap"
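Each object block above corresponds to a set of Windows performance counter paths of the form `\Object(Instance)\Counter`. The sketch below shows how ObjectName, Instances, and Counters combine, including the special "------" instance that drops the instance part, as the config comments describe. It only illustrates the path syntax; how Telegraf assembles its queries internally is an assumption, and `counter_paths` is a hypothetical helper.

```python
def counter_paths(object_name, instances, counters):
    """Expand one win_perf_counters object block into counter paths."""
    paths = []
    for instance in instances:
        for counter in counters:
            if instance == "------":
                # Six dashes: the object has no instances (e.g. Memory).
                paths.append(f"\\{object_name}\\{counter}")
            else:
                paths.append(f"\\{object_name}({instance})\\{counter}")
    return paths


if __name__ == "__main__":
    for path in counter_paths("Processor", ["*"], ["% Idle Time", "% User Time"]):
        print(path)
    for path in counter_paths("Memory", ["------"], ["Available Bytes"]):
        print(path)
```

Paths in this form can be compared against the counters listed by Windows Performance Monitor when checking a counter name for typos.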

To verify that the configuration works, run:

C:\Program Files\Telegraf>telegraf.exe --config telegraf.conf --test

Then start the service from the command line:

net start telegraf

Other operations
Telegraf can manage its own service through --service:

telegraf.exe --service install      # install the service
telegraf.exe --service uninstall    # remove the service
telegraf.exe --service start        # start the service
telegraf.exe --service stop         # stop the service

 

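OK, collection and storage are now set up. If you switch to the window where the InfluxDB instance was started, you can see the write operations for the collected data.

[Figure: InfluxDB console showing incoming writes]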

Integrating InfluxDB

Edit telegraf.conf in the C:\Program Files\Telegraf directory.

Find the OUTPUTS section:

###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  # urls = ["unix:///var/run/influxdb.sock"]
  # urls = ["udp://127.0.0.1:8089"]
  urls = ["http://127.0.0.1:8086"]

  ## The target database for metrics; will be created as needed.
  database = "telegraf"

  ## If true, no CREATE DATABASE queries will be sent.  Set to true when using
  ## Telegraf with a user without permissions to create databases or when the
  ## database already exists.
  # skip_database_creation = false

  ## Name of existing retention policy to write to.  Empty string writes to
  ## the default retention policy.  Only takes effect when using HTTP.
  # retention_policy = ""

  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all".
  ## Only takes effect when using HTTP.
  # write_consistency = "any"

  ## Timeout for HTTP messages.
  timeout = "5s"

  ## HTTP Basic Auth
  username = "telegraf"
  password = "telegraf"

  ## HTTP User-Agent
  # user_agent = "telegraf"

  ## UDP payload size is the maximum packet size to send.
  # udp_payload = "512B"

  ## Optional TLS Config for use on HTTP connections.
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Proxy override, if unset values the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  # content_encoding = "identity"

  ## When true, Telegraf will output unsigned integers as unsigned values,
  ## i.e.: "42u".  You will need a version of InfluxDB supporting unsigned
  ## integer values.  Enabling this option will result in field type errors if
  ## existing data has been written.
  # influx_uint_support = false
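For readers curious what this output section amounts to on the wire, here is a rough sketch of an HTTP write against the InfluxDB 1.x `/write` endpoint, using the same URL, database, and basic-auth credentials as the config above. `build_write_request` is a hypothetical helper and the sample point is invented; Telegraf's real client also handles batching, retries, and compression, which this sketch ignores.

```python
import base64
import urllib.parse
import urllib.request


def build_write_request(base_url, database, lines, username=None, password=None):
    """Prepare a POST to /write?db=<database> carrying line-protocol text."""
    query = urllib.parse.urlencode({"db": database})
    url = f"{base_url.rstrip('/')}/write?{query}"
    body = "\n".join(lines).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    if username is not None and password is not None:
        token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


if __name__ == "__main__":
    request = build_write_request(
        "http://127.0.0.1:8086",
        "telegraf",
        ["demo_measurement,host=WIN-01 value=1.0"],  # made-up sample point
        username="telegraf",
        password="telegraf",
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=5)
    # InfluxDB 1.x replies 204 No Content when the write succeeds.
```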

 

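Accessing InfluxDB with influx

We can also check inside InfluxDB whether the data has arrived, which doubles as a short demonstration of influx.exe. Working with InfluxDB through influx is very similar to working with MySQL.

Run influx.exe in a terminal. A CLI prompt should appear, and you can then run queries.

For example, to list the databases in the instance, the command is:

show databases;

To switch to a database, use the use command, for example:

use telegraf

To display the data in a measurement (the equivalent of a table), use an SQL-like statement. For example, to query the win_cpu measurement: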

SELECT * FROM win_cpu

Appending LIMIT 1 to the query returns only one row.
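The same query can also be issued without the CLI, through the InfluxDB 1.x HTTP `/query` endpoint, which takes the database and the InfluxQL statement as URL parameters and returns JSON. The sketch below builds such a URL and flattens the documented response shape (results, then series, each with columns and values) into dictionaries. The helper names and the sample payload are illustrative assumptions.

```python
import urllib.parse


def build_query_url(base_url: str, database: str, influxql: str) -> str:
    """Build a GET URL for the InfluxDB 1.x /query endpoint."""
    params = urllib.parse.urlencode({"db": database, "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"


def rows_from_response(payload: dict) -> list:
    """Turn the JSON response into a list of {column: value} dicts."""
    rows = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            columns = series["columns"]
            for values in series["values"]:
                rows.append(dict(zip(columns, values)))
    return rows


if __name__ == "__main__":
    print(build_query_url(
        "http://127.0.0.1:8086", "telegraf", "SELECT * FROM win_cpu LIMIT 1"
    ))
    # Hand-written sample in the shape InfluxDB 1.x returns; values are made up.
    sample = {"results": [{"series": [{
        "name": "win_cpu",
        "columns": ["time", "host"],
        "values": [["2020-01-01T00:00:00Z", "WIN-01"]],
    }]}]}
    print(rows_from_response(sample))
```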

Verify the configuration made earlier:

[Figure: query output confirming the configuration]

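Grafana installation

Installing Grafana is similar to installing the other two components. Download the zip package from the downloads page of the Grafana website and extract it to the installation directory. In an administrator terminal, change to the bin directory of the Grafana folder and run grafana-server.exe.

By default, Grafana listens on port 3000, and the default login user and password are admin/admin (you are prompted to change the password after the first login). After logging in, you are prompted to set up a data source. By default, the InfluxDB instance runs on port 8086. We configure it as follows: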

[Figures: Grafana data source configuration screenshots]
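The data source can also be created through the Grafana HTTP API (`POST /api/datasources`) instead of the UI. The sketch below only builds the JSON body, using the URL and database name from this article. The field layout follows the classic InfluxDB (InfluxQL) data source and is an assumption that may differ between Grafana versions; credentials are left out, and the function name is made up.

```python
import json


def influxdb_datasource_payload(
    name: str = "InfluxDB",
    url: str = "http://127.0.0.1:8086",
    database: str = "telegraf",
) -> dict:
    """Body for POST /api/datasources (assumed classic InfluxQL layout)."""
    return {
        "name": name,
        "type": "influxdb",
        "access": "proxy",  # Grafana's backend queries InfluxDB on the user's behalf
        "url": url,
        "database": database,
        "isDefault": True,
    }


if __name__ == "__main__":
    # Send this JSON to http://localhost:3000/api/datasources with admin credentials.
    print(json.dumps(influxdb_datasource_payload(), indent=2))
```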

Integrating a Grafana dashboard

Visit https://grafana.com/grafana/dashboards/1555 and download a suitable dashboard template.

You can either copy the dashboard ID or download the JSON.

 

Monitoring results

The final Grafana dashboard looks like this:

[Figure: final Grafana dashboard]