性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控

原创

高楼（Zee） 2023-04-06 13:47:01 ©著作权

文章标签 sed oracle 数据库 文章分类 HarmonyOS 后端开发

©著作权归作者所有：来自51CTO博客作者高楼（Zee）的原创作品，请联系作者获取转载授权，否则将追究法律责任

性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控_数据库

前言

监控集成选型的 Telegraf 探针，最近需要实现对 Oracle 数据库的做实时监控，查了下 Telegraf 竟然还不支持 Oracle 监控，WTF？于是自己研究了下，通过 Python + SQL 脚本折中解决了，此文去且当作小结。

实现的效果

性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控_oracle_02

预备知识

Oracle动态性能视图

动态性能视图属于数据字典，它们的所有者为SYS，并且多数动态性能视图只能由特权用户和DBA用户查询。当数据库处于不同状态时，可以访问的动态性能视图有所不同。启动例程时，ORACLE会自动建立动态性能视图；停止例程时，ORACLE会自动删除动态性能视图。数据字典信息是从数据文件中获得，而动态性能视图信息是从SGA和控制文件取得。所以，两者所反映的信息还是有很大差异的。数据库管理员利用这些动态性能视图，可以了解数据库运行的一些基本信息，为我们进行数据库维护以及数据库性能优化提供一些数据上的支持。所有动态性能视图都是以 V_$ 开始的，Oracle 为每个动态性能视图提供了相应的同义词（V$开头）

通过查询 V$FIXED_TABLE ,可以列出所有可用的动态性能视图和动态性能表。

1. SQL> select  * from V$FIXED_TABLE  where name like 'V$%';
2. 
3. NAME                            OBJECT_ID TYPE   TABLE_NUM
4. ------------------------------ ---------- ----- ----------
5. V$WAITSTAT                     4294950915 VIEW       65537
6. V$BH                           4294951406 VIEW       65537
7. V$GC_ELEMENT                   4294951794 VIEW       65537
8. V$CR_BLOCK_SERVER              4294951796 VIEW       65537
9. V$CURRENT_BLOCK_SERVER         4294952095 VIEW       65537
10. V$POLICY_HISTORY               4294953128 VIEW       65537
11. V$ENCRYPTED_TABLESPACES        4294952996 VIEW       65537
12. V$GC_ELEMENTS_WITH_COLLISIONS  4294951798 VIEW       65537
13. V$FILE_CACHE_TRANSFER          4294951800 VIEW       65537
14. V$TEMP_CACHE_TRANSFER          4294951802 VIEW       65537
15. V$CLASS_CACHE_TRANSFER         4294951804 VIEW       65537
16. V$INSTANCE_CACHE_TRANSFER      4294952151 VIEW       65537
17. V$LOCK_ELEMENT                 4294951408 VIEW       65537
18. V$BSP                          4294951594 VIEW       65537
19. V$LOCKS_WITH_COLLISIONS        4294951410 VIEW       65537
20. V$FILE_PING                    4294951412 VIEW       65537
21. V$TEMP_PING                    4294951532 VIEW       65537
22. V$CLASS_PING                   4294951414 VIEW       65537
23. V$LOCK_ACTIVITY                4294951437 VIEW       65537
24. V$ROWCACHE                     4294950916 VIEW       65537

以下是不同类型的指标视图的快速表格比较：性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控_sed_03 该表的第一行是经典的等待事件和统计视图。以下几行是度量标准视图。度量标准视图是在 Oracle10g 中引入的。

度量视图计算增量和速率，这极大地简化了解决简单问题的能力，比如 “现在我的数据库的I/O速率是多少?” 这个问题，在10g之前，处理起来出奇的乏味。要回答这个问题，你必须查询 v$sysstat，例如:

Select value from v$sysstat where name='physical reads';

但是仅查询一次 v$sysstat 不能解决问题，而是“自数据库启动以来已完成了多少I / O？”的问题。要回答原始问题，必须两次查询 v$sysstat 并接受两个值之间的增量：

在时间A取值
在时间B取值
Delta = (B-A)
and/or get Rate = (B-A)/elapsed time

获得这些差值和速率可能是一项艰巨的工作。然后 10gOracle 引入了度量标准表，这些度量表可以在一个查询中解决问题。

等待事件视图为（系统级别）

V$SYSTEM_EVENT – 自启动以来累积的等待事件
V$EVENTMETRIC - 等待事件增量持续60秒
DBA_HIST_SYSTEM_EVENT – 自启动以来累计的上周按快照（小时）的等待事件

等待事件汇总到称为等待类的组中。对于等待类，有以下视图：

V$SYSTEM_WAIT_CLASS – 自启动以来累积
V$WAITCLASSMETRIC – 持续60秒增量
V$WAITCLASSMETRIC_HISTORY – 最后一小时的60秒增量

注意： DBA_HIST_WAITCLASSMETRIC_HISTORY 用于警报或基准，而不是日常值。

其他的就不一一展开了，具体可以参考下文：

http://datavirtualizer.com/wait-event-and-wait-class-metrics-vs-vsystem_event/

cx_Oracle

cx_Oracle 是一个 Python 扩展模块，可以访问 Oracle 数据库。它符合 Python 数据库API 2.0 规范。

基本要求

要在 Python 和 Oracle 数据库中使用 cx_Oracle7，需要满足以下条件：

Python 2.7或 3.5 及更高版本。
Oracle 客户端库。
Oracle 数据库。Oracle的标准客户端 - 服务器版本互操作性允许 cx_Oracle连接到较旧和较新的数据库。(推荐)

快速安装

在 Linux 上安装 cx_Oracle 的一般方法是使用 Python 的 Pip 包从 PyPI 安装 cx_Oracle ：

从 PyPI 安装 cx_Oracle：

1. python -m pip install cx_Oracle --upgrade

将 Oracle 客户端库添加到操作系统库搜索路径，例如 Linux 的 LDLIBRARYPATH

如果你的数据库位于远程计算机上，请下适用于你的操作系统体系结构的免费Oracle Instant Client “Basic” 或 “Basic Light” 包

至于具体的 OracleClient 安装，可以参考下文：

https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html#installing-cx-oracle-on-linux

解决方案

Python：收集 Oracle 指标数据
Telegraf：收集 Python 打印的性能指标数据
InfluxDB：存储时间序列 Oracle 性能指标数据
Grafana：可视化 Dashboard

安装

具体的安装可以参考官方文档：

Telegraf：https://docs.influxdata.com/telegraf/v1.12/introduction/installation/
InfluxDB：https://docs.influxdata.com/influxdb/v1.7/introduction/installation/
Grafana：https://grafana.com/docs/installation/rpm/

具体设置

在 InfluxDB 中创建一个 Telegraf 数据库：

1. [root@zuozewei ~]# influx
2. Connected to http://localhost:8086 version 1.6.2
3. InfluxDB shell version: 1.6.2
4. > create user "telegraf" with password 'telegraf'
5. > create database telegraf
6. > show databases
7. name: databases
8. name
9. ----
10. _internal
11. telegraf

编写 python+sql 脚本以收集 oracle 指标。脚本的输出内容很重要，必须是 InfluxDB line-protocol。该脚本查询 v$ SYSMETRIC 和 v$eventmetric ，获得最后一分钟时，等待类和等待事件指标。

python代码是：

1. import socket,argparse,subprocess,re,cx_Oracle
2. 
3. fqdn = socket.getfqdn()
4. 
5. class OraStats():
6. 
7.     def __init__(self, user, passwd, sid):
8.         self.user = user
9.         self.passwd = passwd
10.         self.sid = sid
11.         self.delengine = "none"
12.         connstr=self.user+'/'+self.passwd+'@'+self.sid
13.         self.connection = cx_Oracle.connect(connstr)
14.         cursor = self.connection.cursor()
15.         cursor.execute("select distinct(SVRNAME)  from v$dnfs_servers")
16.         rows = cursor.fetchall()
17. 
18.         for i in range(0, cursor.rowcount):
19.             self.dengine_ip = rows[i][0]
20.             proc = subprocess.Popen(["nslookup", self.dengine_ip], stdout=subprocess.PIPE)
21.             lookupresult = proc.communicate()[0].split('\n')
22. 
23.             for line in lookupresult:
24.                 if 'name=' in re.sub(r'\s', '', line):
25.                     self.delengine = re.sub('\..*$', '', re.sub(r'^.*name=', '', re.sub(r'\s', '', re.sub(r'.$', '', line))))
26. 
27.     # 等待类别
28.     def waitclassstats(self, user, passwd, sid, format):
29.         cursor = self.connection.cursor()
30.         cursor.execute("""
31.         select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
32.         from   v$waitclassmetric  m, v$system_wait_class n
33.         where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'
34.         union
35.         select  'CPU', round(value/100,3) AAS
36.         from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2
37.         union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu
38.         from
39.         ( select value busy
40.         from v$sysmetric
41.         where metric_name='Host CPU Utilization (%)'
42.          and group_id=2 ) prcnt,
43.         ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,
44.         ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas
45.         """)
46.         for wait in cursor:
47.             wait_name = wait[0]
48.             wait_value = wait[1]
49.             print ("oracle_wait_class,fqdn={0},delphix={1},db={2},wait_class={3} wait_value={4}".format(fqdn, self.delengine, sid, re.sub(' ', '_', wait_name), wait_value))
50. 
51.     # 系统指标
52.     def sysmetrics(self, user, passwd, sid, format):
53.         cursor = self.connection.cursor()
54.         cursor.execute("""
55.         select METRIC_NAME,VALUE,METRIC_UNIT from v$sysmetric where group_id=2
56.         """)
57.         for metric in cursor:
58.             metric_name = metric[0]
59.             metric_value = metric[1]
60.             print ("oracle_sysmetric,fqdn={0},delphix={1},db={2},metric_name={3} metric_value={4}".format(fqdn,self.delengine,sid,re.sub(' ', '_', metric_name),metric_value))
61. 
62.     # 在闪回恢复区中有关磁盘配额和当前磁盘使用情况
63.     def fraused(self, user, passwd, sid, format):
64.         cursor = self.connection.cursor()
65.         cursor.execute("""
66.         select round((SPACE_USED-SPACE_RECLAIMABLE)*100/SPACE_LIMIT,1) from  V$RECOVERY_FILE_DEST
67.         """)
68.         for frau in cursor:
69.             fra_used = frau[0]
70.             print ("oracle_fra_pctused,fqdn={0},delphix={1},db={2} fra_pctused={3}".format(fqdn,self.delengine,sid,fra_used))
71. 
72.     # 磁盘使用状态
73.     def fsused(self):
74.      fss = ['/oracle', '/data']
75.      for fs in fss:
76.             df = subprocess.Popen(["df","-P",fs], stdout=subprocess.PIPE)
77.             output = df.communicate()[0]
78.             total = re.sub('%','',output.split("\n")[1].split()[1])
79.             used = re.sub('%','',output.split("\n")[1].split()[2])
80.             pctused = re.sub('%','',output.split("\n")[1].split()[4])
81.             print("oracle_fs_pctused,fqdn={0},fs_name={1} oraclefs_pctused={2},oraclefs_alloc={3},oraclefs_used={4}".format(fqdn,fs,pctused,total,used))
82. 
83.     # 等待状态
84.     def waitstats(self, user, passwd, sid, format):
85.         cursor = self.connection.cursor()
86.         cursor.execute("""
87.         select /*+ ordered use_hash(n) */
88.         n.wait_class wait_class,
89.         n.name wait_name,
90.         m.wait_count  cnt,
91.         nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms
92.         from v$eventmetric m,
93.         v$event_name n
94.         where m.event_id=n.event_id
95.         and n.wait_class <> 'Idle' and m.wait_count > 0 order by 1""")
96.         for wait in cursor:
97.             wait_class = wait[0]
98.             wait_name = wait[1]
99.             wait_cnt = wait[2]
100.             wait_avgms = wait[3]
101.             print ("oracle_wait_event,fqdn={0},delphix={1},db={2},wait_class={3},wait_event={4} count={5},latency={6}".format(fqdn, self.delengine,sid,re.sub(' ', '_', wait_class), re.sub(' ','_',wait_name),wait_cnt,wait_avgms))
102. 
103.     # 表空间使用状态
104.     def tbsstats(self, user, passwd, sid, format):
105.         cursor = self.connection.cursor()
106.         cursor.execute("""
107.         select /*+ ordered */ tablespace_name,
108.             round(used_space),
109.             round(max_size-used_space) free_space,
110.             round(max_size),
111.             round(used_space*100/max_size,2) percent_used
112.             from (
113.                 select m.tablespace_name,
114.                 m.used_space*t.block_size/1024/1024 used_space,
115.                 (case when t.bigfile='YES' then power(2,32)*t.block_size/1024/1024
116.                         else tablespace_size*t.block_size/1024/1024 end) max_size
117.             from dba_tablespace_usage_metrics m, dba_tablespaces t
118.         where m.tablespace_name=t.tablespace_name)
119.         """)
120.         for tbs in cursor:
121.             tbs_name = tbs[0]
122.             used_space_mb = tbs[1]
123.             free_space_mb = tbs[2]
124.             max_size_mb = tbs[3]
125.             percent_used = tbs[4]
126.             print ("oracle_tablespaces,fqdn={0},delphix={1},db={2},tbs_name={3} used_space_mb={4},free_space_mb={5},percent_used={6},max_size_mb={7}".format(fqdn, self.delengine, sid, re.sub(' ', '_', tbs_name), used_space_mb,free_space_mb,percent_used,max_size_mb))
127. 
128. 
129. if __name__ == "__main__":
130.     parser = argparse.ArgumentParser()
131.     parser.add_argument('-f', '--format', help="Output format, default influx", choices=['kafka', 'influx'], default='influx')
132.     subparsers = parser.add_subparsers(dest='stat')
133.     parser_all = subparsers.add_parser('ALL', help="Get all database stats")
134.     parser_all.add_argument('-u', '--user', help="Username with sys views grant", required=True)
135.     parser_all.add_argument('-p', '--passwd', required=True)
136.     parser_all.add_argument('-s', '--sid', help="tnsnames SID to connect", required=True)
137. 
138.     args = parser.parse_args()
139. 
140.     if args.stat == "ALL":
141.         stats = OraStats(args.user, args.passwd, args.sid)
142.         stats.waitclassstats(args.user, args.passwd, args.sid, args.format)
143.         stats.waitstats(args.user, args.passwd, args.sid, args.format)
144.         stats.sysmetrics(args.user, args.passwd, args.sid, args.format)
145.         stats.tbsstats(args.user, args.passwd, args.sid, args.format)
146.         stats.fraused(args.user, args.passwd, args.sid, args.format)
147.         stats.fsused()

输出格式化为 InfluxDB line-protocol

1. [root@localhost tools]# ./oracle.sh
2. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Application wait_value=0
3. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU wait_value=0.003
4. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU_OS wait_value=0.778
5. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit wait_value=0
6. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency wait_value=0.001
7. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Configuration wait_value=0
8. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network wait_value=0
9. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other wait_value=0
10. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O wait_value=0.001
11. oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O wait_value=0
12. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit,wait_event=log_file_sync count=2,latency=0.122
13. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency,wait_event=os_thread_startup count=2,latency=21.595
14. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network,wait_event=SQL*Net_message_to_client count=17,latency=0.001
15. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other,wait_event=asynch_descriptor_resize count=4,latency=0.001
16. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=db_file_parallel_write count=2,latency=0.081
17. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_parallel_write count=24,latency=0.268
18. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_sequential_read count=71,latency=0.716
19. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=log_file_parallel_write count=7,latency=0.076
20. oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,laten

定义一个 shell 脚本执行 Python 脚本

1. #!/usr/bin/env bash
2. python /home/oracle/scripts/oracle_metrics.sh -f "influx" "ALL" "-u" "system" "-p" "xxxx" "-s" "172.16.106.251:1521/orcl"

在 oracle主机上，配置 telegraf 以60秒的间隔执行 python sh，然后将输出发送到 InfluxDB。编辑 /etc/telegraf/telegraf.conf 配置文件：

1. # Telegraf configuration
2. 
3. # Telegraf is entirely plugin driven. All metrics are gathered from the
4. # declared inputs, and sent to the declared outputs.
5. 
6. # Plugins must be declared in here to be active.
7. # To deactivate a plugin, comment out the name and any variables.
8. 
9. # Use 'telegraf -config telegraf.conf -test' to see what metrics a config
10. # file would generate.
11. 
12. # Global tags can be specified here in key="value" format.
13. [global_tags]
14.   # dc = "us-east-1" # will tag all metrics with dc=us-east-1
15.   # rack = "1a"
16.   host="Dprima"
17.   collector="telegraf"
18. 
19. # Configuration for telegraf agent
20. [agent]
21.   ## Default data collection interval for all inputs
22.   interval = "10s"
23.   ## Rounds collection interval to 'interval'
24.   ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
25.   round_interval = true
26. 
27.   ## Telegraf will cache metric_buffer_limit metrics for each output, and will
28.   ## flush this buffer on a successful write.
29.   metric_buffer_limit = 10000
30.   ## Flush the buffer whenever full, regardless of flush_interval.
31.   flush_buffer_when_full = true
32. 
33.   ## Collection jitter is used to jitter the collection by a random amount.
34.   ## Each plugin will sleep for a random time within jitter before collecting.
35.   ## This can be used to avoid many plugins querying things like sysfs at the
36.   ## same time, which can have a measurable effect on the system.
37.   collection_jitter = "0s"
38. 
39.   ## Default flushing interval for all outputs. You shouldn't set this below
40.   ## interval. Maximum flush_interval will be flush_interval + flush_jitter
41.   flush_interval = "60s"
42.   ## Jitter the flush interval by a random amount. This is primarily to avoid
43.   ## large write spikes for users running a large number of telegraf instances.
44.   ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
45.   flush_jitter = "0s"
46. 
47.   ## Run telegraf in debug mode
48.   debug = false
49.   ## Run telegraf in quiet mode
50.   quiet = false
51.   ## Override default hostname, if empty use os.Hostname()
52.   hostname = "Dprima"
53. 
54. 
55. ###############################################################################
56. #                                  OUTPUTS                                    #
57. ###############################################################################
58. 
59. # Configuration for influxdb server to send metrics to
60. [[outputs.influxdb]]
61.   urls = ["http://influxgraf:8086"] # required
62.   database = "telegraf" # required
63.   precision = "s"
64.   timeout = "5s"
65. 
66. [[outputs.influxdb]]
67.   urls = ["http://localhost:9092"] # required
68.   database = "kapacitor" # required
69.   precision = "s"
70.   retention_policy = "default"
71.   timeout = "5s"
72. 
73. #[[outputs.file]]
74. #  files=["/home/oracle/scripts/telegraf_debug.txt"]
75. ###############################################################################
76. #                                  INPUTS                                     #
77. ###############################################################################
78. 
79. # Oracle metrics
80. [[inputs.exec]]
81.   # Shell/commands array
82.   commands = ["/home/oracle/scripts/oracle_metrics.sh"]
83.   # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
84.   # NOTE json only reads numerical measurements, strings and booleans are ignored.
85.   data_format = "influx"
86.   interval = "60s"
87. 
88. ###############################################################################
89. #                              SERVICE INPUTS                                 #
90. ###############################################################################

启动 telegraf：

1. telegraf -config /etc/telegraf/telegraf.conf

数据可视化

查询 InfluxDB 数据库

1. [root@localhost log]# influx
2. Connected to http://localhost:8086 version 1.7.4
3. InfluxDB shell version: 1.7.4
4. Enter an InfluxQL query
5. > show databases
6. name: databases
7. name
8. ----
9. _internal
10. telegraf
11. > use telegraf
12. Using database telegraf
13. > show measurements
14. name: measurements
15. name
16. ----
17. oracle_fra_pctused
18. oracle_sysmetric
19. oracle_tablespaces
20. oracle_wait_class
21. oracle_wait_event
22. > select * from oracle_sysmetric limit 5
23. name: oracle_sysmetric
24. time                db                      delphix fqdn                  host                  metric_name                                   metric_value
25. ----                --                      ------- ----                  ----                  -----------                                   ------------
26. 1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Active_Parallel_Sessions                      0
27. 1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Active_Serial_Sessions                        1
28. 1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Average_Active_Sessions                       0.0138029495084
29. 1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Average_Synchronous_Single-Block_Read_Latency 0.5875
30. 1554277680000000000 172.16.14.251:1521/orcl none    localhost.localdomain localhost.localdomain Background_CPU_Usage_Per_Sec                  0.104149308449
31. >

Grafana 效果图如下：性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控_数据库_04