数据库的巡检是DBA工作中的一部分,有时候我们还是希望能够在巡检的基础上发现一些潜在的问题,把尽可能多的问题解决在初始阶段。
今天来给大家举一个数据库巡检和性能分析的例子。
首先拿到一个数据库服务器,了解系统信息是必要的,同时还要分析数据库的信息,然后尽可能发现是否存在性能瓶颈,然后需要做一个对比的分析。
系统信息
$ cat /etc/issue
Red Hat Enterprise Linux Server release 5.3(Tikanga)
Kernel \r on an \m
$ ksh cpuinfo.sh
**************************************
CPU Physical NO: 2
CPU Processor NO: 16
CPU Core NO: cpu cores : 4
CPU model name : Intel(R) Xeon(R) CPU E5620@ 2.40GHz
**************************************
top - 10:39:48 up 389 days, 2:28, 1 user, load average: 0.91, 0.91,0.80
Tasks: 1370 total, 1 running, 1363 sleeping, 0 stopped, 6 zombie
Cpu(s): 1.2%us, 0.2%sy, 0.0%ni, 96.8%id, 1.6%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 65996212k total, 65820480k used, 175732k free, 530412k buffers
Swap: 16779884k total, 236k used, 16779648k free, 17410172kcached
Hugepage已经启用了。
[oracle@acc136 bdump]$ cat /proc/meminfo | grep -i page
AnonPages: 4783576 kB
PageTables: 359020 kB
HugePages_Total: 20525
HugePages_Free: 60
HugePages_Rsvd: 16
Hugepagesize: 2048 kB
数据库级信息
数据库是10gR2,2014年启动至今
内存组件的使用情况
Cache Sizes
~~~~~~~~~~~ Begin End
--------------------
BufferCache: 39,472M 39,472M Std Block Size: 8K
SharedPool Size: 1,440M 1,440M Log Buffer: 14,256K
其它内存组件的大小
Session信息的统计
锁和事务情况
$ ksh showlock.sh
Current Locks
-------------
There are also 0 transaction locks
Blocking Session Details
Redo日志切换频率
表空间使用情况
常规检查,就不贴图了。
用户资源使用情况
查看数据库中用户资源的使用情况。常规检查就不贴图了。
近一周的数据库负载图表
针对两个不同时段的性能抖动进行分析。
第一个性能抖动最剧烈的时间段,是在8月8日凌晨
等待事件如下,可以看到主要的性能瓶颈在于IO
CPU资源都消耗在sql部分。
Top sql如下:
Elapsed CPU Elap per % Total
Time (s) Time (s) Executions Exec (s) DB Time SQL Id
---------- ---------- ---------------------- ------- -------------
1,856 31 288,077 0.0 18.9 57j9uu7c9681a
Module: JDBC Thin Client
SELECT * FROM TEST_CN_BIND WHERE CN=:1 AND CN_TYPE IN(1,2,3) AND ENABLED='Y'ORDER BY
CN_TYPE
1,659 75 1,352 1.2 16.9 acbdxf552ud62
update TEST_USER_BILLING set LOGIN_STATUS = 1 where UIN = :1
1,162 328 1 1162.1 11.8 b6usrg82hwsa3
Module: DBMS_SCHEDULER
call dbms_stats.gather_database_stats_job_proc ( )
172,774 1,352 127.8 1.4 75.33 1659.42 acbdxf552ud62
update USER_BILLING set LOGIN_STATUS = 1 where UIN = :1
性能问题分析:
IO问题
从Oracle的角度来看,IO瓶颈较高,针对目前的情况,没有更好的系统级改进建议
The throughput of the I/O subsystem wassignificantly lower than expected.
RECOMMENDATION 1: Host Configuration, 13% benefit (1258 seconds)
ACTION: Consider increasing the throughput of the I/O subsystem.
Oracle's recommended solution is to stripe all data file using the
SAME methodology. You might also need to increase the number of disks
for better performance. Alternatively, consider using Oracle's
Automatic Storage Management solution.
RATIONALE: During the analysis period, the average data files' I/O
throughput was 52 M persecond for reads and 2.1 M per second for
writes. The average response time for single block reads was 5.9
milliseconds.
后台自动job运行
call dbms_stats.gather_database_stats_job_proc ( )
后台job运行时,会根据条件进行统计信息的收集。
从Top sql来看,大表test_user_billing的查询acbdxf552ud62基于unique index scan,但是执行时间在1.4秒,主要的原因就是因为在执行期间同时在后台进行统计信息的收集。
从Oracle的建议可以看到其实做了一个全对象扫描,产生了大量的物理读。
ACTION: Run "Segment Advisor" onTABLE "ACC.USER_BILLING" with object id
51864.
RELEVANT OBJECT: database object with id 51864
ACTION: Investigate application logic involving I/O on TABLE
"xxxx.TEST_USER_BILLING" with object id 51864.
RELEVANT OBJECT: database object with id 51864
RATIONALE: The I/O usage statistics for the object are: 1 full object
scans, 11827830 physicalreads, 459490 physical writes and 0 direct
reads.
RATIONALE: The SQL statement with SQL_ID "acbdxf552ud62" spent
significant time waiting for User I/O on the hot object.
RELEVANT OBJECT: SQL statement with SQL_ID acbdxf552ud62
update TEST_USER_BILLING set LOGIN_STATUS = 1 where UIN = :1
RATIONALE: The SQL statement with SQL_ID "92a49umxy7q8m" spent
significant time waiting for UserI/O on the hot object.
RELEVANT OBJECT: SQL statement with SQL_ID 92a49umxy7q8m
select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring */ count(*),count("CARD_NO"),count(distinct
"CARD_NO"),count("MAC_VAL"),count(distinct"MAC_VAL") from
"ACC"."USER_BILLING" sample ( 9.1540402221) t
第二个性能抖动时间点的分析
第二个时间点的分析可以排除后台job的运行影响,主要的瓶颈还是在于IO
性能问题分析:
The throughput of the I/Osubsystem was significantly lower than expected.
RECOMMENDATION 1: Host Configuration, 30% benefit (2038 seconds)
ACTION: Consider increasing the throughput of the I/O subsystem.
Oracle's recommended solution is to stripe all data file using the
SAME methodology. You might also need to increase the number of disks
for better performance. Alternatively, consider using Oracle's
Automatic Storage Management solution.
RATIONALE: During the analysis period, the average data files' I/O
throughput was 1.8 M persecond for reads and 3.3 M per second for
writes. The average response time for single block reads was 14
milliseconds.
SYMPTOMS THAT LED TO THE FINDING:
SYMPTOM: Wait class "User I/O" was consuming significantdatabase time.
(93% impact [6405 seconds])
改进建议:
开启异步IO
目前系统中aio配置存在,但是没有启用
$ cat /proc/sys/fs/aio-nr
65536
$ cat/proc/sys/fs/aio-max-nr
65536
$ /usr/bin/ldd $ORACLE_HOME/bin/oracle | greplibaio
libaio.so.1 => /usr/lib64/libaio.so.1 (0x00002af9f4ad8000)
SQL> alter system setfilesystemio_options=setall scope=spfile;
后台Job的调度
需要进行确认是否可以重新选择一个低峰时间段来运行Job或者从后台禁用。按照时间频率进行统计信息的收集
SGA组件的调整
从内存组件的使用情况来看,shared pool的资源已经被buffer cache进行了压榨,可以适当调整一下shared pool的大小,比如设置为4G左右,目前仅为1G
内容根据情况看适度做了删减,可以看出来做一个数据库巡检的过程中其实还是需要花费不少的精力来分析问题,找到性能的瓶颈,这也是我们能够持续改进质量的基线。