数据库的巡检是DBA工作中的一部分,有时候我们还是希望能够在巡检的基础上发现一些潜在的问题,把尽可能多的问题解决在初始阶段。

今天来给大家举一个数据库巡检和性能分析的例子。

首先拿到一个数据库服务器,了解系统信息是必要的,同时还要分析数据库的信息,然后尽可能发现是否存在性能瓶颈,然后需要做一个对比的分析。

 

系统信息

$ cat /etc/issue

Red Hat Enterprise Linux Server release 5.3(Tikanga)

Kernel \r on an \m

$ ksh cpuinfo.sh

**************************************

CPU Physical NO: 2

CPU Processor NO: 16

CPU Core NO: cpu cores : 4

CPU model name : Intel(R) Xeon(R) CPU E5620@ 2.40GHz

**************************************

top - 10:39:48 up 389 days, 2:28, 1 user, load average: 0.91, 0.91,0.80

Tasks: 1370 total, 1 running, 1363 sleeping, 0 stopped, 6 zombie

Cpu(s): 1.2%us, 0.2%sy, 0.0%ni, 96.8%id, 1.6%wa, 0.0%hi, 0.2%si, 0.0%st

Mem: 65996212k total, 65820480k used, 175732k free, 530412k buffers

Swap: 16779884k total, 236k used, 16779648k free, 17410172kcached

Hugepage已经启用了。

[oracle@acc136 bdump]$ cat /proc/meminfo | grep -i page

AnonPages: 4783576 kB

PageTables: 359020 kB

HugePages_Total: 20525

HugePages_Free: 60

HugePages_Rsvd: 16

Hugepagesize: 2048 kB

数据库级信息

数据库是10gR2,2014年启动至今

 

内存组件的使用情况

Cache Sizes

~~~~~~~~~~~ Begin End

--------------------

BufferCache: 39,472M 39,472M Std Block Size: 8K

SharedPool Size: 1,440M 1,440M Log Buffer: 14,256K

其它内存组件的大小

记一次数据库的分析和优化建议(r6笔记第24天)_其他
Session信息的统计记一次数据库的分析和优化建议(r6笔记第24天)_其他_02

记一次数据库的分析和优化建议(r6笔记第24天)_其他_03

锁和事务情况

$ ksh showlock.sh

Current Locks

-------------

There are also 0 transaction locks

 

Blocking Session Details

Redo日志切换频率

记一次数据库的分析和优化建议(r6笔记第24天)_其他_04

表空间使用情况

常规检查,就不贴图了。

用户资源使用情况

查看数据库中用户资源的使用情况。常规检查就不贴图了。

近一周的数据库负载图表

记一次数据库的分析和优化建议(r6笔记第24天)_其他_05

针对两个不同时段的性能抖动进行分析。

第一个性能抖动最剧烈的时间段,是在88日凌晨

等待事件如下,可以看到主要的性能瓶颈在于IO

记一次数据库的分析和优化建议(r6笔记第24天)_其他_06

CPU资源都消耗在sql部分。

记一次数据库的分析和优化建议(r6笔记第24天)_其他_07

记一次数据库的分析和优化建议(r6笔记第24天)_其他_08

Top sql如下:

Elapsed CPU Elap per % Total

Time (s) Time (s) Executions Exec (s) DB Time SQL Id

---------- ---------- ---------------------- ------- -------------

1,856 31 288,077 0.0 18.9 57j9uu7c9681a

Module: JDBC Thin Client

SELECT * FROM TEST_CN_BIND WHERE CN=:1 AND CN_TYPE IN(1,2,3) AND ENABLED='Y'ORDER BY

CN_TYPE

1,659 75 1,352 1.2 16.9 acbdxf552ud62

update TEST_USER_BILLING set LOGIN_STATUS = 1 where UIN = :1

1,162 328 1 1162.1 11.8 b6usrg82hwsa3

Module: DBMS_SCHEDULER

call dbms_stats.gather_database_stats_job_proc ( )

172,774 1,352 127.8 1.4 75.33 1659.42 acbdxf552ud62

update USER_BILLING set LOGIN_STATUS = 1 where UIN = :1

性能问题分析:

IO问题

Oracle的角度来看,IO瓶颈较高,针对目前的情况,没有更好的系统级改进建议

The throughput of the I/O subsystem wassignificantly lower than expected.

RECOMMENDATION 1: Host Configuration, 13% benefit (1258 seconds)

ACTION: Consider increasing the throughput of the I/O subsystem.

Oracle's recommended solution is to stripe all data file using the

SAME methodology. You might also need to increase the number of disks

for better performance. Alternatively, consider using Oracle's

Automatic Storage Management solution.

RATIONALE: During the analysis period, the average data files' I/O

throughput was 52 M persecond for reads and 2.1 M per second for

writes. The average response time for single block reads was 5.9

milliseconds.

后台自动job运行

call dbms_stats.gather_database_stats_job_proc ( )

后台job运行时,会根据条件进行统计信息的收集。

Top sql来看,大表test_user_billing的查询acbdxf552ud62基于unique index scan,但是执行时间在1.4秒,主要的原因就是因为在执行期间同时在后台进行统计信息的收集。

记一次数据库的分析和优化建议(r6笔记第24天)_其他_09

Oracle的建议可以看到其实做了一个全对象扫描,产生了大量的物理读。

ACTION: Run "Segment Advisor" onTABLE "ACC.USER_BILLING" with object id

51864.

RELEVANT OBJECT: database object with id 51864

ACTION: Investigate application logic involving I/O on TABLE

"xxxx.TEST_USER_BILLING" with object id 51864.

RELEVANT OBJECT: database object with id 51864

RATIONALE: The I/O usage statistics for the object are: 1 full object

scans, 11827830 physicalreads, 459490 physical writes and 0 direct

reads.

RATIONALE: The SQL statement with SQL_ID "acbdxf552ud62" spent

significant time waiting for User I/O on the hot object.

RELEVANT OBJECT: SQL statement with SQL_ID acbdxf552ud62

update TEST_USER_BILLING set LOGIN_STATUS = 1 where UIN = :1

RATIONALE: The SQL statement with SQL_ID "92a49umxy7q8m" spent

significant time waiting for UserI/O on the hot object.

RELEVANT OBJECT: SQL statement with SQL_ID 92a49umxy7q8m

select /*+ no_parallel(t) no_parallel_index(t) dbms_stats

cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)

no_monitoring */ count(*),count("CARD_NO"),count(distinct

"CARD_NO"),count("MAC_VAL"),count(distinct"MAC_VAL") from

"ACC"."USER_BILLING" sample ( 9.1540402221) t

第二个性能抖动时间点的分析

第二个时间点的分析可以排除后台job的运行影响,主要的瓶颈还是在于IO

记一次数据库的分析和优化建议(r6笔记第24天)_其他_10

记一次数据库的分析和优化建议(r6笔记第24天)_其他_11

性能问题分析:

The throughput of the I/Osubsystem was significantly lower than expected.

 

RECOMMENDATION 1: Host Configuration, 30% benefit (2038 seconds)

ACTION: Consider increasing the throughput of the I/O subsystem.

Oracle's recommended solution is to stripe all data file using the

SAME methodology. You might also need to increase the number of disks

for better performance. Alternatively, consider using Oracle's

Automatic Storage Management solution.

RATIONALE: During the analysis period, the average data files' I/O

throughput was 1.8 M persecond for reads and 3.3 M per second for

writes. The average response time for single block reads was 14

milliseconds.

 

SYMPTOMS THAT LED TO THE FINDING:

SYMPTOM: Wait class "User I/O" was consuming significantdatabase time.

(93% impact [6405 seconds])

改进建议:

开启异步IO

目前系统中aio配置存在,但是没有启用

$ cat /proc/sys/fs/aio-nr

65536

$ cat/proc/sys/fs/aio-max-nr

65536

$ /usr/bin/ldd $ORACLE_HOME/bin/oracle | greplibaio

libaio.so.1 => /usr/lib64/libaio.so.1 (0x00002af9f4ad8000)

 

SQL> alter system setfilesystemio_options=setall scope=spfile;

后台Job的调度

需要进行确认是否可以重新选择一个低峰时间段来运行Job或者从后台禁用。按照时间频率进行统计信息的收集

SGA组件的调整

从内存组件的使用情况来看,shared pool的资源已经被buffer cache进行了压榨,可以适当调整一下shared pool的大小,比如设置为4G左右,目前仅为1G

内容根据情况看适度做了删减,可以看出来做一个数据库巡检的过程中其实还是需要花费不少的精力来分析问题,找到性能的瓶颈,这也是我们能够持续改进质量的基线。