故障信息:接到用户电话反应某MES系统业务反映慢,登陆数据库所在主机(Windows+Oracle11.2.0.1),发现cpu使用率一直100%,登陆数据库查询等检查操作时也比较缓慢。 

总的问题处理思路: 

1.查当前等待事件 

2.分析AWR中TOP等待cursor: mutex S 等待事件,Time Model Statistics及SQL ordered by Version Count 

3.查cursor: mutex S 的P1值对应SQL HASH_VALUE找出相应SQL语句是insert into sys.aud$ 

4.紧急处理:刷共享池后CPU使用率下降到正常范围

 5.监听日志分析,未出现连接的会话明显的增长

 6.查异常SQL不能共享原因 

7.后续处理措施:停机修改 audit_trail=NONE 

8.从MOS上找到对应的BUG信息论证此问题 

详细处理过程:

 1.远程上去看了下,任务管理器能看到的CPU使用率一直100%,其它内存磁盘等使用正常;登陆数据库查看等待事件主要是cursor: mutex S;又检查切了下归档、做检查点/alert日志等看看均正常,马上17点了,收集了一小时AWR,ASH收的时候报错就没拿到; 

2.AWR 分析如下:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 25907 08-11月-17 16:00:42 355 1.1
End Snap: 25908 08-11月-17 17:00:25 413 1.1
Elapsed: 59.71 (mins)
DB Time: 18,476.37 (mins)

--DB Time一看就异常了,继续往下: Cache Sizes/Load Profile/内存使用均未现异常,TOP事件处就异常了:

Top 5 Timed Foreground Events
Event Waits Time(s) Avg wait (ms) % DB time Wait Class
cursor: mutex S 24,415,300 657,967 27 59.35 Concurrency
library cache lock 37,388 281,185 7521 25.36 Concurrency
DB CPU 146,580 13.22
library cache: mutex X 14,609 28,642 1961 2.58 Concurrency
row cache lock 2 3,618 1809234 0.33 Concurrency
Host CPU (CPUs: 40 Cores: 20 Sockets: 2)

Load Average Begin Load Average End %User %System %WIO %Idle
99.3 0.7 0.0
Instance CPU

%Total CPU %Busy CPU %DB time waiting for CPU (Resource Manager)
101.9 101.9 0.0

--可以发现主要等待事件cursor: mutex占了60%,等待次数非常多,library cache lock占25%,继续查看Time Model Statistics部分,

Statistic Name Time (s) % of DB Time
connection management call elapsed time 1,088,499.04 98.19
parse time elapsed 1,084,908.88 97.86
DB CPU 146,580.38 13.22
sql execute elapsed time 19,866.00 1.79

--可以发现主要的时间消耗在连接管理、解析,可是看前面的load profile,平均每秒登陆数量是0.1.不太正常。继续查看sql部分:

lapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
3,612.86 0 0.33 99.94 0.00 9rj2zcg8nrrx8 Server.exe select distinct (a.work_order)...
3,595.63 7 513.66 0.32 82.91 0.01 cavs50wyzrvqr DECLARE job BINARY_INTEGER := ...
2,084.83 2,527 0.83 0.19 83.46 0.00 drjw3ky61mnwj SELECT COUNT(1), (SELECT SP.OP...

SQL ordered by Version Count
Only Statements with Version Count greater than 20 are displayed
Version Count Executions SQL Id SQL Module SQL Text
2,420 4vs91dcv7u1p6 ** SQL Text Not Available ** ---->>>>这个SQL ID非常高
53 61 d3u55hdjw35ua APService.exe select * from (select m.pdline...
47 61 48hcnqyfb32cf APService.exe select h.pdline_name, m.work_o...

--SQL执行部分没大异常,但是 Version Count就异常了,但是没有SQL 文本。继续查找…… 

3.排查TOP事件cursor: mutex S事件及Version Count异常的SQL_ID  4vs91dcv7u1p6 

对于cursor: mutex S事件和P1/P2/P3值,MOS文档:WAITEVENT: "cursor: pin S" Reference Note (文档 ID 1310764.1)上有写, Versions 10.2.0.1 - 11.2.0.2 P1 = idn P2 = value P3 = where (where|sleeps in 10.2) idn Mutex identifier (gives hash_value of SQL) This is the mutex identifier value which matches to the HASH_VALUE of the SQL statement that we are waiting to get the mutex on. 简单来说就是P1的值就是blocker进程的SQL HASH_VALUE

SQL> select * from (
2 select p1, sql_id,event,count(*),(ratio_to_report(count(*)) over ())*100 pct
3 from V$ACTIVE_SESSION_HISTORY
4 where event like '%mutex%'
5 and sample_time > sysdate-1/12
6 group by p1, sql_id,event
7 order by count(*) desc)
8 where rownum <= 10;

P1 SQL_ID EVENT COUNT(*) PCT
---------- -------------------------- ------------------------- ---------- ----------
914163366 cursor: mutex S 141945 97.6224536 ---->>>主要在这里,占97.6%
1512608110 library cache: mutex X 3452 2.37410765
37230 library cache: mutex X 5 .003438742

可以发现最近两小时主要等待事件cursor: mutex S被捕获到V$ACTIVE_SESSION_HISTORY中141945次,对应的P1是914163366(查最近2小时的有143424次,也就是问题主要出现在最近一小时)。根据P1值对应的SQL HASH_VALUE来查出导致此等待的SQL语句及进程:

SQL> SELECT sql_id, sql_text, version_count FROM V$SQLAREA where HASH_VALUE=914163366;

SQL_ID
--------------------------
SQL_TEXT
------------------------------------------------------
VERSION_COUNT
-------------
4vs91dcv7u1p6
insert into sys.aud$( sessionid,entryid,statement,ntimestamp#, userid,userhost,terminal,action#,returncode, obj$creator,
obj$name,auth$privileges,auth$grantee, new$owner,new$name,ses$actions,ses$tid,lo
goff$pread, logoff$lwrite,logoff$dead,comment$text,spare1,spare2, priv$used,clientid,sessioncpu,proxy$sid,user$guid, in
stance#,process#,xid,scn,auditid, sqlbind,sqltext,obj$edition,dbid) values(:1,
:2,:3,SYS_EXTRACT_UTC(SYSTIMESTAMP), :4,:5,:6,:7,:8, :9,:10,:11,:12, :13,:14,:15,:16,:17, :18,:19,:20,:2
1,:22, :23,:24,:25,:26,:27, :28,:29,:30,:31,:32, :33,:34,:35,:36)
5546

--可以发现出问题的是insert into sys.aud$,此语句有version_count高达5546个,也验证了AWR 中SQL Version Count中SQL的问题。 

4.紧急处理办法:alter system flush shared_pool; 刷共享池几分钟后CPU使用率恢复正常。当然这里首先是对这个业务我有一些了解正常时候业务压力不算太大,其次当时系统cpu100%业务也进行不了,所以果断刷了共享池,高负载多业务并存的系统慎用。

 5.监听日志分析, 分析了当天及出现问题的1小时的监听日志,未出现连接会话明显的增长 使用语句:

fgrep "08-11" listener.log |fgrep "establish" |awk '{print $1 " " $2}' |awk -F: '{print $1 }' |sort |uniq -c
$ fgrep " 16:" listener2.log |fgrep "establish" |awk '{print $1 " " $2}'|awk -F: '{print $1 ":" $2 }' |sort |uniq -c</span>

6.后续处理措施:停机修改 audit_trail=NONE,关闭审计。 --当前版本是11.2.0.1,还建议升级(当然近期做不了),升级到11.2.0.3就可以修复后面提到与此类似的BUG 11936699。 查看当前的配置是使用了默认的审计级别audit_trail=DB:当晚又协调了停机时间,将audit_trail关闭,audit_trail=none;

7.查异常SQL不能共享原因 在紧急处理alter system flush shared_pool;几分钟后,系统CPU使用率已经恢复正常,后续又查询4vs91dcv7u1p6 insert into sys.aud$的多版本情况,发现又有了10几个version,从v$sql_shared_cursor查询游标失效原因是optimizer_mismatch,未找到是什么原因引起.

SQL> select sql_id,child_number,optimizer_mismatch,sql_type_mismatch from 
v$sql_shared_cursor where sql_id='4vs91dcv7u1p6' and rownum<100;

SQL_ID CHILD_NUMBER OP SQ
-------------------------- ------------ -- --
4vs91dcv7u1p6 1 Y N
4vs91dcv7u1p6 2 Y N
4vs91dcv7u1p6 3 N N
4vs91dcv7u1p6 86 N N

8.从MOS上找到对应的BUG信息论证此问题关于游标不能共享原因的分析: 从MOS上以aud$ cursor: mutex S关键字搜索,有文档:

Database Hangs With Excessive Cursor Mutex S Waits Due to Sys.Aud$ Cursors (文档 ID 1423386.1)
里面描述的症状是由于Bug 11936699 Abstract: WAIT TIME OF LIBRARY CACHE LOCK INCREASES DUE TO MANY CHILD CURSORS OF SYS.AUD$,
导致数据库出现如下症状:
Database hangs with excessive waits for 'cursor mutex S'
AWR report shows:
Top wait event: 'cursor: mutex S' and 'library cache lock'
high version count for an insert into sys.aud$

BUG中涉及的数据库版本、SQL语句也与本次相同,不同之处是本次数据库未完全hang住,还是能登陆的,只是操作时反应比较慢。 由于当时没有相应的errorstack Trace来核对 call stack,只能说与此BUG比较类似,不能完全确认。 当然此BUG的后续处理里介绍除了打补丁,也是修改audit_trail=DB -> audit_trail=OS; 我直接简单粗爆的关闭:audit_trail=DB -> audit_trail=NONE,过了将近一周复查一下未再出现high version count的insert into aud$ SQL语句。 参考文档:

Database Hangs With Excessive Cursor Mutex S Waits Due to Sys.Aud$ Cursors (文档 ID 1423386.1)
Bug 11936699 : WAIT TIME OF LIBRARY CACHE LOCK INCREASES DUE TO MANY CHILD CURSORS OF SYS.AUD$
常见问题:'cursor:mutex ..'/ 'cursor:pin ..'/ 'library cache:mutex ..'类型的等待事件 (文档 ID 1525791.1)
Troubleshooting: Waits for Mutex Type Events (文档 ID 1377998.1)
WAITEVENT: "cursor: pin S" Reference Note (文档 ID 1310764.1)