上午有客户遇到UNDO空间问题导致的latch: row cache objects (dc_rollback_segments ),经排查为节点2的UNDO表空间使用率100%,大部分为UNEXPIRED状态的UNDO段。虽然UNDO表空间的数据文件支持自动扩展,还是出现严重的等待,DML操作均很缓慢,通过人工扩大UNDO表空间后恢复正常;整个排查思路总体为:

1.反映数据库慢时,对当前活动会话、等待事件的排查,找出锁或异常等待事件对应的进程,再进一步分析(此步骤可以同步收集AWR,或通过ASH来确定问题开始时间)。

2.根据等待事件latch: row cache objects (dc_rollback_segments )进一步分析UNDO的配置,如undo表空间的不同状态段的大小分布、undo_retention及其它UNDO_TUNE相关参数的检查。

3.确认UNDO问题后扩容,问题恢复。

4.之后分析UND中存在大量UNEXPIRED状态段的原因。如何分析?DML肯定产生REDO日志或BLOCK CHANGE,因此可以通过查看REDO日志产生量(归档日志),BLOCK CHANGE来定位出某个时间点事务量大;再分析此时间段的AWR/ASH来排查大的事务操作(如查AWR的Segments by DB Blocks Changes中表)是人为或业务的等思路。

1.查询等待事件
SQL> select inst_id,event,count(1) from gv$session where wait_class#<> 6 group by inst_id,event order by 1,3;

INST_ID EVENT COUNT(1)
---------- -------------------------------------------------- ----------
1 enq: SV - contention 1
1 DFS lock handle 1
1 enq: TX - row lock contention 2
1 gc cr request 2
1 SQL*Net break/reset to client 3
1 SQL*Net message to client 6
2 SQL*Net message from dblink 1
2 DFS lock handle 1
2 CGS wait for IPC msg 1
2 enq: TX - row lock contention 1
2 enq: US - contention 1
2 SQL*Net break/reset to client 1
2 gc cr request 1
2 enq: SV - contention 1
2 latch free 2
2 SQL*Net message to client 2
2 latch: row cache objects 104

17 rows selected.

2.查询latch: row cache objects的P1值找出对应的LATCH NAME:dc_rollback_segments
SQL> select count(*),P1RAW from gv$session where event = 'latch: row cache objects' group by P1RAW;

COUNT(*) P1RAW
---------- ----------------
106 0000003A542DFA10

SQL> SELECT
2 kqrsttxt PARAMETER,
3 -- kqrstcid CACHE#,
4 kqrstcln "Lchild#",
5 kqrstgrq "DCGets",
6 l.gets "LGets",
7 l.misses "Misses"
8 FROM X$KQRST, V$LATCH_CHILDREN l
9 WHERE l.addr='&P1RAW'
10 and l.child#=KQRSTCLN
11 ORDER BY 1,2
12 ;
Enter value for p1raw: 0000003A542DFA10
old 9: WHERE l.addr='&P1RAW'
new 9: WHERE l.addr='0000003A542DFA10'

PARAMETER Lchild# DCGets LGets Misses
-------------------------------- ---------- ---------- ---------- ----------
dc_rollback_segments 1 765040834 4272819637 2783160470

SQL> select * from V$LATCHHOLDER;

PID SID LADDR NAME GETS
---------- ---------- ---------------- ---------------------------------------------------------------- ----------
39 7333 0000003A542DFA10 row cache objects 4293097299
1974 10193 0000003B3E90ACC8 cache buffers chains 70589673
2186 4908 0000003B3DD03800 cache buffers chains 36715147
2318 14665 0000003ACB82E6C8 cache buffers chains 97856295
2567 1372 0000003A542DFA10 row cache objects 4293097807

3.查UNDO配置
SQL> SQL> show parameter undo

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_gc_undo_affinity boolean FALSE
_highthreshold_undoretention integer 259200
_undo_autotune boolean FALSE
undo_management string AUTO
undo_retention integer 259200
undo_tablespace string UNDOTBS2
Tablespace_Name Size(GB) Status Used Extents US_SIZE(GB) Used R
--------------- --------- ---------- ------------ ----------- ------
UNDOTBS1 111.00 ACTIVE 211 .22 .19
UNDOTBS1 111.00 EXPIRED 50830 36.44 32.75
UNDOTBS1 111.00 UNEXPIRED 35975 31.1 27.96
UNDOTBS2 58.00 ACTIVE 239 .19 .32
UNDOTBS2 58.00 EXPIRED 12200 2 3.46
UNDOTBS2 58.00 UNEXPIRED 141196 52.18 90.11

4.查undo_retention 时间范围内具体哪个时间点的事务量大,从而找出对应的问题程序或SQL语句/表名等~~,排查思路见开头,SQL就省略了。