本案例的产品环境是AIX 5.2操作系统,数据库版本为 Oracle 9.2.0.7。在一次常规的表分析后,系统出现异常,主机CPU 平均使用率达到了70%,极大影响了系统的使用与服务的提供。

以前系统的CPU使用率从未如此之高,而分析系统的变化,仅仅是执行过一个例行的表分析脚本,使用同样脚本做表分析从没有发生过这样的事情,该脚本被认为是安全的。

根据定期收集的Statspack数据生成报告,发现数据库的具体体现为CPU Time很高,这与主机体现吻合,Latch free紧跟CPU Time列在第二位,这也是一个高CPU消耗的数据库事件(采样周期为15分钟):

Top 5 Timed Events

~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time(s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time 4,598 60.61
latch free 183,374 1,905 25.11
log file sync 131,171 345 4.55
db file sequential read 30,745 295 3.88
direct path write 126,462 123 1.62

-------------------------------------------------------------

在随后的Top SQL中发现一条异常的SQL(了解一个稳定系统中常见SQL的性能非常重要),其中Hash Value=2011675572,这条SQL执行一个单表访问,每次执行的逻辑读大约在316次左右:

SQL ordered by Gets forDB:

Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value

--------------- -------------------------- ------ -------- --------- ----------
9,219,425 29,158 316.2 15.9 304.47 425.922011675572

Module: InSiteXMLServer.exe

SELECT MfgLineEntry.MfgLineEntryIdFROM mesw.MfgLineEntry WHERE

MfgLineEntry.MfgLineId = :"SYS_B_0"ORDER BY MfgLineEntry.Priority

从全局来看,这个SQL的逻辑读数量不算太高,但是根据经验以及对于数据库的了解,我们知道这个SQL的执行可能出现了问题。通过V$SQL_PLAN查看这个SQL的执行计划,发现数据库使用了全表扫描来执行这个SQL:

SQL> select operation,options,object_name,optimizer

2 from v$sql_plan wherehash_value='2011675572';

OPERATION OPTIONS OBJECT_NAME OPTIMIZER
----------------- ----------- --------------- ----------
SELECT STATEMENT CHOOSE

SORT ORDERBY

TABLE ACCESS FULL MFGLINEENTRY ANALYZED

格式化一下前面的SQL便于分析:

SELECT mfglineentry.mfglineentryid
FROM mesw.mfglineentry WHEREmfglineentry.mfglineid = :"SYS_B_0"
ORDER BY mfglineentry.priority;

查询引用了MFGLINEID字段,而表MFGLINEENTRY上实际上有一个MFGLINEID和PRIORITY列的组合索引,在查询中,这个索引没有被用到:

SQL> select dbms_metadata.get_ddl('INDEX','MFGLINEENTRY2')from dual;

DBMS_METADATA.GET_DDL('INDEX',

--------------------------------------------------------------------------------

CREATE INDEX "MESW"."MFGLINEENTRY2"ON "MESW"."MFGLINEENTRY" ("MFGLINEID", "PRIORITY")

PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING COMPUTESTATISTICS

STORAGE(INITIAL 1048576 NEXT 1048576 MINEXTENTS1 MAXEXTENTS 2147483645

PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOLDEFAULT)

TABLESPACE"MESI_8K_1M" PARALLEL 8

这个表记录数在1万行左右,使用极其频繁。与开发人员联系,得知正常情况下这个SQL返回最多100行左右,按理说不应该走全表扫描;使用autotrace查看返回57行时的执行计划及逻辑读如下:

SQL> SELECTmfglineentry.mfglineentryid

2 FROM mfglineentry

3 WHERE mfglineentry.mfglineid = '480013800006c7f2'

4 ORDER BY mfglineentry.priority;

57 rows selected.

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=27 Card=56 Bytes=2128)

1 0 SORT (ORDER BY) (Cost=27Card=56 Bytes=2128)

2 1 TABLE ACCESS (FULL) OF'MFGLINEENTRY' (Cost=21 Card=56 Bytes=2128)

Statistics

----------------------------------------------------------

0 recursivecalls
0 db block gets
315 consistentgets
0 physical reads
0 redo size
1434 bytes sent via SQL*Net to client
298 bytesreceived via SQL*Net from client
5 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
57 rowsprocessed

发现查询在返回57行的时候也是全表扫描,逻辑读315次(这和statspack中的结果一致),为了分析原因,我们首先尝试使用hints强制使用索引MFGLINEENTRY2,看看SQL的执行效率:

SQL> SELECT/*+INDEX(mfglineentry, MFGLINEENTRY2 )*/ mfglineentry.mfglineentryid

2 FROM mfglineentry

3 WHERE mfglineentry.mfglineid = '480013800006c7f2'

4 ORDERBY mfglineentry.priority;

57 rows selected.

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=35 Card=56 Bytes=2128)

1 0 TABLE ACCESS (BY INDEX ROWID)OF 'MFGLINEENTRY' (Cost=35 Card=56 Bytes=2128)

2 1 INDEX (RANGE SCAN) OF'MFGLINEENTRY2' (NON-UNIQUE) (Cost=2 Card=56)

Statistics

----------------------------------------------------------

0 recursivecalls
0 db block gets
51 consistentgets

0 physical reads
0 redo size
1466 bytessent via SQL*Net to client
298 bytesreceived via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts(disk)
57 rowsprocessed

此时的逻辑读只有51次,在效率上比全表扫描要好;那么为什么CBO选择了TABLE ACCESS FULL呢?

查看一下USER_TABLES中记录的表统计数据:

SQL> SELECT BLOCKS,NUM_ROWSFROM USER_TABLES WHERE TABLE_NAME='MFGLINEENTRY';

BLOCKS NUM_ROWS
---------- ----------
312 10649

SQL> SELECT CLUSTERING_FACTORFROM USER_INDEXES WHERE INDEX_NAME='MFGLINEENTRY2';

CLUSTERING_FACTOR

-----------------

5758

从这里可以看出由于MFGLINEENTRY2的CLUSTERING_FACTOR的值很大,导致CBO计算通过索引返表查询的成本过高,因而使用了TABLE ACCESSFULL。

在测试环境中人为修改了CLUSTERING_FACTOR证实了这个想法:

SQL> EXEC DBMS_STATS.set_index_stats(ownname=> 'MESW',indname => 'MFGLINEENTRY2',clstfct => 1000);

PL/SQLprocedure successfully completed

SQL>set autotrace traceonly

SQL>SELECT mfglineentry.mfglineentryid

2 FROM mfglineentry

3 WHERE mfglineentry.mfglineid= '480013800006c7f2'

4 ORDER BY mfglineentry.priority

5 /

57rows selected.

ExecutionPlan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=35 Card=56 Bytes=2128)

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'MFGLINEENTRY'(Cost=35 Card=56 Bytes=2128)

2 1 INDEX (RANGE SCAN) OF 'MFGLINEENTRY2' (NON-UNIQUE)(Cost=2 Card=56)

Statistics

----------------------------------------------------------

0 recursive calls
0 db block gets
51 consistentgets
0 physical reads
0 redo size
1466 bytes sent via SQL*Net to client
298 bytes receivedvia SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
57 rows processed

联系开发人员得知,前一段时间这张表加载了一些比较随机的数据,导致CLUSTERING_FACTOR变大,最近的表分析使统计信息发生了变化,所以执行计划变成了全表扫描。

为安全起见,这个表增加了索引:

SQL> selectdbms_metadata.get_ddl('INDEX','MFGLINEENTRY3') from dual;

DBMS_METADATA.GET_DDL('INDEX',

--------------------------------------------------------------------------------

CREATE INDEX"MESW"."MFGLINEENTRY3" ON"MESW"."MFGLINEENTRY"

("MFGLINEID", "PRIORITY","MFGLINEENTRYID")

PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGINGCOMPUTE STATISTICS

STORAGE(INITIAL 1048576 NEXT 1048576MINEXTENTS 1 MAXEXTENTS 2147483645

PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1BUFFER_POOL DEFAULT)

TABLESPACE "MESI_8K_1M" PARALLEL 8

使这个SQL的执行计划变为不返表不排序的MFGLINEENTRY3 INDEXRANGE SCAN来减少逻辑读。

SQL> SELECTmfglineentry.mfglineentryid

2 FROM mfglineentry

3 WHEREmfglineentry.mfglineid = '480013800006c7f2'

4 ORDERBY mfglineentry.priority

5 /

57 rows selected.

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=56 Bytes=2128)

1 0 INDEX (RANGE SCAN) OF'MFGLINEENTRY3' (NON-UNIQUE) (Cost=3 Card=56 Bytes=2128)

Statistics

----------------------------------------------------------

0 recursivecalls
0 db block gets
9 consistent gets
0 physical reads
0 redo size
1434 bytessent via SQL*Net to client
298 bytesreceived via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
57 rowsprocessed

建立索引后CPU使用率下降有限,继续看statspack 发现。

Instance EfficiencyPercentages (Target 100%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Buffer Nowait %: 99.98 Redo NoWait %: 100.00
Buffer Hit %: 99.27 In-memorySort %: 100.00
Library Hit %: 96.42 Soft Parse %: 87.50
Executeto Parse %: 81.83 LatchHit %: 99.46
Parse CPU to Parse Elapsd %: 33.37 % Non-Parse CPU: 89.61

软解析率很低,只有87.50%,说明数据库中存在过多的硬解析。

Latch Sleepbreakdown for DB: MESWP Instance:meswp Snaps: 2512 -2513

-> ordered bymisses desc

Get Spin &

Latch Name Requests Misses Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
library cache 11,487,159 109,504 24,235 85521/23736/242/5/0
session allocation 9,506,763 99,905 1,459 98447/1457/1/0/0
enqueues 5,947,923 42,175 2,062 40114/2060/1/0/0
shared pool 4,667,481 34,604 892 33716/884/4/0/0
row cache objects 6,959,017 33,322 52 33270/52/0/0/0
row cache enqueue latch 5,801,806 29,754 25 29729/25/0/0/0

查看latch部分的内容,Library cache latch 竞争很严重,同时在SQL ordered by Version Count部分,出现了大量SQL的version count,并且这些SQL都是根据主键的等值查询。

文章来源:《OracleDBA手记1》第17章聚簇因子、柱状图与执行计划 作者:段林仲

配图来源:http://www.itsne.com/image_upload/179693981.jpg