环境
OS:SunOS hostname 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-V490
DB: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production

一个JOB做MVIEW refresh做了十多个小时,还没做完,这个JOB定义:每隔20分钟去涮新一次,发现现在JOB状态为broken了,显然上次refresh完全在20分钟内完成;先查一下alert,发现日志切换太频繁了,2到3分钟就切换一下,online redo才50M(也不知道那个家伙这么设置的),这个显示太小了,把它扩展到500M,性能并没有什么好转.

[ @more @]

查下JOB的具体名为内容:
dbms_refresh.refresh('"SCHEM_NAME"."REFGRP"');
再根据refresh group查一下dba_refresh_children,返回441个对象,这样的话一次refresh涮新441个对象,如果每个对象更新的数据量多,而中途中断,这个事务是会非常大的,回滚将会是很耗时间的。
查看一下等待事件,发现大量的:“wait for a undo record”和"free buffer waits":
根据free buffer waits去看一下操作对象都是undo段。查一下v$session_longops,还没完成的事务OPNAME为transaction rollback segment,而对应的session正在跑那个JOB,这个很可能是refresh出现异常,致使整个事务回滚。
根据“wait for a undo record”,发现UNDO表空间已经扩搌到30GB.
这个JOB不是第一次跑,按以往的规律来看,正常情况下是能完成refresh的.会不会被什么阻塞了,查下发现,竟然SMON还能被正在跑的那个JOB阻塞了,查一下SMON当前等待事件,是“wait for stopper event to be increased”,这个事件比较陌生,以前根本就没有遇到过。 奇怪的是:竟然SMON还能被job阻塞?根据这事件查一下metalink,发出一篇很相关的文档:464246.1

Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction (either by killing the shadow process or aborting the database) then database seems to hang, or smon and parallel query servers taking all the available cpu.

In fast-start parallel rollback, the background process Smon acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes. Fast start parallel rollback is mainly useful when a system has transactions that run a long time before comitting, especially parallel Inserts, Updates, Deletes operations. When Smon discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.

There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are interfering with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem. The parallel rollback slave processes are most likely contending for the same resource, which results in even worse rollback performance compared to a serial rollback.

解决的办法是:fast_start_parallel_rollback = false,该参数默认的是:LOW。

在不重新启动数据库的情况下:发出"alter system set fast_start_parallel_rollback = false scope=memory"后,所有的session都"enq: PE - contention".等了很长的时间,未能修改成功,只能重新启动数据库,同时把这个参数设置成false。

从整个事件来看,是由于一个大的事务回滚,回滚量相当大,当达到一个阀值,SMON就会用并发进程来处理,但文档上没提到这个阀值是多少。整个事件,参与的人比较多,对数据的操作肯定比较乱,我推断:MVIEW refresh做了很长的时间,业务人员很有可能中途强行中断,然后重新涮新,那之前的refresh得回滚,但这个恢复工作量是比较大的,导致并发进程来恢复,而这些进程由于互相干扰,结果整个恢复比单进程恢复还慢!