APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.1.0.7 to 11.2.0.3 [Release 11.1 to 11.2]


Information in this document applies to any platform.


SYMPTOMS

- RAC instances freeze during DRM for 100 seconds or more.

- The DB alert logs show that all RAC instances undergo reconfiguration at the same time, but there are no instance crashes:

Node 1 DB Alert Log

Sat Jul 14 14:17:04 2012
Reconfiguration started (old inc 70, new inc 72)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Jul 14 14:17:04 2012
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Jul 14 14:17:04 2012
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Sat Jul 14 14:17:13 2012
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:70 cur:72 errcnt:0

Node 2 DB Alert Log

Sat Jul 14 14:17:04 2012
Reconfiguration started (old inc 70, new inc 72)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Sat Jul 14 14:17:04 2012
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Jul 14 14:17:04 2012
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Jul 14 14:17:04 2012
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Sat Jul 14 14:18:03 2012
Submitted all GCS remote-cache requests

- The LMON trace shows that the DRM quiesce step hangs:

*** 2012-07-14 14:14:51.187
 CGS recovery timeout = 85 sec
Begin DRM(231) (swin 1)
* drm quiesce

*** 2012-07-14 14:17:03.752
* Request pseudo reconfig due to drm quiesce hang 
2012-07-14 14:17:03.752735 : kjfspseudorcfg: requested with reason 5(DRM Quiesce step stall)

*** 2012-07-14 14:17:03.766
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
 CSS recovery timeout = 31 sec (Total CSS waittime = 65)
 IMR Reconfig timeout = 75 sec
 CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 70 0.

 

- The top AWR wait events are "gcs resource directory to be unfrozen" and "gc remaster".
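
To confirm the same pattern in another LMON trace, the stall can be measured from the timestamps. The following is a minimal Python sketch, not an Oracle-supplied tool: the function name `quiesce_stalls` is hypothetical, and it assumes the trace uses the same `*** YYYY-MM-DD HH:MI:SS.mmm` timestamp headers and message texts as the LMON excerpt above. Because the timestamp header precedes the `* drm quiesce` line, the result is an approximation of how long the step was stalled.

```python
import re
from datetime import datetime

# Excerpt copied from the LMON trace shown in this note.
TRACE = """\
*** 2012-07-14 14:14:51.187
 CGS recovery timeout = 85 sec
Begin DRM(231) (swin 1)
* drm quiesce

*** 2012-07-14 14:17:03.752
* Request pseudo reconfig due to drm quiesce hang
"""

# Matches the '*** <date> <time>' header lines in the trace.
STAMP = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def quiesce_stalls(trace_text):
    """Return the seconds between each '* drm quiesce' line and the
    pseudo reconfig request that follows it."""
    stalls = []
    current = None      # most recent '***' timestamp seen
    quiesce_at = None   # timestamp in effect when '* drm quiesce' was logged
    for line in trace_text.splitlines():
        match = STAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        elif line.strip() == "* drm quiesce":
            quiesce_at = current
        elif "pseudo reconfig due to drm quiesce hang" in line:
            if quiesce_at is not None and current is not None:
                stalls.append((current - quiesce_at).total_seconds())
            quiesce_at = None
    return stalls


if __name__ == "__main__":
    for seconds in quiesce_stalls(TRACE):
        print(f"DRM quiesce stalled for about {seconds:.0f} seconds")
```

For the excerpt above, the interval between 14:14:51.187 and 14:17:03.752 is roughly 132 seconds, which is consistent with the freeze of 100 seconds or more described in the first symptom.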

CHANGES

The instances are configured with a large buffer cache.

CAUSE

This issue is caused by Bug 12879027.

DRM proceeds in a number of steps. During the DRM quiesce step, all ongoing block transfers for remastering are completed.
In this case, a hang occurred during the DRM quiesce step because an internal function hit a timeout.
This bug condition occurs when the buffer cache is very large.

The hang then triggers a pseudo reconfiguration to prevent the instance from being killed by another instance.
This is why the instances undergo a reconfiguration without restarting.

SOLUTION

Apply the fix for bug 12879027.

 

This issue is fixed in the following DB PSU patches:

      Patch 13923374 - 11.2.0.3.3 DB Patch Set Update (PSU)
      Patch 13923804 - 11.2.0.2.7 DB Patch Set Update (PSU)

The fix is also bundled in the corresponding GI PSUs.
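
Whether a given installation already includes the fix can be checked by comparing its PSU level with the minimum levels listed above. The following is a minimal Python sketch under stated assumptions: the function name `has_fix` is hypothetical, the five-part version string is supplied by the reader (for example, read from the installed patch inventory), and PSUs are assumed to be cumulative, so any later PSU on the same base release also contains the fix. Base releases not listed in this note return `None` rather than a guess.

```python
# Minimum PSU level carrying the fix for bug 12879027, per base release.
# Values are taken from the patch list in this note.
FIXED_IN = {
    (11, 2, 0, 3): 3,   # Patch 13923374 - 11.2.0.3.3 DB PSU
    (11, 2, 0, 2): 7,   # Patch 13923804 - 11.2.0.2.7 DB PSU
}


def has_fix(version):
    """Return True or False for a base release listed in this note,
    or None if the base release is not listed."""
    parts = tuple(int(p) for p in version.strip().split("."))
    if len(parts) != 5:
        raise ValueError("expected a five-part version such as 11.2.0.3.3")
    base, psu = parts[:4], parts[4]
    minimum = FIXED_IN.get(base)
    if minimum is None:
        return None          # this note does not say; do not guess
    return psu >= minimum    # assumption: PSUs are cumulative


if __name__ == "__main__":
    for v in ("11.2.0.3.2", "11.2.0.3.3", "11.2.0.2.8", "11.1.0.7.0"):
        print(v, has_fix(v))
```

A `None` result means the release is outside the two PSU lines named here, in which case the bug record itself is the place to check for an available fix.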

REFERENCES


NOTE:756671.1 - Oracle Recommended Patches -- Oracle Database

NOTE:390483.1 - DRM - Dynamic Resource management

BUG:12879027 - LMON PROCESS CAN GET STUCK IN DRM QUIESCE STEP TRIGGERING PSEUDO RECONFIGURATION