一个客户的11.2 RAC for Linux X86-64环境停电后出现了故障,RAC环境无法自动启动。






登录数据库服务器后,发现CLUSTER进程存在,于是尝试手工启动数据库:


$ sqlplus / as sysdba


SQL*Plus: Release 11.2.0.2.0 Production on Sun May 29 20:17:16 2011


Copyright (c) 1982, 2010, Oracle. All rights reserved.


Connected to an idle instance.


SQL> startup


ORA-01078: failure in processing system parameters


ORA-01565: error in identifying file '+DG_DATA/SPSP/spfileSPSP.ora'


ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/SPSP/spfileSPSP.ora


ORA-15056: additional error message


ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/spsp/spfilespsp.ora


ORA-15001: diskgroup "DG_DATA" does not exist or is not mounted


ORA-06512: at line 4


看来问题出在ASM磁盘组上,检查ASM磁盘组的告警日志:


NOTE: client SPSP2:SPSP registered, osid 14319, mbr 0x1


Sun May 29 16:58:31 2011


WARNING: Write Failed. group:1 disk:0 AU:12 offset:856064 size:4096


NOTE: unable to write any mirror side for diskgroup DG_ARCH


NOTE: cache initiating offline of disk 0 group DG_ARCH


NOTE: process 13597 initiating offline of disk 0.3915929597 (ARCH1) with mask 0x7e in group 1


Sun May 29 16:58:31 2011


WARNING: Disk ARCH1 in mode 0x7f is now being offlined


WARNING: Disk ARCH1 in mode 0x7f is now being taken offline


NOTE: initiating PST update: grp = 1, dsk = 0/0xe96857fd, mode = 0x15


GMON updating disk modes for group 1 at 21 for pid 15, osid 13597


ERROR: too many offline disks in PST (grp 1)


Sun May 29 16:58:31 2011


NOTE: cache dismounting (not clean) group 1/0x9C08A705 (DG_ARCH)


WARNING: Disk ARCH1 in mode 0x7f offline aborted


WARNING: Offline for disk ARCH1 in mode 0x7f failed.


Sun May 29 16:58:31 2011


NOTE: halting all I/Os to diskgroup 1 (DG_ARCH)


NOTE: unable to offline disks after getting write error for diskgroup DG_ARCH


NOTE: disk 0 had IO error


NOTE: LGWR doing non-clean dismount of group 1 (DG_ARCH)


NOTE: LGWR sync ABA=237.2255 last written ABA 237.2256


Sun May 29 16:58:32 2011


ERROR: ORA-15130 in COD recovery for diskgroup 1/0x9c08a705 (DG_ARCH)


ERROR: ORA-15130 thrown in RBAL for group number 1


Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_rbal_13603.trc:


ORA-15130: diskgroup "DG_ARCH" is being dismounted


WARNING: Read Failed. group:1 disk:0 AU:1 offset:4096 size:4096


WARNING: Read Failed. group:1 disk:0 AU:1 offset:0 size:4096


ERROR: no PST quorum in group: required 1, found 0


ERROR: Could not heartbeat PST for grp 1. Force dismounting the disk group.


WARNING: Write Failed. group:2 disk:2 AU:1 offset:1044480 size:4096


WARNING: Write Failed. group:2 disk:1 AU:1 offset:1044480 size:4096


WARNING: Write Failed. group:2 disk:0 AU:1 offset:1044480 size:4096


WARNING: disk 0.3915929598 (VD1) not responding to heart beat


WARNING: disk 1.3915929599 (VD2) not responding to heart beat


ERROR: too many offline disks in PST (grp 2)


WARNING: disk 2.3915929600 (VD3) not responding to heart beat


ERROR: too many offline disks in PST (grp 2)


Sun May 29 16:58:32 2011


Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_b001_14488.trc:


ORA-15130: diskgroup "" is being dismounted


WARNING: Write Failed. group:3 disk:0 AU:1 offset:1044480 size:4096


WARNING: disk 0.3915929601 (DATA1) not responding to heart beat


ERROR: too many offline disks in PST (grp 3)


WARNING: Write Failed. group:4 disk:0 AU:1 offset:1044480 size:4096


WARNING: disk 0.3915929603 (FLSH1) not responding to heart beat


ERROR: too many offline disks in PST (grp 4)


WARNING: Write Failed. group:5 disk:0 AU:1 offset:1044480 size:4096


WARNING: disk 0.3915929604 (REDO1) not responding to heart beat


ERROR: too many offline disks in PST (grp 5)


Sun May 29 16:58:32 2011


NOTE: process 14490 initiating offline of disk 0.3915929598 (VD1) with mask 0x7e in group 2


NOTE: process 14490 initiating offline of disk 1.3915929599 (VD2) with mask 0x7e in group 2


NOTE: process 14490 initiating offline of disk 2.3915929600 (VD3) with mask 0x7e in group 2


NOTE: checking PST: grp = 2


.


.


.


freeing rdom 4


WARNING: dirty detached from domain 4


NOTE: cache dismounted group 4/0x9C28A708 (DG_FLSH)


SQL> alter diskgroup DG_FLSH dismount force /* ASM SERVER */


NOTE: cache deleting context for group DG_FLSH 4/0x9c28a708


GMON dismounting group 4 at 30 for pid 38, osid 14494


NOTE: Disk in mode 0x8 marked for de-assignment


SUCCESS: diskgroup DG_FLSH was dismounted


SUCCESS: alter diskgroup DG_FLSH dismount force /* ASM SERVER */


ERROR: PST-initiated MANDATORY DISMOUNT of group DG_FLSH


NOTE: diskgroup resource ora.DG_FLSH.dg is offline


List of instances:


2


Dirty detach reconfiguration started (new ddet inc 1, cluster inc 2)


Global Resource Directory partially frozen for dirty detach


* dirty detach - domain 5 invalid = TRUE


19 GCS resources traversed, 0 cancelled


Dirty Detach Reconfiguration complete


freeing rdom 5


WARNING: dirty detached from domain 5


NOTE: cache dismounted group 5/0x9C38A709 (DG_REDO)


SQL> alter diskgroup DG_REDO dismount force /* ASM SERVER */


NOTE: cache deleting context for group DG_REDO 5/0x9c38a709


GMON dismounting group 5 at 31 for pid 39, osid 14496


NOTE: Disk in mode 0x8 marked for de-assignment


SUCCESS: diskgroup DG_REDO was dismounted


SUCCESS: alter diskgroup DG_REDO dismount force /* ASM SERVER */


ERROR: PST-initiated MANDATORY DISMOUNT of group DG_REDO


NOTE: diskgroup resource ora.DG_REDO.dg is offline


Sun May 29 16:58:36 2011


Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13637.trc:


ORA-15078: ASM diskgroup was forcibly dismounted


NOTE: client exited [13627]


Sun May 29 16:58:40 2011


Shutting down instance (abort)


License high water mark = 15


USER (ospid: 13624): terminating the instance


Instance terminated by USER, pid = 13624


Sun May 29 16:58:41 2011


Instance shutdown complete


根据错误信息,ASM在尝试写磁盘组的时候出现了IO错误。


进一步检查CLUSTER的日志信息:


2011-05-29 00:00:19.569


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:19.570


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:19.678


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:22.584


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:22.584


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:22.684


[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 00:00:24.818


[/oracle/product/11g/grid/bin/orarootagent.bin(13818)]CRS-5822:Agent '/oracle/product/11g/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/orarootagent_root/orarootagent_root.log.


2011-05-29 00:00:24.818


[/oracle/product/11g/grid/bin/oraagent.bin(13815)]CRS-5822:Agent '/oracle/product/11g/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:8} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/oraagent_grid/oraagent_grid.log.


2011-05-29 00:00:25.244


[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 00:00:26.626


[crsd(14838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 00:00:27.278


[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 00:00:28.620


[crsd(14855)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 00:00:29.306


[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 00:00:30.664


[crsd(14868)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 00:00:31.333


[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 00:00:32.692


[crsd(14881)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 00:00:33.367


[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


.


.


.


2011-05-29 20:14:32.155


[crsd(14562)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 20:14:32.822


[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 20:14:34.195


[crsd(14575)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.


2011-05-29 20:14:34.849


[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.


2011-05-29 20:14:34.849


[ohasd(12719)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.


2011-05-29 20:15:43.700


[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 99960 milliseconds


2011-05-29 20:15:43.701


[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 99960 milliseconds


2011-05-29 20:15:44.702


[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 99100 milliseconds


2011-05-29 20:16:33.800


[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 49870 milliseconds


2011-05-29 20:16:33.801


[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 49870 milliseconds


2011-05-29 20:16:34.802


[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 49000 milliseconds


2011-05-29 20:17:03.861


[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 19950 milliseconds


2011-05-29 20:17:03.861


[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 19810 milliseconds


2011-05-29 20:17:03.861


[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 19810 milliseconds


2011-05-29 20:17:23.900


[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD1; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 20:17:23.901


[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD2; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 20:17:23.901


[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD3; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.


2011-05-29 20:17:23.901


[cssd(13251)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log


2011-05-29 20:17:23.901


[cssd(13251)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log


2011-05-29 20:17:23.952


[cssd(13251)]CRS-1652:Starting clean up of CRSD resources.


2011-05-29 20:17:24.133


[cssd(13251)]CRS-1653:The clean up of the CRSD resources failed.


根据错误信息,开始的时候,所有的磁盘组都可以加载到MOUNT状态,但是随后Oracle在尝试写VOT所在磁盘组时,出现了错误,导致OCR进程退出,而其他的磁盘组也因为错误陆续被OFFLINE。


而且由于无法获取VOT磁盘,很快节点被踢出CLUSTER环境,导致主机重启。


最后检查了cssd/ocssd.log信息:


2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(DAALL_DB), updateseq(5), status(0), sendresp(1)


2011-05-29 05:46:20.919: [ SKGFD][1076803904]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op ioerror error I/O Error)


2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmTestSetLastGrockUpdate: grock(GR+GCR1), updateseq(4) msgseq(5), lastupdt<0xb585af0>, ignoreseq(0)


2011-05-29 05:46:20.919: [ CSSD][1076803904](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 17 of ORCL:VD1


2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDeadProc: proc 0xb23bd10


2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDestroyProc: cleaning up proc(0xb23bd10) con(0x4d0b) skgpid 14228 ospid 14228 with 0 clients, refcount 0


2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDiscEndpcl: gipcDestroy 0x4d0b


2011-05-29 05:46:20.920: [ CSSD][1076803904]clssnmvDiskAvailabilityChange: voting file ORCL:VD1 now offline


2011-05-29 05:46:20.920: [ SKGFD][1097120064]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab01aa5b0 for disk :ORCL:VD1:


2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmRemoveMember: grock GR+GCR1, member number 3 (0x2aaab03886d0) node number 1 state 0x0 grock type 2


2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(1), incarn 5, mbrc 3, to member 1, events 0x280, state 0x0


2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmRPCDone: rpc 0x2aaaaaec7060 (RPC#98) state 6, flags 0x100


2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmDelMemCmpl: rpc 0x2aaaaaec7060, ret 0, client (nil) member 0x2aaab03886d0


2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmFreeRPCIndex: freeing rpc 98


2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(GR+GCR1), updateseq(5), status(0), sendresp(1)


2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmDeadProc: proc 0x2aaab03997f0


2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmFenceClient: fencing client (0x2aaab0399960), member 0 in group DGSPSP0, no share, death fence 1, SAGE fence 0


.


.


.


2011-05-29 20:44:54.778: [ CRSMAIN][1088713120] Checking the OCR device


2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Sync-up with OCR


2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Connecting to the CSS Daemon


2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Getting local node number


2011-05-29 20:44:54.779: [ CRSMAIN][1105021248] Policy Engine is not initialized yet!


2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Initializing OCR


[ CLWAL][1088713120]clsw_Initialize: OLR initlevel [70000]


2011-05-29 20:44:55.099: [ OCRASM][1088713120]proprasmo: Error in open/create file in dg [DG_CRS]


[ OCRASM][1088713120]SLOS : SLOS: cat=8, pn=kgfoOpenFile01, dep=15056, loc=kgfokge


2011-05-29 20:44:55.099: [ OCRASM][1088713120]ASM Error Stack :


2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: kgfoCheckMount returned [6]


2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: The ASM disk group DG_CRS is not found or not mounted


2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: Failed to open [+DG_CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.


2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: No OCR/OLR devices are usable


2011-05-29 20:44:55.127: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL


2011-05-29 20:44:55.127: [ GIPC][1088713120] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 687], original from [clsss.c : 5325]


2011-05-29 20:44:55.129: [ default][1088713120]clsvactversion:4: Retrieving Active Version from local storage.


2011-05-29 20:44:55.130: [ CSSCLNT][1088713120]clssgsgrppubdata: group (ocr_Oracle-cluster) not found


2011-05-29 20:44:55.130: [ OCRRAW][1088713120]proprio_repairconf: Failed to retrieve the group public data. CSS ret code [20]


2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprioo: Failed to auto repair the OCR configuration.


2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprinit: Could not open raw device


2011-05-29 20:44:55.133: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL


2011-05-29 20:44:55.134: [ OCRAPI][1088713120]a_init:16!: Backend init unsuccessful : [26]


2011-05-29 20:44:55.134: [ CRSOCR][1088713120] OCR context init failure. Error: PROC-26: Error while accessing the physical storage


2011-05-29 20:44:55.134: [ CRSOCR][1088713120][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)


2011-05-29 20:44:55.134: [ CRSD][1088713120][PANIC] CRSD Exiting. OCR Failed


2011-05-29 20:44:55.134: [ CRSD][1088713120] Done.


基于错误信息的判断,问题和存储层直接相关,如果不是磁盘损坏,就是多路径设置异常,导致问题的原因是cluster在读写VOT和OCR的ASM是出现了错误,问题应该发生在组成这个ASM磁盘组的裸设备上。


最终客户通过重建OCR和VOT解决了这个错误。


oracle视频教程请关注:http://u.youku.com/user_video/id_UMzAzMjkxMjE2.html