一个客户的11.2的RAC环境,出现了一个节点DOWN掉的问题。






这个客户手工调整了系统的时间,使得两个节点的系统时间只相差几秒,而且还设置了11.2的自动时间同步功能。但是到了下午发现一个节点上的实例DOWN掉了,而且这时两个节点的时间相差了一个小时左右。


检查挂掉的节点上的数据库alert文件:


Mon May 10 15:22:03 2010


NOTE: ASMB terminating


Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:


ORA-15064: communication failure with ASM instance


ORA-03113: end-of-file on communication channel


Process ID:


Session ID: 82 Serial number: 53


Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:


ORA-15064: communication failure with ASM instance


ORA-03113: end-of-file on communication channel


Process ID:


Session ID: 82 Serial number: 53


ASMB (ospid: 8257792): terminating the instance due to error 15064


Instance terminated by ASMB, pid = 8257792


很显然导致问题的原因是连接不到ASM实例,这比如导致数据库实例的崩溃,再检查ASM实例的错误日志:


Mon May 10 15:00:13 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:03:35 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:06:56 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:10:17 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:13:39 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:17:00 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:20:21 2010


WARNING: Elapsed time did not advance (delta = 0),


WARNING: skipping disk repair timer updates for


WARNING: offline disks


Mon May 10 15:22:03 2010


NOTE: ASMB process exiting, either shutdown is in progress


NOTE: or foreground connected to ASMB was killed.


Mon May 10 15:22:03 2010


NOTE: client exited [4718636]


NOTE: force a map free for map id 2


Mon May 10 15:22:04 2010


PMON (ospid: 6816236): terminating the instance due to error 481


Instance terminated by PMON, pid = 6816236


ASM实例的被中止的原因是错误481,而Oracle对于这个错误的描述为:


ORA-00481: LMON process terminated with error


Cause: The global enqueue service monitor process died


Action: Warm start instance


看来问题和CLUSTER环境有关,而且ASM的告警日志中包含了大量的警告信息。


进一步检查CLUSTER上的相关信息:


2011-05-10 14:33:46.819


[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2011-05-10 14:35:01.116


[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459465


2011-05-10 14:35:01.144


[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .


2011-05-10 14:35:03.192


[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2011-05-10 14:36:18.384


[cssd(4391224)]CRS-1612:Network communication with node jzdbiufo (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.896 seconds


2011-05-10 14:36:26.422


[cssd(4391224)]CRS-1611:Network communication with node jzdbiufo (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.859 seconds


2011-05-10 14:36:30.452


[cssd(4391224)]CRS-1610:Network communication with node jzdbiufo (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.829 seconds


2011-05-10 14:36:33.285


[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459467


2011-05-10 14:36:33.312


[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .


2011-05-10 14:54:16.181


[crsd(6422586)]CRS-2765:Resource 'ora.net1.network' has failed on server 'jzdbnc'.


2011-05-10 14:54:18.539


[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"


2011-05-10 14:54:18.574


[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"


2011-05-10 14:56:13.440


[crsd(6422586)]CRS-2765:Resource 'ora.net1.network' has failed on server 'jzdbnc'.


2011-05-10 14:56:15.055


[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"


2011-05-10 14:56:16.179


[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"


单从节点1的CLUSTER的trace信息看,似乎是网络的问题导致了另一个节点被关闭,检查节点2上的信息:


2010-05-10 14:38:20.599


[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.


2010-05-10 14:38:20.599


[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.


2010-05-10 15:11:25.818


[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.


2010-05-10 15:11:25.818


[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.


2010-05-10 15:21:46.319


[cssd(4259964)]CRS-1612:Network communication with node jzdbnc (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.926 seconds


2010-05-10 15:21:54.322


[cssd(4259964)]CRS-1611:Network communication with node jzdbnc (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.923 seconds


2010-05-10 15:21:58.348


[cssd(4259964)]CRS-1610:Network communication with node jzdbnc (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.897 seconds


2010-05-10 15:22:01.253


[cssd(4259964)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:22:01.254


[cssd(4259964)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:22:01.339


[cssd(4259964)]CRS-1652:Starting clean up of CRSD resources.


2010-05-10 15:22:01.927


[cssd(4259964)]CRS-1608:This node was evicted by node 1, jzdbnc; details at (:CSSNM00005:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:22:03.111


[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:03.611


[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:03.618


[cssd(4259964)]CRS-1654:Clean up of CRSD resources finished successfully.


2010-05-10 15:22:03.631


[cssd(4259964)]CRS-1655:CSSD on node jzdbiufo detected a problem and started to shutdown.


2010-05-10 15:22:03.722


[/oracle/app/11.2.0/grid/bin/oraagent.bin(7340044)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:3:4} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_oracle/oraagent_oracle.log.


2010-05-10 15:22:03.722


[/oracle/app/11.2.0/grid/bin/orarootagent.bin(6488506)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:681} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/orarootagent_root/orarootagent_root.log.


2010-05-10 15:22:03.720


[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:9} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log.


2010-05-10 15:22:03.927


[ohasd(4456806)]CRS-2765:Resource 'ora.crsd' has failed on server 'jzdbiufo'.


2010-05-10 15:22:05.042


[crsd(8978838)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /oracle/app/11.2.0/grid/log/jzdbiufo/crsd/crsd.log.


2010-05-10 15:22:05.130


[ohasd(4456806)]CRS-2765:Resource 'ora.evmd' has failed on server 'jzdbiufo'.


2010-05-10 15:22:05.256


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:05.431


[ohasd(4456806)]CRS-2765:Resource 'ora.ctssd' has failed on server 'jzdbiufo'.


2010-05-10 15:22:05.447


[ohasd(4456806)]CRS-2765:Resource 'ora.asm' has failed on server 'jzdbiufo'.


2010-05-10 15:22:05.512


[ohasd(4456806)]CRS-2765:Resource 'ora.crsd' has failed on server 'jzdbiufo'.


2010-05-10 15:22:06.490


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:22:06.513


[ctssd(9765310)]CRS-2402:The Cluster Time Synchronization Service aborted on host jzdbiufo. Details at (:ctss_css_init1:) in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.


2010-05-10 15:22:06.520


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:06.550


[ohasd(4456806)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'jzdbiufo'.


2010-05-10 15:22:06.697


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:06.868


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:07.038


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:07.056


[ohasd(4456806)]CRS-2765:Resource 'ora.cssd' has failed on server 'jzdbiufo'.


2010-05-10 15:22:07.210


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:07.381


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:07.551


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:22:16.783


[cssd(6029520)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:22:23.271


[cssd(6029520)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:22:24.721


[cssd(6029520)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:22:31.678


[cssd(6029520)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:22:58.291


[cssd(6029520)]CRS-1612:Network communication with node jzdbnc (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.741 seconds


2010-05-10 15:23:06.292


[cssd(6029520)]CRS-1611:Network communication with node jzdbnc (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.740 seconds


2010-05-10 15:23:31.691


[cssd(6029520)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:23:32.432


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:23:32.904


[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"


2010-05-10 15:23:36.240


[cssd(7143634)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:23:38.862


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:23:42.191


[cssd(7143634)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:23:43.612


[cssd(7143634)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:23:47.928


[cssd(7143634)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:24:47.942


[cssd(7143634)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:24:52.471


[cssd(7143640)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:24:56.013


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:24:58.524


[cssd(7143640)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:24:59.942


[cssd(7143640)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:25:04.024


[cssd(7143640)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:26:04.037


[cssd(7143640)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:26:08.590


[cssd(9765338)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:26:10.161


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:26:14.713


[cssd(9765338)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:26:16.131


[cssd(9765338)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:26:21.233


[cssd(9765338)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:27:21.247


[cssd(9765338)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:27:27.281


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:27:31.316


[cssd(11403498)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:27:37.276


[cssd(11403498)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:27:38.714


[cssd(11403498)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:27:42.826


[cssd(11403498)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:28:42.840


[cssd(11403498)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:28:47.578


[cssd(6488556)]CRS-1713:CSSD daemon is started in clustered mode


2010-05-10 15:28:50.447


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:28:53.652


[cssd(6488556)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed


2010-05-10 15:28:55.086


[cssd(6488556)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.


2010-05-10 15:28:59.200


[cssd(6488556)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .


2010-05-10 15:29:59.214


[cssd(6488556)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log


2010-05-10 15:30:07.582


[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.


2010-05-10 15:30:09.668


[ohasd(4456806)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.


错误信息比较多,但是仔细观察不难发现,Oracle一直在进行时间同步的相关操作。


随后出现了网络问题,导致节点2从CLUSTER中被踢出,当节点2上的CRS等进程尝试重启时,又由于CLUSTER的同步服务错误而异常中止。


看来导致问题的原因和11.2的时间同步有关。


oracle视频教程请关注:http://u.youku.com/user_video/id_UMzAzMjkxMjE2.html