Symptom: After a node went down, it could not be restarted. Clusterware only came up on its own after the heartbeat (interconnect) NIC had been unplugged and re-plugged several times. The preliminary diagnosis is that an unexplained HAIP failure prevented CRS from starting on one node.

1 Check the network

[grid@gmdb1 trace]$ oifcfg iflist -p -n
bond0  22.1.32.0    UNKNOWN  255.255.254.0
bond1  1.255.255.0  UNKNOWN  255.255.255.0
bond1  169.254.0.0  UNKNOWN  255.255.0.0

2 Check CRS

[root@gmdb2 tmp]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

3 Check the init resources: ASM and HAIP fail to start

[root@gmdb2 tmp]# crsctl stat res -t -init
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
Cluster Resources
ora.asm
      1        ONLINE  OFFLINE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE

4 Check multicast with mcasttest.pl: no problems found

[grid@gmdb2 mcasttest]$ perl mcasttest.pl -n gmdb2,gmdb1 -i bond0,bond1
########### Setup for node gmdb2 ##########
Checking node access 'gmdb2'
Checking node login 'gmdb2'
Checking/Creating Directory /tmp/mcasttest for binary on node 'gmdb2'
Distributing mcast2 binary to node 'gmdb2'
########### Setup for node gmdb1 ##########
Checking node access 'gmdb1'
Checking node login 'gmdb1'
Checking/Creating Directory /tmp/mcasttest for binary on node 'gmdb1'
Distributing mcast2 binary to node 'gmdb1'
########### testing Multicast on all nodes ##########

Test for Multicast address 230.0.1.0

11月 28 16:42:02 | Multicast Succeeded for bond0 using address 230.0.1.0:42000
11月 28 16:42:03 | Multicast Succeeded for bond1 using address 230.0.1.0:42001

Test for Multicast address 224.0.0.251

11月 28 16:42:04 | Multicast Succeeded for bond0 using address 224.0.0.251:42002
11月 28 16:42:05 | Multicast Succeeded for bond1 using address 224.0.0.251:42003

5 Check CSSD.LOG

2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: set curtime (1040905644) for my node
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: scanning 32 nodes
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: Node gmdb1, number 1, is in an existing cluster with disk state 3
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2017-11-28 11:48:02.808: [ CSSD][2358462208]clssnmvDHBValidateNcopy: node 1, gmdb1, has a disk HB, but no network HB, DHB has rcfg 405549564, wrtcnt, 39931581, LATS 1040905654, lastSeqNo 39931578, uniqueness 1510056501, timestamp 1511840882/1783220964
2017-11-28 11:48:03.287: [ CSSD][2144298752]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0

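2017-11-28 11:48:03.782: [ CSSD][2363209472]clssnmvDHBValidateNcopy: node 1, gmdb1, has a disk HB, but no network HB, DHB has rcfg 405549564, wrtcnt, 39931583, LATS 1040906624,

The log contains a large number of these "has a disk HB, but no network HB" records. Check which interconnect the database is using: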

SQL> select * from v$cluster_interconnects;

NAME    IP_ADDRESS      IS_  SOURCE
eth1:1  169.254.134.65  NO

The interconnect is running over HAIP, but the local HAIP cannot start, so CSSD fails to come up. Check the dependency relationship with CSSD:

[root@12crac2 ~]# crsctl stat res ora.cluster_interconnect.haip -init -f
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
STATE=OFFLINE
TARGET=ONLINE
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:grid:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CARDINALITY_ID=0
CHECK_INTERVAL=30
CREATION_SEED=15
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=0
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
ID=ora.cluster_interconnect.haip
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)

Temporary workaround: once the heartbeat network itself has been checked, disable HAIP:

crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)', STOP_DEPENDENCIES='hard(intermediate:ora.cssd)' " -init

After the change, check again:

Related articles:
On MOS: Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip (Doc ID 1640865.1)
HAIP-related bugs on MOS
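
The CSSD log check in step 5 was done by eye. As a rough aid, the following Python sketch counts the "has a disk HB, but no network HB" records per node. The regular expression is derived only from the two log lines quoted above, and the function name count_missing_network_hb and the sample list are illustrative assumptions, not part of any Oracle tool.

```python
import re
from collections import Counter

# Pattern derived from the ocssd log lines quoted in step 5 (an assumption:
# other Grid Infrastructure versions may word this message differently).
NO_NET_HB = re.compile(
    r"clssnmvDHBValidateNcopy: node (\d+), (\S+), has a disk HB, but no network HB"
)


def count_missing_network_hb(lines):
    """Count 'disk HB, but no network HB' records per (node number, node name)."""
    counts = Counter()
    for line in lines:
        match = NO_NET_HB.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Sample input: shortened copies of lines from the log excerpt in step 5.
sample = [
    "2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: scanning 32 nodes",
    "2017-11-28 11:48:02.808: [ CSSD][2358462208]clssnmvDHBValidateNcopy: node 1, gmdb1, "
    "has a disk HB, but no network HB, DHB has rcfg 405549564",
    "2017-11-28 11:48:03.782: [ CSSD][2363209472]clssnmvDHBValidateNcopy: node 1, gmdb1, "
    "has a disk HB, but no network HB, DHB has rcfg 405549564",
]

for (number, name), hits in count_missing_network_hb(sample).most_common():
    print(f"node {number} ({name}): {hits} records without network HB")
```

To run it against a real log, pass an open file object instead of the sample list; the function only iterates over lines, so it does not load the whole file into memory.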