The following is another typical analysis of a RAC node's GRID software being restarted because of a network heartbeat problem. (With the 11gR2 Rebootless Restart feature, the default behavior is to restart GRID without rebooting the host; in this incident, however, the private interconnect IP address had been taken by another device on the network, GRID was restarted several times, and one of those shutdowns did not complete cleanly, which triggered a host reboot.) It is the fault analysis report written for the customer at the time and is provided here for reference.
I. Service Overview
Node 2 of the RAC cluster supporting a customer's CRM system rebooted abnormally. After receiving the fault report, ** engineers responded promptly and, through an in-depth analysis of the relevant logs and other information, compiled this document.
II. Analysis of the Abnormal Reboot of Database Cluster Node 2
The cluster logs show that the network heartbeat between the two cluster nodes became abnormal starting at 2019-03-08 15:30:22, and that at 2019-03-08 15:30:36.797 node 2 was evicted from the cluster by the split-brain resolution mechanism. The relevant cluster logs are shown below:
1. Logs showing node 2 being evicted from the cluster
1.1 Log information from cluster node 1
Node 1 cluster alert log:
2019-03-08 15:30:22.278:
[cssd(5543)]CRS-1612:Network communication with node swcrm2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.516 seconds
2019-03-08 15:30:30.282:
[cssd(5543)]CRS-1611:Network communication with node swcrm2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.512 seconds
2019-03-08 15:30:34.284:
[cssd(5543)]CRS-1610:Network communication with node swcrm2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.510 seconds
2019-03-08 15:30:36.797:
[cssd(5543)]CRS-1607:Node swcrm2 is being evicted in cluster incarnation 433013533; details at (:CSSNM00007:) in /oracle/11.2.0/grid/log/crm1/cssd/ocssd.log.
1.2 Log information from cluster node 2
Node 2 cluster alert log:
2018-09-27 01:50:21.772:
[crsd(10018)]CRS-2772:Server 'swcrm1' has been assigned to pool 'ora.crm'.
2019-03-08 15:30:30.568:
[cssd(9624)]CRS-1612:Network communication with node swcrm1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.018 seconds
2019-03-08 15:30:37.130:
[cssd(9624)]CRS-1608:This node was evicted by node 1, swcrm1; details at (:CSSNM00005:) in /oracle/11.2.0/grid/log/crm2/cssd/ocssd.log.
2019-03-08 15:30:37.131:
[cssd(9624)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/11.2.0/grid/log/crm2/cssd/ocssd.log
2019-03-08 15:30:37.132:
[cssd(9624)]CRS-1652:Starting clean up of CRSD resources..
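The countdown values in the CRS-1612/1611/1610 messages above follow directly from the CSS misscount, which is 30 seconds here (the node 1 CSSD log below shows misstime(30000)): the warnings are raised once 50%, 75% and 90% of misscount has elapsed, leaving roughly 15, 7.5 and 3 seconds before eviction, and the slightly smaller values in the log are simply due to the polling interval. A minimal worked check of that arithmetic in Python:

# Worked check of the eviction countdown in the CRS-1612/1611/1610 messages,
# using the misscount shown in the CSSD log (misstime(30000) -> 30 seconds).
MISSCOUNT_S = 30.0

for pct, logged_remaining in [(50, 14.516), (75, 6.512), (90, 2.510)]:
    expected_remaining = MISSCOUNT_S * (1 - pct / 100)   # 15.0, 7.5, 3.0 seconds
    print(f"{pct}% of misscount elapsed: ~{expected_remaining:.1f}s left expected, "
          f"{logged_remaining}s reported in the log")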
1.3 Detailed information from the node 1 CSSD process
Node 1 CSSD process log:
2019-03-08 15:30:34.284: [ CSSD][39]clssnmPollingThread: node swcrm2 (2) at 90% heartbeat fatal, removal in 2.510 seconds
…………
2019-03-08 15:30:36.796: [ CSSD][41]clssnmDoSyncUpdate: Terminating node 2, swcrm2, misstime(30000) state(5)
2019-03-08 15:30:36.796: [ CSSD][41]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckDskInfo: Checking disk info...
2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckSplit: Node 2, swcrm2, is alive, DHB (1552030236, 4056057127) more than disk timeout of 27000 after the last NHB (1552030206, 4056027274)
2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckDskInfo: My cohort: 1
2019-03-08 15:30:36.796: [ CSSD][41]clssnmRemove: Start
2019-03-08 15:30:36.796: [ CSSD][41](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, swcrm2, from the cluster in incarnation 433013533, node birth incarnation 433013530, death incarnation 433013533, stateflags 0x264000 uniqueness value 1537978854
2019-03-08 15:30:36.797: [ CSSD][1]clssgmQueueGrockEvent: groupName(IGCRMSYS$USERS)
…………
2019-03-08 15:30:44.814: [ CSSD][42]clssnmUpdateNodeState: node swcrm1, number 1, current state 3, proposed state 3, current unique 1537984150, proposed unique 1537984150, prevConuni 0, birth 433013532
2019-03-08 15:30:44.814: [ CSSD][42]clssnmUpdateNodeState: node swcrm2, number 2, current state 5, proposed state 0, current unique 1537978854, proposed unique 1537978854, prevConuni 1537978854, birth 433013530
2019-03-08 15:30:44.814: [ CSSD][42]clssnmDeactivateNode: node 2, state 5
2019-03-08 15:30:44.814: [ CSSD][42]clssnmDeactivateNode: node 2 (swcrm2) left cluster
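The clssnmCheckSplit and clssnmCheckDskInfo lines above show the split-brain resolution at work: node 1 can see from the voting disk that node 2 is still alive (its disk heartbeat, DHB, is current) even though no network heartbeat (NHB) has been received within the timeout, so the cluster has split into two single-node cohorts. The usual CSS tie-break then applies: the larger cohort survives, and when the cohorts are the same size the cohort containing the lowest-numbered node survives, which is why node 1 ("My cohort: 1") evicts node 2. A minimal illustration of that tie-break rule (not Oracle's implementation):

# Illustration of the classic CSS split-brain tie-break: the larger cohort
# survives; on a tie, the cohort containing the lowest-numbered node survives.
def surviving_cohort(cohorts):
    """cohorts: list of sets of node numbers; returns the cohort that survives."""
    return max(cohorts, key=lambda c: (len(c), -min(c)))

# The situation in this incident: two single-node cohorts, {1} and {2}.
print(surviving_cohort([{1}, {2}]))   # -> {1}: node 1 stays, node 2 is evicted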
1.4 Detailed information from the node 2 CSSD process
Node 2 CSSD process log:
At 2019-03-08 15:30:30.569 the log shows that the network heartbeat with node 1 is abnormal while the disk heartbeat is still normal: "node 1, swcrm1, has a disk HB, but no network HB".
2019-03-08 15:30:29.812: [ CSSD][40]clssnmSendingThread: sending status msg to all nodes
2019-03-08 15:30:29.812: [ CSSD][40]clssnmSendingThread: sent 4 status msgs to all nodes
2019-03-08 15:30:29.815: [ CSSD][5]clssgmpcBuildNodeList: nodename for node 0 is NULL
2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: node swcrm1 (1) at 50% heartbeat fatal, removal in 14.018 seconds
2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: node swcrm1 (1) is impending reconfig, flag 2229260, misstime 15982
2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2019-03-08 15:30:30.569: [ CSSD][29]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763125, LATS 4056051052, lastSeqNo 50762918, uniqueness 1537984150, timestamp 1552030230/4135610688
2019-03-08 15:30:30.569: [ CSSD][32]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763126, LATS 4056051052, lastSeqNo 50763087, uniqueness 1537984150, timestamp 1552030230/4135611038
2019-03-08 15:30:30.570: [ CSSD][35]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763127, LATS 4056051052, lastSeqNo 50762920, uniqueness 1537984150, timestamp 1552030230/4135611056
2019-03-08 15:30:30.621: [ CSSD][30]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030230/4056051104
2019-03-08 15:30:30.635: [ CSSD][33]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030230/4056051118
2019-03-08 15:30:30.907: [ CSSD][5]clssscSelect: cookie accept request 100a27af0
…………
2019-03-08 15:30:37.131: [ CSSD][28](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node swcrm1, number 1, sync 433013533, stamp 4135617359
2019-03-08 15:30:37.131: [ CSSD][28]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps
2019-03-08 15:30:37.131: [ CSSD][30]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030237/4056057614
2019-03-08 15:30:37.131: [ CSSD][28]###################################
2019-03-08 15:30:37.131: [ CSSD][28]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2019-03-08 15:30:37.131: [ CSSD][28]###################################
2019-03-08 15:30:37.131: [ CSSD][28](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2019-03-08 15:30:37.131: [ CSSD][28]clssnmSendMeltdownStatus: node swcrm2, number 2, has experienced a failure in thread number 10 and is shutting down
2019-03-08 15:30:37.132: [ CSSD][28]clssscExit: Starting CRSD cleanup
………… (cleanup of CRSD resources begins)
2019-03-08 15:30:40.767: [ CSSD][28]clssscExit: CRSD cleanup status 0
2019-03-08 15:30:40.768: [ CSSD][28]clssscExit: CRSD cleanup successfully completed
2. Analysis of the operating system reboot on node 2
About the rebootless restart feature of 11.2 clusterware:
Starting with clusterware version 11.2.0.2, Oracle's rebootless restart feature makes the Grid Infrastructure (GI) restart the cluster management software, rather than reboot the node, when any of the following occurs:
1. A node misses network heartbeats continuously for longer than misscount.
2. A node cannot access a majority of the voting files (VF).
3. A member kill is escalated to a node kill.
In earlier versions, the cluster management software (CRS) would reboot the node directly in these situations.
Before restarting the cluster stack, GRID first performs a graceful shutdown of the clusterware; the basic steps are as follows (a simplified sketch of this flow is given below):
1. Stop all heartbeats on the local node (network heartbeat, disk heartbeat, and local heartbeat).
2. Notify the cssd agent that ocssd.bin is about to stop.
3. Stop all I/O-capable processes registered with CSS, for example lmon.
4. cssd notifies crsd to stop all resources; if crsd cannot stop all resources successfully, a node reboot will still occur.
5. cssd waits for all I/O-capable processes to exit; if they do not all exit within the short I/O timeout, a node reboot will still occur.
6. Notify the cssd agent that all I/O-capable processes have exited.
7. ohasd restarts the cluster stack.
8. The local node notifies the other nodes to perform cluster reconfiguration.
In special cases, for example when ocssd.bin itself has a problem (such as a hang) or the restart is caused by operating system performance issues, rebootless restart cannot take effect, because in those situations ocssd.bin is no longer working properly and a node reboot remains unavoidable.
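The decision flow described above can be summarized in the following minimal sketch. It only illustrates the logic of this section and is not Oracle code: every function name and the timeout value are hypothetical stand-ins.

# Minimal sketch of the rebootless-restart decision flow described above.
# Not Oracle code: all names and the timeout value are hypothetical stand-ins.

SHORT_IO_TIMEOUT_SEC = 27   # placeholder for the "short I/O timeout"; not an Oracle-documented constant

def stop_heartbeats():           print("1. stop network/disk/local heartbeats")
def notify_cssd_agent(msg):      print("cssd agent notified:", msg)        # steps 2 and 6
def stop_io_capable_clients():   print("3. stop I/O-capable clients, e.g. lmon")
def crsd_cleanup_resources():    return True   # step 4: True = "CRSD cleanup status 0"
def wait_io_clients_exit(sec):   return True   # step 5: all clients exited within `sec`
def ohasd_restart_stack():       print("7. ohasd restarts the GI stack")
def notify_remote_nodes():       print("8. remote nodes start cluster reconfiguration")
def reboot_node(reason):         print("HOST REBOOT:", reason)

def rebootless_restart():
    stop_heartbeats()                                      # step 1
    notify_cssd_agent("ocssd.bin is about to stop")        # step 2
    stop_io_capable_clients()                              # step 3
    if not crsd_cleanup_resources():                       # step 4 failed
        return reboot_node("CRSD cleanup failed")          #   -> the host is rebooted anyway
    if not wait_io_clients_exit(SHORT_IO_TIMEOUT_SEC):     # step 5 timed out
        return reboot_node("I/O-capable processes did not exit in time")
    notify_cssd_agent("all I/O-capable processes exited")  # step 6
    ohasd_restart_stack()                                  # step 7: the host stays up
    notify_remote_nodes()                                  # step 8

rebootless_restart()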
The database cluster in this case runs version 11.2.0.4.0, so when a node is evicted because of lost network heartbeats, normally only the cluster software should be restarted and the operating system should not be rebooted. Analysis of the node 2 CSSD process log shows that after node 2 was evicted at 15:30 on 2019-03-08, the GRID software on node 2 was started another four times, and a host reboot took place before the last of these starts. The detailed log analysis is as follows.
2.1 CSSD process start records
2019-03-08 15:30:48.991: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552030248
=====>>>>> No host reboot occurred before this start
2019-03-08 16:06:37.698: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552032397
=====>>>>> No host reboot occurred before this start
2019-03-08 16:17:27.769: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552033047
=====>>>>> No host reboot occurred before this start
2019-03-08 16:27:26.269: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552033646
=====>>>>> A host reboot occurred before this start
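The start records above can be reproduced by scanning ocssd.log for the "Starting CSS daemon" marker. A minimal sketch, assuming the node 2 log path reported earlier in the cluster alert log:

import re

# Minimal sketch: count GI (CSS daemon) starts by scanning ocssd.log for the
# "clssscmain: Starting CSS daemon" marker shown above.
LOG = "/oracle/11.2.0/grid/log/crm2/cssd/ocssd.log"
pattern = re.compile(r"^(\S+ \S+): .*clssscmain: Starting CSS daemon")

starts = []
with open(LOG, errors="replace") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            starts.append(m.group(1))

print(len(starts), "CSS daemon start(s) found:")
for ts in starts:
    print("   ", ts)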
2.2 Shutdown log information before each GRID restart:
The first clusterware shutdown on node 2 completed normally:
2019-03-08 15:30:40.767: [ CSSD][28]clssscExit: CRSD cleanup status 0
2019-03-08 15:30:40.768: [ CSSD][28]clssscExit: CRSD cleanup successfully completed
The second clusterware shutdown on node 2 completed normally:
2019-03-08 16:06:32.101: [ CSSD][28]clssscExit: CRSD cleanup status 0
2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking RPC Q
2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking dead client Q
2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking dead proc Q
2019-03-08 16:06:32.102: [ CSSD][28]clssscExit: CRSD cleanup successfully completed
The third clusterware shutdown on node 2 did not complete normally.
During this shutdown the cleanup of CRSD resources failed, which is why the host was rebooted.
2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup status 184
2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup failed with 184
The details are as follows:
2019-03-08 16:20:29.296: [ CSSD][28]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps
2019-03-08 16:20:29.296: [ CSSD][28]###################################
2019-03-08 16:20:29.296: [ CSSD][28]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2019-03-08 16:20:29.296: [ CSSD][28]###################################
2019-03-08 16:20:29.296: [ CSSD][28](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2019-03-08 16:20:29.296: [ CSSD][28]clssnmSendMeltdownStatus: node swcrm2, number 2, has experienced a failure in thread number 10 and is shutting down
2019-03-08 16:20:29.296: [ CSSD][34](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node swcrm1, number 1, sync 433013537, stamp 4138609451
2019-03-08 16:20:29.296: [ CSSD][34]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps
2019-03-08 16:20:29.296: [ CSSD][34]clssscExit: abort already set 1
2019-03-08 16:20:29.296: [ CSSD][28]clssscExit: Starting CRSD cleanup
^^^^^^
2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup status 184
2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup failed with 184
2019-03-08 16:27:26.268: [ CSSD][1]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
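Each graceful-shutdown attempt can likewise be classified as clean or failed from its "CRSD cleanup status" line: status 0 means the cleanup succeeded and a rebootless restart is possible, while any other value, such as the 184 above, means the cleanup failed and a host reboot is to be expected. A minimal sketch, again assuming the node 2 log path:

import re

# Minimal sketch: classify each graceful-shutdown attempt in ocssd.log by its
# "clssscExit: CRSD cleanup status <n>" line (0 = clean, anything else = failed).
LOG = "/oracle/11.2.0/grid/log/crm2/cssd/ocssd.log"
pattern = re.compile(r"^(\S+ \S+): .*clssscExit: CRSD cleanup status (\d+)")

with open(LOG, errors="replace") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            ts, status = m.group(1), int(m.group(2))
            verdict = "clean shutdown" if status == 0 else "cleanup FAILED -> host reboot expected"
            print(ts, "status", status, "->", verdict)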
III. Summary and Follow-up Recommendations
At 2019-03-08 15:30:36, node 2 of this database cluster was evicted from the cluster because the network heartbeat between the nodes had been lost. Node 2 then restarted its cluster software automatically and repeatedly tried to rejoin the cluster at 2019-03-08 15:30:48, 2019-03-08 16:06:37, and 2019-03-08 16:17:27, but these attempts failed because the network heartbeat was still abnormal.
Each time GRID restarted and failed to rejoin, node 2 was evicted again by the cluster's split-brain mechanism; thanks to the rebootless restart feature introduced in clusterware 11.2.0.2, only the GRID software was restarted each time and the host was left running. At 16:20:29, however, the CRSD resources could not be cleaned up successfully while the clusterware was shutting down, which triggered the node reboot mechanism, so the operating system was rebooted as well.
The root cause of the abnormal network heartbeat was that another device on the network was using an IP address that conflicted with the cluster's private interconnect IP. After that conflict was resolved, the cluster started normally following the host reboot, and the cluster is currently in a healthy state.