The following is another typical analysis of a RAC node GRID software restart caused by an abnormal network heartbeat. (With the 11gR2 Rebootless Restart feature, the default behavior is to restart GRID without rebooting the host; in this case, however, the private network IP was occupied by another device, GRID restarted several times, and one of the shutdowns failed to complete cleanly, which triggered a host reboot.) This is the fault analysis report written for the customer at the time, provided here for reference.

I. Service Overview

Abnormal reboot of cluster node 2 in a customer's CRM system RAC; after receiving the fault report, the ** engineer responded promptly and, through in-depth analysis of the relevant logs and other information, compiled and summarized this document.

 

II. Analysis of the Abnormal Reboot of Database Cluster Node 2

A review of the cluster logs shows that starting at 2019-03-08 15:30:22 the network heartbeat between the two cluster nodes became abnormal, and at 2019-03-08 15:30:36.797 node 2 was evicted from the cluster according to the cluster's split-brain resolution mechanism. The relevant cluster logs are shown below.

1. Logs of node 2 being evicted from the cluster

1.1 Cluster node 1 log information

 

Node 1 cluster alert log:

2019-03-08 15:30:22.278:

[cssd(5543)]CRS-1612:Network communication with node swcrm2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.516 seconds

2019-03-08 15:30:30.282:

[cssd(5543)]CRS-1611:Network communication with node swcrm2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.512 seconds

2019-03-08 15:30:34.284:

[cssd(5543)]CRS-1610:Network communication with node swcrm2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.510 seconds

2019-03-08 15:30:36.797:

[cssd(5543)]CRS-1607:Node swcrm2 is being evicted in cluster incarnation 433013533; details at (:CSSNM00007:) in /oracle/11.2.0/grid/log/crm1/cssd/ocssd.log.
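
The removal countdowns in the CRS-1612/1611/1610 messages above follow from the CSS misscount, which is 30 seconds here (see "misstime(30000)" in the node 1 CSSD trace in section 1.3): when N% of the timeout interval has been missed, roughly (1 - N%) x misscount seconds remain before eviction. A minimal Python sketch of this arithmetic; the logged values are slightly smaller because the polling thread runs on a fixed interval:

# Relation between the "missing for N% of timeout interval" warnings and the
# time remaining before eviction, assuming misscount = 30 s (confirmed by
# "misstime(30000)" in the node 1 CSSD trace).
MISSCOUNT = 30.0  # seconds

for pct in (0.50, 0.75, 0.90):
    remaining = MISSCOUNT * (1.0 - pct)
    print(f"{int(pct * 100)}% missed -> ~{remaining:.1f} s until removal")

# Expected output, next to the values reported in the alert log:
#   50% missed -> ~15.0 s until removal   (log: 14.516 seconds)
#   75% missed -> ~7.5 s until removal    (log: 6.512 seconds)
#   90% missed -> ~3.0 s until removal    (log: 2.510 seconds)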

1.2 Cluster node 2 log information

Node 2 cluster alert log:

2018-09-27 01:50:21.772:

[crsd(10018)]CRS-2772:Server 'swcrm1' has been assigned to pool 'ora.crm'.

2019-03-08 15:30:30.568:

[cssd(9624)]CRS-1612:Network communication with node swcrm1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.018 seconds

2019-03-08 15:30:37.130:

[cssd(9624)]CRS-1608:This node was evicted by node 1, swcrm1; details at (:CSSNM00005:) in /oracle/11.2.0/grid/log/crm2/cssd/ocssd.log.

2019-03-08 15:30:37.131:

[cssd(9624)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/11.2.0/grid/log/crm2/cssd/ocssd.log

2019-03-08 15:30:37.132:

[cssd(9624)]CRS-1652:Starting clean up of CRSD resources..

 

1.3 Cluster node 1 CSSD process details

Node 1 CSSD process log:

2019-03-08 15:30:34.284: [    CSSD][39]clssnmPollingThread: node swcrm2 (2) at 90% heartbeat fatal, removal in 2.510 seconds,

…………

2019-03-08 15:30:36.796: [ CSSD][41]clssnmDoSyncUpdate: Terminating node 2, swcrm2, misstime(30000) state(5)

2019-03-08 15:30:36.796: [ CSSD][41]clssnmDoSyncUpdate: Wait for 0 vote ack(s)

2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckDskInfo: Checking disk info...

2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckSplit: Node 2, swcrm2, is alive, DHB (1552030236, 4056057127) more than disk timeout of 27000 after the last NHB (1552030206, 4056027274)

2019-03-08 15:30:36.796: [ CSSD][41]clssnmCheckDskInfo: My cohort: 1

2019-03-08 15:30:36.796: [ CSSD][41]clssnmRemove: Start

2019-03-08 15:30:36.796: [ CSSD][41](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, swcrm2, from the cluster in incarnation 433013533, node birth incarnation 433013530, death incarnation 433013533, stateflags 0x264000 uniqueness value 1537978854

2019-03-08 15:30:36.797: [ CSSD][1]clssgmQueueGrockEvent: groupName(IGCRMSYS$USERS)

…………

2019-03-08 15:30:44.814: [ CSSD][42]clssnmUpdateNodeState: node swcrm1, number 1, current state 3, proposed state 3, current unique 1537984150, proposed unique 1537984150, prevConuni 0, birth 433013532

2019-03-08 15:30:44.814: [ CSSD][42]clssnmUpdateNodeState: node swcrm2, number 2, current state 5, proposed state 0, current unique 1537978854, proposed unique 1537978854, prevConuni 1537978854, birth 433013530

2019-03-08 15:30:44.814: [ CSSD][42]clssnmDeactivateNode: node 2, state 5

2019-03-08 15:30:44.814: [ CSSD][42]clssnmDeactivateNode: node 2 (swcrm2) left cluster
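
The clssnmCheckSplit/clssnmCheckDskInfo entries above show the split-brain resolution: node 1's cohort contains only itself, and node 2 is still alive on disk but has no network heartbeat. The usual 11.2 rule is that the larger cohort survives and, when the cohorts are the same size as in this two-node cluster, the cohort containing the lowest node number survives. A simplified, illustrative model of that rule (not Oracle's actual implementation):

# Simplified model of the 11.2-era split-brain survival rule (illustrative only):
# the larger cohort survives; on a tie, the cohort holding the lowest node
# number survives.
def surviving_cohort(cohort_a, cohort_b):
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

# In this incident each node could only see itself across the interconnect:
print(surviving_cohort({1}, {2}))   # -> {1}, so node 2 is the one evicted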

1.4 Cluster node 2 CSSD process details

Node 2 CSSD process log:

The entry at 2019-03-08 15:30:30.569 shows that the network heartbeat with node 1 had been lost while the disk heartbeat remained normal ("node 1, swcrm1, has a disk HB, but no network HB"):



2019-03-08 15:30:29.812: [ CSSD][40]clssnmSendingThread: sending status msg to all nodes

2019-03-08 15:30:29.812: [ CSSD][40]clssnmSendingThread: sent 4 status msgs to all nodes

2019-03-08 15:30:29.815: [ CSSD][5]clssgmpcBuildNodeList: nodename for node 0 is NULL

2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: node swcrm1 (1) at 50% heartbeat fatal, removal in 14.018 seconds

2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: node swcrm1 (1) is impending reconfig, flag 2229260, misstime 15982

2019-03-08 15:30:30.569: [ CSSD][39]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)

2019-03-08 15:30:30.569: [ CSSD][29]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763125, LATS 4056051052, lastSeqNo 50762918, uniqueness 1537984150, timestamp 1552030230/4135610688

2019-03-08 15:30:30.569: [ CSSD][32]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763126, LATS 4056051052, lastSeqNo 50763087, uniqueness 1537984150, timestamp 1552030230/4135611038

2019-03-08 15:30:30.570: [ CSSD][35]clssnmvDHBValidateNcopy: node 1, swcrm1, has a disk HB, but no network HB, DHB has rcfg 433013533, wrtcnt, 50763127, LATS 4056051052, lastSeqNo 50762920, uniqueness 1537984150, timestamp 1552030230/4135611056

2019-03-08 15:30:30.621: [ CSSD][30]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030230/4056051104

2019-03-08 15:30:30.635: [ CSSD][33]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030230/4056051118

2019-03-08 15:30:30.907: [ CSSD][5]clssscSelect: cookie accept request 100a27af0

…………

2019-03-08 15:30:37.131: [ CSSD][28](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node swcrm1, number 1, sync 433013533, stamp 4135617359

2019-03-08 15:30:37.131: [ CSSD][28]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps

2019-03-08 15:30:37.131: [ CSSD][30]clssnmvDiskPing: Writing with status 0x3, timestamp 1552030237/4056057614

2019-03-08 15:30:37.131: [ CSSD][28]###################################

2019-03-08 15:30:37.131: [ CSSD][28]clssscExit: CSSD aborting from thread clssnmvKillBlockThread

2019-03-08 15:30:37.131: [ CSSD][28]###################################

2019-03-08 15:30:37.131: [ CSSD][28](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

2019-03-08 15:30:37.131: [ CSSD][28]clssnmSendMeltdownStatus: node swcrm2, number 2, has experienced a failure in thread number 10 and is shutting down

2019-03-08 15:30:37.132: [ CSSD][28]clssscExit: Starting CRSD cleanup

………… (cleanup of CRSD resources begins)

2019-03-08 15:30:40.767: [ CSSD][28]clssscExit: CRSD cleanup status 0

2019-03-08 15:30:40.768: [ CSSD][28]clssscExit: CRSD cleanup successfully completed
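
The recurring signature "has a disk HB, but no network HB" is what separates an interconnect problem from a storage problem: the voting disks were still being written while the private network was unreachable. A minimal sketch for pulling this evidence out of ocssd.log; the path is the one reported in the alert log above, so adjust it for your environment:

# Scan ocssd.log for the "disk HB but no network HB" signature quoted above,
# confirming that the voting disks were reachable while the interconnect was not.
# The path is taken from the alert log messages in this report.
LOG = "/oracle/11.2.0/grid/log/crm2/cssd/ocssd.log"
SIGNATURE = "has a disk HB, but no network HB"

with open(LOG, errors="replace") as f:
    hits = [line.rstrip() for line in f if SIGNATURE in line]

print(f"{len(hits)} matching lines")
for line in hits[-5:]:      # the last few occurrences are usually enough
    print(line)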

 

 

2. Analysis of the node 2 operating system reboot

About the rebootless restart feature of 11.2 clusterware:

Starting with clusterware version 11.2.0.2, the Oracle rebootless restart feature makes the Grid Infrastructure (GI) restart the clusterware stack, rather than reboot the node, when any of the following occurs:

1. A node continuously misses network heartbeats for longer than misscount.

2. A node cannot access a majority of the voting files (VF).

3. A member kill is escalated to a node kill.

In earlier versions, the clusterware (CRS) rebooted the node directly in these situations.

Before restarting the cluster stack, GRID first performs a graceful shutdown of the clusterware; the basic steps are as follows:

1. Stop all heartbeats of the local node (network heartbeat, disk heartbeat, and local heartbeat).

2. Notify the cssd agent that ocssd.bin is about to stop.

3. Stop all I/O-capable processes registered with CSS, such as lmon.

4. CSSD notifies CRSD to stop all resources; if CRSD cannot stop all resources successfully, a node reboot will still occur.

5. CSSD waits for all I/O-capable processes to exit; if these processes cannot all exit within the short I/O timeout, a node reboot will still occur.

6. Notify the cssd agent that all I/O-capable processes have exited.

7. OHASD restarts the cluster stack.

8. The local node notifies the other nodes to perform a cluster reconfiguration.

In special cases, for example a problem with ocssd.bin itself (such as a hang) or a reboot caused by operating system performance issues, rebootless restart cannot take effect: ocssd.bin is no longer working normally in those situations, so a node reboot remains unavoidable.

 

The cluster involved in this incident runs version 11.2.0.4.0, so when a node is evicted because of lost network heartbeats, normally only the clusterware is restarted and the operating system is not rebooted. Analysis of node 2's CSSD process log shows that after node 2 was evicted from the cluster at 2019-03-08 15:30, the GRID clusterware on node 2 was started another four times, and a host reboot occurred before the last of these startups. The detailed log analysis follows.

2.1 CSSD process startup records

2019-03-08 15:30:48.991: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552030248

=====>>>>> No host reboot occurred before this startup



2019-03-08 16:06:37.698: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552032397

=====>>>>> No host reboot occurred before this startup



2019-03-08 16:17:27.769: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552033047

=====>>>>> No host reboot occurred before this startup





2019-03-08 16:27:26.269: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.4.0, in (clustered) mode with uniqueness value 1552033646

=====>>>>> A host reboot occurred before this startup
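
The startup records above, and the shutdown results in section 2.2 below, can be extracted from node 2's ocssd.log mechanically. A minimal sketch using the message strings quoted in this report; a non-zero "CRSD cleanup status" is the case in which a host reboot can still follow a rebootless restart:

# Extract CSS daemon startups and CRSD cleanup results from node 2's ocssd.log,
# using the message strings quoted in this report.
LOG = "/oracle/11.2.0/grid/log/crm2/cssd/ocssd.log"

starts, cleanups = [], []
with open(LOG, errors="replace") as f:
    for line in f:
        if "clssscmain: Starting CSS daemon" in line:
            starts.append(line.rstrip())
        elif "clssscExit: CRSD cleanup status" in line:
            cleanups.append(line.rstrip())

print(f"CSS daemon startups: {len(starts)}")
for line in starts:
    print("  " + line)
print(f"CRSD cleanup results: {len(cleanups)}")
for line in cleanups:
    print("  " + line)   # status 0 = clean shutdown; non-zero (e.g. 184) = failed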

2.2 Shutdown log information before each GRID restart:

Node 2's first clusterware shutdown completed normally:

2019-03-08 15:30:40.767: [ CSSD][28]clssscExit: CRSD cleanup status 0

2019-03-08 15:30:40.768: [ CSSD][28]clssscExit: CRSD cleanup successfully completed



Node 2's second clusterware shutdown completed normally:

2019-03-08 16:06:32.101: [ CSSD][28]clssscExit: CRSD cleanup status 0

2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking RPC Q

2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking dead client Q

2019-03-08 16:06:32.102: [ CSSD][5]clssgmProcClientReqs: Checking dead proc Q

2019-03-08 16:06:32.102: [ CSSD][28]clssscExit: CRSD cleanup successfully completed





Node 2's third clusterware shutdown did not complete normally.

This shutdown of the clusterware failed while cleaning up CRSD resources, which is why the host was rebooted:

2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup status 184

2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup failed with 184

The details are as follows:

2019-03-08 16:20:29.296: [ CSSD][28]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps

2019-03-08 16:20:29.296: [ CSSD][28]###################################

2019-03-08 16:20:29.296: [ CSSD][28]clssscExit: CSSD aborting from thread clssnmvKillBlockThread

2019-03-08 16:20:29.296: [ CSSD][28]###################################

2019-03-08 16:20:29.296: [ CSSD][28](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

2019-03-08 16:20:29.296: [ CSSD][28]clssnmSendMeltdownStatus: node swcrm2, number 2, has experienced a failure in thread number 10 and is shutting down

2019-03-08 16:20:29.296: [ CSSD][34](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node swcrm1, number 1, sync 433013537, stamp 4138609451

2019-03-08 16:20:29.296: [ CSSD][34]clssnmRemoveNodeInTerm: node 2, swcrm2 terminated due to Normal Shutdown. Removing from member and connected bitmaps

2019-03-08 16:20:29.296: [ CSSD][34]clssscExit: abort already set 1

2019-03-08 16:20:29.296: [ CSSD][28]clssscExit: Starting CRSD cleanup

^^^^^^

2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup status 184

2019-03-08 16:20:29.538: [ CSSD][28]clssscExit: CRSD cleanup failed with 184

2019-03-08 16:27:26.268: [ CSSD][1]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0

III. Summary and Follow-up Recommendations

On 2019-03-08, starting at 15:30:22, the network heartbeat between the cluster nodes was lost, and node 2 was evicted from the cluster at 15:30:36. Node 2 then automatically restarted the clusterware and repeatedly tried to rejoin the cluster at 2019-03-08 15:30:48, 16:06:37, and 16:17:27, but each attempt failed because the network heartbeat was still abnormal.

Since each attempt to rejoin the cluster after a GRID restart failed, node 2 was repeatedly evicted by the cluster's split-brain mechanism; thanks to the rebootless restart feature introduced in clusterware 11.2.0.2, only the GRID clusterware was restarted each time and the host was not rebooted. At 16:20:29, however, the CRSD resources could not be cleaned up successfully during the clusterware shutdown, which triggered the node-reboot fallback and caused the operating system to reboot as well.

The root cause of the abnormal network heartbeat was that another IP address on the network conflicted with the cluster's private interconnect IP address. After this conflict was resolved, the cluster started normally following the host reboot, and the cluster is currently in a normal state.