A colleague forwarded a case in which a customer's not-yet-live RAC cluster had gone down unexpectedly and asked for an analysis of the cause. The analysis showed that this RAC restart was a little unusual: it was not just a shared-storage link failure taking the cluster down, it also involved Member Kill Escalation, a feature introduced with 11gR2 RAC, so it is worth writing down.

    The causes of the crash of one node of the RAC cluster are analyzed as follows:

    1. The logs of the individual RAC components show that at 2017-05-16 12:29:02.255 the OCSSD process on node 1 began reporting that it could not perform disk heartbeats (clssscMonitorThreads clssnmvDiskPingThread not scheduled for 9510 msecs). At Tue May 16 12:29:07 2017 the ASM instance also detected the I/O problem and reported WARNING: Waited 15 secs for write IO to PST disk 0 in group 1., and at Tue May 16 12:29:08 2017 it force-dismounted the ASM disk group (SQL> alter diskgroup OCR dismount force /* ASM SERVER:2483636086 */). Node 1's OCSSD process then stopped responding at 2017-05-16 12:29:34.011. The system log of node 1's host shows multipath errors and I/O failures starting at May 16 12:29:01 jgdb1 multipathd: 8:208: mark as failed. By the time the incident was handled the host could no longer be logged into and had to be cold-restarted by hand; the suspicion is that the Linux system had already hung or the kernel had crashed, but no log is available to confirm this, so all that can be concluded is that the operating system stopped responding.

    2. The host log on node 2 also recorded multipath errors at May 16 12:29:01, and again at May 16 12:33:04. At Tue May 16 12:34:09 2017 the database instance on node 2 detected that node 1's LMS process was abnormal and asked for node 1 to be evicted from the database cluster, but it could not communicate with node 1's processes. At Tue May 16 12:35:27 2017 it issued a remote instance eviction (member kill); the corresponding thread of the local OCSSD process received the request and processed it, but it could not be completed within the 30-second limit, so the member kill was escalated to a NODE KILL (the Member Kill Escalation feature of Oracle 11g RAC). Even after the kill block had been written to the voting files, node 1 still could not be evicted (node 1 was unable to read the VF kill block); only once node 1's DHB (disk heartbeat) timed out was node 1 considered DOWN. Cluster reconfiguration started at 2017-05-16 12:36:27.574 and completed at 2017-05-16 12:36:27.579, and from then on the RAC cluster on node 2 continued to run normally. (The CSS heartbeat and eviction timeouts involved here can be checked with the command sketch after this list.)

    3. From the above, the root cause of the RAC unexpectedly losing one node is that after the shared-storage links failed, node 1's operating system also stopped responding; node 2's database instance processes then triggered the 11gR2 RAC Member Kill Escalation feature and finally completed the cluster reconfiguration, leaving node 2 as the only surviving node in the cluster.
    4. Troubleshooting the storage link errors involved checking the multipath configuration on the Linux hosts as well as the storage array itself and the physical storage links, and a storage fault at that time has indeed been found, but that goes beyond this analysis of why the RAC cluster node went down and is not covered here.
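
The heartbeat and eviction behavior described above is governed by a handful of CSS timeouts. As a quick reference, this is a minimal sketch of how to read them on an 11gR2 cluster, using standard crsctl commands run as the Grid Infrastructure owner; the defaults noted in the comments are generic 11gR2 values, not values taken from this customer's system:

# network heartbeat threshold (default 30s)
crsctl get css misscount
# voting disk I/O timeout (default 200s)
crsctl get css disktimeout
# time a node is allowed for stopping/rebooting itself (default 3s)
crsctl get css reboottime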

The following is the analysis of the relevant RAC cluster logs:

1. Cluster alert log analysis

Node 1 cluster alert log:

2017-05-16 12:07:23.031: 

[ctssd(12345)]CRS-2408:The clock on host test1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

2017-05-25 13:47:50.027: 

[ohasd(11569)]CRS-2112:The OLR service started on node test1.

2017-05-25 13:47:50.072: 

[ohasd(11569)]CRS-1301:Oracle High Availability Service started on node test1.

 

Node 2 cluster alert log:

It can be seen that node 1 (test1) was evicted from the cluster at 2017-05-16 12:35:57.570. Next, the CSSD process logs of both nodes are examined to determine exactly why the node was evicted:

2017-04-14 14:14:31.310: [cssd(11500)]CRS-1601:CSSD Reconfiguration complete. Active nodes are test1 test2 .

2017-05-16 12:35:27.068: 

[cssd(11500)]CRS-1663:Member kill issued by PID 12409 for 1 members, group DBJGNCDB. Details at (:CSSGM00044:) in /app/11.2/grid/log/test2/cssd/ocssd.log.

2017-05-16 12:35:57.570: 

[cssd(11500)]CRS-1607:Node test1 is being evicted in cluster incarnation 372874506; details at (:CSSNM00007:) in /app/11.2/grid/log/test2/cssd/ocssd.log.  

2017-05-16 12:36:27.579: 

[cssd(11500)]CRS-1601:CSSD Reconfiguration complete. Active nodes are test2 .

2017-05-25 13:49:54.695: 

[cssd(11500)]CRS-1601:CSSD Reconfiguration complete. Active nodes are test1 test2 .

2. CSSD process log analysis

Node 1 CSSD process log:

2017-05-16 12:29:01.254: [CSSD][1005053696]clssnmSendingThread: sending status msg to all nodes

2017-05-16 12:29:01.254: [CSSD][1005053696]clssnmSendingThread: sent 5 status msgs to all nodes

2017-05-16 12:29:02.255: [CSSD][1001899776]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 9510 msecs   --from here on it reports that the disk heartbeat has been missed for 9.5 seconds (clssnmvDiskPingThread is not running because the disks cannot be reached); clssnmvDiskPing is responsible for sending disk heartbeats to the cluster voting disks and also for reading the kill block information in the voting disks, to decide whether the local node needs to restart.

2017-05-16 12:29:06.255: [CSSD][1005053696]clssnmSendingThread: sent 5 status msgs to all nodes

…………

2017-05-16 12:29:34.011: [CSSD][1773254400]clssscMonitorThreads clssnmvWorkerThread not scheduled for 40780 msecs   --last entry before node 1 stopped responding; the log only resumes after the cold restart on May 25

2017-05-25 13:47:58.735: [CSSD][2692044544]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
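
The freeze timeline on node 1 was built from these clssscMonitorThreads warnings. A minimal sketch of extracting them in one go, assuming node 1's ocssd.log sits in the same layout as node 2's path shown in the cluster alert log above:

# how long the disk ping / worker threads went unscheduled before node 1 froze
grep "not scheduled for" /app/11.2/grid/log/test1/cssd/ocssd.log | tail -50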

Node 2 CSSD process log:

2017-05-16 12:28:57.269: [    CSSD][3057432320]clssnmSendingThread: sent 4 status msgs to all nodes

2017-05-16 12:29:00.254: [    CSSD][3054278400]clssscMonitorThreads clssnmvWorkerThread not scheduled for 7890 msecs

…………

2017-05-16 12:32:53.991: [    CSSD][3086796544]clssgmDiscEndpcl: gipcDestroy 0x2789863a

2017-05-16 12:32:55.358: [    CSSD][3057432320]clssnmSendingThread: sending status msg to all nodes

…………

2017-05-16 12:35:27.068: [    CSSD][3086796544]clssgmExecuteClientRequest: Member kill request from client (0x7f0bac338460)  --a member kill request from a client; this is where the member kill escalation starts

2017-05-16 12:35:27.068: [    CSSD][3086796544](:CSSGM00044:)clssgmReqMemberKill: Kill requested map 0x00000001 flags 0x2 escalate 0xffffffff

2017-05-16 12:35:27.069: [    CSSD][3052701440]clssgmMbrKillThread: Kill requested map 0x00000001 id 2 Group name DBJGNCDB flags 0x00000001 start time 0xc7a20712 end time 0xc7a27e36 time out 30500 req node 2 ----node 1 has to finish terminating its local LMON member process within 30 seconds, otherwise this escalates to a NODE KILL

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmMbrKillThread: Time up: Start time -945682670 End time -945652170 Current time -945652170 timeout 30500  --at this point the 30-second member kill timeout threshold has been reached

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmMbrKillThread: Member kill request complete.

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmMbrKillSendEvent: Missing answers or immediate escalation: Req member 1 Req node 2 Number of answers expected 0 Number of answers outstanding 1 

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmQueueGrockEvent: groupName(DBJGNCDB) count(2) master(1) event(11), incarn 0, mbrc 0, to member 1, events 0x68, state 0x0

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmMbrKillEsc: Escalating node 1 Member request 0x00000001 Member success 0x00000000 Member failure 0x00000000 Number left to kill 1 --the member kill has failed

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssnmMarkNodeForRemoval: node 1, test1 marked for removal

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssnmKillNode: node 1 (test1) kill initiated  --at this point it is escalated to a node kill

2017-05-16 12:35:57.569: [    CSSD][3052701440]clssgmMbrKillThread: Exiting

2017-05-16 12:35:57.569: [    CSSD][3059009280]clssnmDiscHelper: test1, node(1) connection failed, endp (0x25fadf92), probe(0x3000000000), ninf->endp 0x7f0b25fadf92

…………

2017-05-16 12:35:57.578: [    CSSD][3074852608]clssnmvDiskEvict: Kill block write, file /dev/mapper/ocr3 flags 0x00010004, kill block unique 1492150359, stamp 3349315126/3349315126 --node 2 writes the kill block into the voting file (VF)

2017-05-16 12:35:57.578: [    CSSD][3055855360]clssnmWaitOnEvictions: node 1, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1494909357, 2758275534, 2757884), seedhbimpd TRUE

2017-05-16 12:35:57.579: [    CSSD][3065341696]clssnmvDiskEvict: Kill block write, file /dev/mapper/ocr1 flags 0x00010004, kill block unique 1492150359, stamp 3349315126/3349315126

2017-05-16 12:35:57.579: [    CSSD][3055855360]clssnmWaitOnEvictions: node 1, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1494909357, 2758275534, 2757884), seedhbimpd TRUE

…………

2017-05-16 12:35:57.948: [    CSSD][3066918656]clssnmvDiskKillCheck: not evicted, file /dev/mapper/ocr1 flags 0x00000000, kill block unique 0, my unique 1478674884  ---the eviction has not succeeded

2017-05-16 12:35:57.953: [    CSSD][3076429568]clssnmvDiskKillCheck: not evicted, file /dev/mapper/ocr3 flags 0x00000000, kill block unique 0, my unique 1478674884

2017-05-16 12:35:57.954: [    CSSD][3071674112]clssnmvDiskKillCheck: not evicted, file /dev/mapper/ocr2 flags 0x00000000, kill block unique 0, my unique 1478674884

2017-05-16 12:35:57.980: [    CSSD][3055855360]clssnmWaitOnEvictions: node 1, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1494909357, 2758275534, 2757884), seedhbimpd TRUE

………………

2017-05-16 12:36:07.792: [    CSSD][3055855360]clssnmWaitOnEviction: node(1) exceeded graceful shutdown period, IPMI-kill allowed if configured

2017-05-16 12:36:07.792: [    CSSD][3055855360]clssnmWaitOnEviction: Node kill could not be performed. Admin or connection validation failed

………………

2017-05-16 12:36:27.558: [    CSSD][3055855360]clssnmWaitOnEviction: Node kill could not be performed. Admin or connection validation failed

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmWaitOnEvictions: node 1, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1494909357, 2758275534, 2757884), seedhbimpd TRUE

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmCheckKillStatus: Node 1, test1, down, due to observed lack of DHBs, LATS 3349315126, current time 3349345126  --node 1 is considered DOWN because its DHB has timed out

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmCheckKillStatus: Node 1, test1, down, LATS(3349315126),timeout(30000)

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmBldSendUpdate: syncSeqNo(372874506)

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmBldSendUpdate: using msg version 4

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmDoSyncUpdate: Sync 372874506 complete!

2017-05-16 12:36:27.573: [    CSSD][3055855360]clssnmRcfgMgrThread: Reconfig in progress...

2017-05-16 12:36:27.573: [    CSSD][3054278400]clssnmHandleUpdate: sync[372874506] src[2], msgvers 4 icin 372874484

2017-05-16 12:36:27.573: [    CSSD][3054278400]clssnmHandleUpdate: common properties are 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17

2017-05-16 12:36:27.573: [    CSSD][3054278400]clssnmUpdateNodeState: node test1, number 1, current state 5, proposed state 0, current unique 1492150359, proposed unique 1492150359, prevConuni 1492150359, birth 372874505

2017-05-16 12:36:27.574: [    CSSD][3054278400]clssnmDeactivateNode: node 1, state 5

2017-05-16 12:36:27.574: [    CSSD][3054278400]clssnmDeactivateNode: node 1 (test1) left cluster

…………

2017-05-16 12:36:27.574: [    CSSD][3169834752]clssgmStartNMMon: node 1 failed, birth (372874505, 0) (old/new)

2017-05-16 12:36:27.574: [    CSSD][3169834752]clssgmStartNMMon: node 2 active, birth 372874501

2017-05-16 12:36:27.574: [    CSSD][3169834752]clssgmUpdateEventValue: Reconfig Event  val 1, changes 16

2017-05-16 12:36:27.574: [    CSSD][3169834752]clssgmCompareSwapEventValue: changed CmInfo State  val 3, from 2, changes 54

2017-05-16 12:36:27.574: [    CSSD][3054278400]clssnmHandleAck: node test2, number 2, sent ack type 15 for wrong reconfig; ack is for reconfig 372874506 and we are on reconfig 372874507

2017-05-16 12:36:27.574: [    CSSD][3052701440]clssgmReconfigThread:  started for reconfig (372874506)   --the cluster reconfiguration begins

2017-05-16 12:36:27.574: [    CSSD][3052701440]NMEVENT_RECONFIG [00][00][00][04]

2017-05-16 12:36:27.574: [    CSSD][3052701440]clssgmWaitOnEventValue: after HoldRequest  val 1, eval 1 waited 0

………………

2017-05-16 12:36:27.579: [    CSSD][3052701440]clssgmQueueGrockEvent: groupName(CSS_INTERNAL_NODE_GROUP) count(1) master(2) event(1), incarn 372874506, mbrc 1, to member 0, events 0x0, state 0x0

2017-05-16 12:36:27.579: [    CSSD][3052701440]clssnmNotifyReq: type (22)

2017-05-16 12:36:27.579: [    CSSD][3052701440]clssgmReconfigThread:  Completed reconfiguration for incarnation 372874506 successfully   --the cluster reconfiguration (RECONFIG) completes at this point
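
The kill block above is written to and read back from the voting files (/dev/mapper/ocr1, ocr2 and ocr3 in this cluster). Which devices are configured as voting files can be confirmed from any surviving node with a standard crsctl query, for example:

# list the voting files used by the kill block mechanism
crsctl query css votedisk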

3. The member kill escalation process

 

Node 2 DB alert log analysis: ----node 2's DB alert log shows the member kill being issued

Tue May 16 00:08:32 2017
Archived Log entry 124 added for thread 2 sequence 63 ID 0xb0d5455a dest 1:
Tue May 16 02:00:00 2017
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Tue May 16 12:34:09 2017  --at this point node 2 detects that node 1's LMS process is abnormal
LMS1 (ospid: 12417) has detected no messaging activity from instance 1
LMS1 (ospid: 12417) issues an IMR to resolve the situation
Please check LMS1 trace file for more detail.
Tue May 16 12:34:09 2017
Communications reconfiguration: instance_number 1
Tue May 16 12:34:37 2017
IPC Send timeout detected. Sender: ospid 20092 [[email protected] (J000)]
Receiver: inst 1 binc 429480353 ospid 13393
IPC Send timeout to 1.2 inc 42 for msg type 36 from opid 43  --numerous communication timeouts
IPC Send timeout: Terminating pid 43 osid 20092
Tue May 16 12:34:45 2017
IPC Send timeout detected. Sender: ospid 12409 [[email protected] (LMON)]
Receiver: inst 1 binc 429480369 ospid 13385
Tue May 16 12:35:07 2017  --at this point node 2 issues the member kill request to evict node 1's database instance
Evicting instance 1 from cluster
Waiting for instances to leave: 1 
Tue May 16 12:35:27 2017  --the remote instance eviction (member kill) is issued at this moment, matching the entry in node 2's OCSSD log
Remote instance kill is issued with system inc 44
Remote instance kill map (size 1) : 1 
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1 
Tue May 16 12:35:37 2017
Waiting for instances to leave: 1 
Tue May 16 12:35:58 2017
IPC Send timeout detected. Sender: ospid 12417 [[email protected] (LMS1)]
Receiver: inst 1 binc 429480353 ospid 13393
IPC Send timeout to 1.2 inc 42 for msg type 36 from opid 14
Tue May 16 12:36:07 2017
Waiting for instances to leave: 1 
Tue May 16 12:36:08 2017
IPC Send timeout detected. Sender: ospid 12451 [[email protected] (MMON)]
Receiver: inst 1 binc 429480353 ospid 13393
minact-scn: got error during useg scan e:12751 usn:1
minact-scn: useg scan erroring out with error e:12751
IPC Send timeout detected. Sender: ospid 12451 [[email protected] (MMON)]
Receiver: inst 1 binc 429480353 ospid 13393
IPC Send timeout to 1.2 inc 42 for msg type 32 from opid 29
Tue May 16 12:36:28 2017  --the instance-level RECONFIG starts here, after the cluster reconfiguration had completed -- matching the OCSSD log
Reconfiguration started (old inc 42, new inc 46)
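
The alert log above points to the LMS1 trace file for more detail. A minimal sketch of locating the LMS and LMON traces under the standard 11g ADR layout; the db_unique_name jgncdb and instance name jgncdb2 used in the paths are only inferred from the DBJGNCDB group name in the CSSD log, so treat them as assumptions:

# LMS1 and LMON trace files referenced by the alert log (paths are an assumption)
ls -lrt $ORACLE_BASE/diag/rdbms/jgncdb/jgncdb2/trace/*_lms1_*.trc
ls -lrt $ORACLE_BASE/diag/rdbms/jgncdb/jgncdb2/trace/*_lmon_*.trc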

Instance 1 DB alert log: no useful information

Tue May 16 02:00:00 2017
Closing scheduler window
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Thu May 25 13:50:55 2017
Adjusting the default value of parameter parallel_max_servers
from 1280 to 970 due to the value of parameter processes (1000)
Starting ORACLE instance (normal)

Node 1 ASM alert log:

Tue May 16 12:29:07 2017  --the cluster's OCSSD process had already been reporting since 2017-05-16 12:29:02.255 that it could not perform disk heartbeats
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Tue May 16 12:29:07 2017
NOTE: process _b000_+asm1 (18900) initiating offline of disk 0.3916019599 (OCR_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (18900) initiating offline of disk 2.3916019597 (OCR_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 11 for pid 27, osid 18900
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp = 1, dsk = 0/0xe969b78f, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 2/0xe969b78d, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 12 for pid 27, osid 18900
ERROR: no read quorum in group: required 2, found 1 disks
Tue May 16 12:29:07 2017
NOTE: cache dismounting (not clean) group 1/0x94094776 (OCR) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 18911, image: [email protected] (B001)
Tue May 16 12:29:07 2017
NOTE: halting all I/Os to diskgroup 1 (OCR)
Tue May 16 12:29:07 2017
NOTE: LGWR doing non-clean dismount of group 1 (OCR)
NOTE: LGWR sync ABA=5.57 last written ABA 5.57
WARNING: Offline for disk OCR_0000 in mode 0x7f failed.
WARNING: Offline for disk OCR_0002 in mode 0x7f failed.
Tue May 16 12:29:07 2017
NOTE: No asm libraries found in the system
Tue May 16 12:29:08 2017
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Tue May 16 12:29:08 2017
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 44)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE 
 20 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 1
Tue May 16 12:29:08 2017
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x94094776 (OCR) 
SQL> alter diskgroup OCR dismount force /* ASM SERVER:2483636086 */ 
Tue May 16 12:29:08 2017
NOTE: cache deleting context for group OCR 1/0x94094776
GMON dismounting group 1 at 13 for pid 28, osid 18911
NOTE: Disk OCR_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
Thu May 25 13:50:17 2017
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
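
Once the storage links are back, the state of the OCR disk group and its disks can be verified from the ASM instance. A minimal sketch against the standard v$ views, run as the grid owner with the ASM instance environment set:

# verify disk group mount state and per-disk status after storage recovery
sqlplus -S / as sysasm <<'EOF'
select name, state from v$asm_diskgroup;
select group_number, name, path, mode_status, state from v$asm_disk order by group_number, name;
EOF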

Node 2 ASM alert log:

Tue May 16 12:33:08 2017 ---an I/O problem really did occur at this point; it can be matched against both the host log and the OCSSD log: 2017-05-16 12:33:08.869: [    CSSD][3086796544]clssscMonitorThreads clssnmvWorkerThread not scheduled for 17100 msecs
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
Tue May 16 12:36:28 2017
Reconfiguration started (old inc 44, new inc 46)
List of instances:
 

 

4. Host-level log analysis

 

Instance 1 host log:

May 16 12:07:23 test1 CLSD: The clock on host test1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
May 16 12:29:01 test1 multipathd: 8:208: mark as failed
May 16 12:29:01 test1 multipathd: data4: remaining active paths: 3
May 16 12:29:01 test1 multipathd: 8:144: mark as failed
May 16 12:29:01 test1 multipathd: fra: remaining active paths: 3
May 16 12:29:01 test1 multipathd: 8:240: mark as failed
May 16 12:29:01 test1 multipathd: ocr1: remaining active paths: 3
May 16 12:29:01 test1 multipathd: 65:16: mark as failed
May 16 12:29:01 test1 multipathd: ocr3: remaining active paths: 3
May 16 12:29:01 test1 kernel: sd 1:0:0:0: rejecting I/O to offline device
May 16 12:29:01 test1 kernel: sd 0:0:1:4: rejecting I/O to offline device
May 16 12:29:01 test1 kernel: sd 0:0:1:4: [sdn] killing request
May 16 12:29:01 test1 kernel: sd 0:0:1:4: rejecting I/O to offline device
May 16 12:29:01 test1 kernel: sd 0:0:1:4: [sdn] Unhandled error code
May 16 12:29:01 test1 kernel: sd 0:0:1:4: [sdn]  
May 16 12:29:01 test1 kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 16 12:29:01 test1 kernel: sd 0:0:1:4: [sdn] CDB: 
May 16 12:29:01 test1 kernel: Read(10): 28 00 00 00 a8 20 00 00 20 00
May 16 12:29:01 test1 kernel: end_request: I/O error, dev sdn, sector 43040
May 16 12:29:01 test1 kernel: device-mapper: multipath: Failing path 8:208.
May 16 12:29:01 test1 kernel: ------------[ cut here ]------------
May 16 12:29:01 test1 kernel: WARNING: at fs/sysfs/dir.c:536 sysfs_add_one+0xbb/0xe0()
May 16 12:29:01 test1 kernel: Hardware name: System x3650 M5: -[5462AC1]-
May 16 12:29:01 test1 kernel: sysfs: cannot create duplicate filename '/firmware/efi/vars/dump-type2-0-0-1494908941-cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0'
May 16 12:29:01 test1 kernel: Modules linked in: autofs4 cpufreq_ondemand ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_round_robin dm_multipath uinput coretemp acpi_cpufreq freq_table mperf kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr cdc_ether usbnet mii tg3 hwmon ptp pps_core shpchp sg acpi_pad ext4 jbd2 mbcache sr_mod cdrom sd_mod mxm_wmi aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci xhci_hcd megaraid_sas lpfc scsi_transport_fc scsi_tgt crc_t10dif wmi dm_mirror dm_region_hash dm_log dm_mod
May 16 12:29:01 test1 kernel: Pid: 4100, comm: kworker/3:2 Tainted: G        W    3.8.13-16.2.1.el6uek.x86_64 #1
May 16 12:29:01 test1 kernel: Call Trace:
May 16 12:29:01 test1 kernel: [<ffffffff8105d2af>] warn_slowpath_common+0x7f/0xc0
May 16 12:29:01 test1 kernel: [<ffffffff8105d3a6>] warn_slowpath_fmt+0x46/0x50
………………
May 16 12:29:01 test1 kernel: [<ffffffff810829a0>] ? kthread_freezable_should_stop+0x70/0x70
May 16 12:29:01 test1 kernel: ---[ end trace 829697669f49ed2e ]---
May 16 12:29:01 test1 kernel: sd 1:0:0:6: [sdz] killing request
May 16 12:29:01 test1 kernel: sd 1:0:0:6: rejecting I/O to offline device
May 16 12:29:02 test1 multipathd: 65:48: mark as failed
May 16 12:29:02 test1 multipathd: fra: remaining active paths: 2  --no further messages after this point
May 25 13:47:41 test1 kernel: imklog 5.8.10, log source = /proc/kmsg started.  ---this is where the Linux system starts up again
May 25 13:47:41 test1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="11024" x-info="http://www.rsyslog.com"] start
May 25 13:47:41 test1 kernel: Initializing cgroup subsys cpuset
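
For completeness, the path states that multipathd is complaining about can also be checked directly at the OS level. A minimal sketch with standard device-mapper-multipath commands, run as root:

# multipath topology with per-path active/failed state
multipath -ll
# live path status as seen by the multipath daemon
multipathd -k'show paths'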

Instance 2 host log:

Node 2 also detected the storage multipath failures at May 16 12:29:01:
May 14 03:26:01 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="10597" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 16 12:29:01 localhost multipathd: 8:32: mark as failed
May 16 12:29:01 localhost multipathd: data2: remaining active paths: 3
May 16 12:29:01 localhost kernel: sd 0:0:0:4: rejecting I/O to offline device
May 16 12:29:01 localhost kernel: sd 1:0:0:1: rejecting I/O to offline device
May 16 12:29:01 localhost kernel: sd 1:0:0:1: [sdc] killing request
May 16 12:29:01 localhost kernel: sd 1:0:0:1: rejecting I/O to offline device
May 16 12:29:01 localhost kernel: sd 1:0:0:2: rejecting I/O to offline device
…………
May 16 12:29:01 localhost kernel: Write(10): 2a
May 16 12:29:01 localhost kernel: Read(10): 28 00 00 08 00 04 00 00 01 00
May 16 12:29:01 localhost kernel: end_request: I/O error, dev sdh, sector 524292
May 16 12:29:01 localhost kernel: 00<4>device-mapper: multipath: Failing path 8:112.
………………
May 16 12:30:31 localhost multipathd: ocr3: sdab - tur checker reports path is down
May 16 12:30:31 localhost multipathd: data3: sdv - tur checker reports path is down
May 16 12:30:31 localhost multipathd: data4: sdw - tur checker reports path is down
………………
May 16 12:32:03 localhost multipathd: data2: sdu - tur checker reports path is down
May 16 12:32:04 localhost multipathd: data4: sde - tur checker reports path is down
May 16 12:32:04 localhost multipathd: ocr2: sdi - tur checker reports path is down
May 16 12:32:14 localhost kernel: sd 1:0:0:0: rejecting I/O to offline device
May 16 12:32:14 localhost kernel: sd 1:0:0:1: rejecting I/O to offline device
…………
May 16 12:32:44 localhost multipathd: data1: sdb - tur checker reports path is down
May 16 12:32:44 localhost multipathd: data2: sdc - tur checker reports path is down