Recently, a customer's Exadata Database Machine hit a storage cell (storage node) reboot that left the cluster in an abnormal state and crashed the database.

Given the redundancy built into an Exadata machine, the reboot of a single storage cell should not affect the cluster or the databases running on it. In this case, however, the storage cell failure led to CSSD problems and a database instance crash, which makes it a fairly serious incident.
The analysis of this incident is as follows:

1. Analysis of the database instance crash

The database alert log shows that I/O errors appeared, the ASM disks in the DATAC1 disk group became inconsistent (some mirror offline operations failed), and the CKPT process then terminated the database instance. A sketch of how the ASM disk state can be checked afterwards follows the log excerpt.

Mon Apr 17 02:33:43 2023
LNS: Standby redo logfile selected for thread 1 sequence 158688 for destination LOG_ARCHIVE_DEST_4
Mon Apr 17 04:23:35 2023
Dumping diagnostic data in directory=[cdmp_20230417042335], requested by (instance=2, osid=45059 (LGWR)), summary=[abnormal instance termination].
Mon Apr 17 04:23:37 2023
Exadata cell: o/10.10.11.5;10.10.11.6 is no longer accessible. I/O errors to disks on this might get suppressed
Mon Apr 17 04:23:37 2023
NOTE: process _ckpt_exhxyy1 (329639) initiating offline of disk 6.4040391157 (DATAC1_CD_00_XYDBCELADM01) with mask 0x7e[0x7] in group 1
WARNING: failed to offline some disks corresponding to the mirrors of virtual extent 0 of file 377 in group 1 
Mon Apr 17 04:23:38 2023
NOTE: process _dbw2_exhxyy1 (329617) initiating offline of disk 16.4040391159 (DATAC1_CD_04_XYDBCELADM01) with mask 0x7e[0x7] in group 1
Errors in file /u01/app/oracle/diag/rdbms/exhxyy/exhxyy1/trace/exhxyy1_ckpt_329639.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DATAC1/exhxyy/controlfile/control01.ctl'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATAC1"
Errors in file /u01/app/oracle/diag/rdbms/exhxyy/exhxyy1/trace/exhxyy1_ckpt_329639.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DATAC1/exhxyy/controlfile/control01.ctl'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATAC1"
CKPT (ospid: 329639): terminating the instance due to error 221
Mon Apr 17 04:23:38 2023
opiodr aborting process unknown ospid (331142) as a result of ORA-1092
Mon Apr 17 04:23:38 2023
ORA-1092 : opitsk aborting process
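
As a hedged illustration (not output from this system), the ASM disk group and disk state could be verified afterwards from a compute node as sketched below; v$asm_diskgroup and v$asm_disk are standard ASM views, and the DATAC1_CD name pattern is taken from the log above.

# Sketch only: run on a compute node as the Grid Infrastructure owner.
sqlplus -s / as sysasm <<'EOF'
set lines 200 pages 100
col name format a32
col path format a45
-- Disk group level: state (MOUNTED/DISMOUNTED) and number of offline disks
select name, state, type, offline_disks from v$asm_diskgroup;
-- Disk level: disks served by the failed cell would show an OFFLINE/MISSING state
select name, path, mount_status, mode_status, state
from   v$asm_disk
where  name like 'DATAC1_CD%'
order  by name;
EOF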

2. Analysis of the clusterware logs

The clusterware alert.log shows the CSSD threads hanging on voting-file I/O; the disk groups then went OFFLINE and the OCR became inaccessible. A sketch of the corresponding status checks follows the log excerpt.

2023-04-17 02:45:38.884 [CLSECHO(39596)]AFD-0654: AFD is not supported on Exadata systems
2023-04-17 04:22:42.204 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 8390 msecs.
2023-04-17 04:22:42.205 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 8220 msecs.
2023-04-17 04:22:50.206 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 16390 msecs.
2023-04-17 04:22:50.206 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 16220 msecs.
2023-04-17 04:22:58.207 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 24390 msecs.
2023-04-17 04:22:58.208 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvDiskPingThread_0 not scheduled for 24220 msecs.
2023-04-17T04:22:59.999388+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:00.000 [OCSSD(321843)]CRS-1649: An I/O error occurred for voting file: o/10.10.11.5;10.10.11.6/DATAC1_CD_05_xydbceladm01; details at (:CSSNM00059:) in /u01/app/grid/diag/crs/xydbdb01/crs/trace/ocssd.trc.
2023-04-17T04:23:01.012702+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17T04:23:06.018806+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:06.209 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 32390 msecs.
2023-04-17T04:23:10.022766+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:14.210 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 40400 msecs.
2023-04-17T04:23:15.031168+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17T04:23:20.041501+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:22.211 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 48400 msecs.
2023-04-17T04:23:25.047325+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17T04:23:29.054162+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:30.212 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 56400 msecs.
2023-04-17T04:23:34.060058+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17T04:23:37.855759+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:37.843 [ORAAGENT(329109)]CRS-5011: Check of resource "nlisdb" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/xydbdb01/crs/trace/crsd_oraagent_oracle.trc"
2023-04-17 04:23:37.857 [OCSSD(321843)]CRS-1649: An I/O error occurred for voting file: o/10.10.11.5;10.10.11.6/DATAC1_CD_05_xydbceladm01; details at (:CSSNM00060:) in /u01/app/grid/diag/crs/xydbdb01/crs/trace/ocssd.trc.
2023-04-17 04:23:37.864 [ORAAGENT(329109)]CRS-5011: Check of resource "exlisdb" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/xydbdb01/crs/trace/crsd_oraagent_oracle.trc"
2023-04-17 04:23:37.871 [ORAAGENT(329109)]CRS-5011: Check of resource "htbase" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/xydbdb01/crs/trace/crsd_oraagent_oracle.trc"
2023-04-17 04:23:38.219 [OCSSD(321843)]CRS-1719: Cluster Synchronization Service daemon (CSSD) clssnmvWorkerThread_0 not scheduled for 64400 msecs.
2023-04-17 04:23:38.825 [ORAAGENT(329109)]CRS-5011: Check of resource "exhxyy" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/xydbdb01/crs/trace/crsd_oraagent_oracle.trc"
2023-04-17T04:23:39.063328+08:00
ossnet_fail_defcon: Giving up on Cell 10.10.11.5 as retry limit (9) reached.
2023-04-17 04:23:39.065 [CRSD(325263)]CRS-2771: Maximum restart attempts reached for resource 'ora.DATAC1.dg'; will not restart.
2023-04-17 04:23:39.146 [CRSD(325263)]CRS-2769: Unable to failover resource 'ora.DATAC1.dg'.
2023-04-17 04:23:40.285 [CRSD(325263)]CRS-2771: Maximum restart attempts reached for resource 'ora.RECOC1.dg'; will not restart.
2023-04-17 04:23:40.286 [CRSD(325263)]CRS-2769: Unable to failover resource 'ora.RECOC1.dg'.
2023-04-17 04:23:41.328 [CRSD(325263)]CRS-1006: The OCR location  is inaccessible. Details in /u01/app/grid/diag/crs/xydbdb01/crs/trace/crsd.trc.
2023-04-17 04:23:41.184 [ORAROOTAGENT(320423)]CRS-5019: All OCR locations are on ASM disk groups [DATAC1], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/grid/diag/crs/xydbdb01/crs/trace/ohasd_orarootagent_root.trc".
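
For reference, a minimal sketch of the checks that correspond to these messages (voting files, CRS resources, OCR), using standard Grid Infrastructure commands; the resource and disk group names come from the log above.

# Sketch only: run on a compute node as root or the grid owner.
# Voting files: after the cell failure they are reported with an error state.
crsctl query css votedisk

# Resource overview: ora.DATAC1.dg / ora.RECOC1.dg stay OFFLINE once the
# restart attempts in the log are exhausted.
crsctl stat res -t

# OCR accessibility (the OCR is stored in +DATAC1 on this cluster).
ocrcheck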

3. Analysis of the storage cell's messages log

Analysis of the messages log indicates that slow disk I/O triggered a known bug in the MegaRAID controller firmware, which crashed the controller and caused the storage cell to reboot. See the MOS notes Megaraid Controller Encountered a Fatal Error and was Reset (Doc ID 2277377.1) and Exadata Node Crashed at "memcpy_erms+0x6/0x10" After a Megaraid HBA Firmware Crash due to Disk Failure (Doc ID 2581914.1) for details; a sketch of the related checks on the cell follows the log excerpt.

Apr 17 04:22:11 xydbceladm01 systemd: Removed slice User Slice of cellmonitor.
Apr 17 04:22:33 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#0 task abort called for scmd(ffff880da4290c40)
Apr 17 04:22:33 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#0 CDB: Read(16) 88 00 00 00 00 05 e2 ab e0 00 00 00 08 00 00 00
Apr 17 04:22:33 xydbceladm01 kernel: sd 0:2:5:0: task abort: FAILED scmd(ffff880da4290c40)
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#1 task abort called for scmd(ffff880d6fe3eac0)
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#1 CDB: Read(16) 88 00 00 00 00 01 51 df 9d d0 00 00 01 f0 00 00
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: task abort: FAILED scmd(ffff880d6fe3eac0)
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: target reset called for scmd(ffff880da4290c40)
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#0 megasas: target reset FAILED!!
Apr 17 04:22:54 xydbceladm01 kernel: sd 0:2:5:0: [sdf] tag#9 Controller reset is requested due to IO timeout#012SCSI command pointer: (ffff88263f640000)#011 SCSI host state: 5#011 SCSI
Apr 17 04:22:54 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 17 04:22:57 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Apr 17 04:22:58 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: Number of host crash buffers allocated: 512
Apr 17 04:22:58 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: Crash Dump is available,number of copied buffers: 99
Apr 17 04:22:58 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: Found FW in FAULT state, will reset adapter scsi0.
Apr 17 04:22:58 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: resetting fusion adapter scsi0.
Apr 17 04:23:02 xydbceladm01 kernel: megaraid_sas 0000:5e:00.0: waiting for controller reset to finish
Apr 17 04:39:13 xydbceladm01 kernel: Initializing cgroup subsys cpuset
Apr 17 04:39:13 xydbceladm01 kernel: Initializing cgroup subsys cpu
Apr 17 04:39:13 xydbceladm01 kernel: Initializing cgroup subsys cpuacct
Apr 17 04:39:13 xydbceladm01 kernel: Linux version 4.1.12-124.26.12.el7uek.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16.0.3) (GCC) ) #2 SMP Wed May 8 22:25:03 PDT 2019
Apr 17 04:39:13 xydbceladm01 kernel: Command line: BOOT_IMAGE=/vmlinuz-4.1.12-124.26.12.el7uek.x86_64 root=/dev/md24p5 bootarea=c0d0 bootfrom=BOOT ro loglevel=7 panic=60 pci=noaer log_buf_len=1m nmi_watchdog=0 transparent_hugepage=never rd_NO_PLYMOUTH crashkernel=448M@128M audit=1 console=tty1 console=ttyS0,115200n8 pciehp.pciehp_surprise=1 rd.auto=1 ifnames_skip=100 biosdevname=0 net.ifnames=0 rd.md.waitclean=1
Apr 17 04:39:13 xydbceladm01 kernel: e820: BIOS-provided physical RAM map:
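
A hedged sketch of the checks that could be run on the storage cell to compare against the MOS notes above; imageinfo and cellcli are standard Exadata tools, while the MegaCli path shown is only the typical location and may differ between image versions.

# Sketch only: run on the storage cell as root.
# Exadata storage server image version, to compare with the fixed releases in the MOS notes.
imageinfo

# Cell and grid disk health as seen by the cell software.
cellcli -e list cell detail
cellcli -e list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome

# Disk controller firmware level (MegaCli location may vary by image version).
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i firmware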

Summary:

After the incident, the status of the compute nodes and storage cells was checked: the storage cell came back to a normal state once it had rebooted. The clusterware, however, remained abnormal; after the cluster stack was forcibly shut down and restarted by hand, the GRID and DATABASE resources returned to normal and the application was back in service (a sketch of the commands involved is shown below).
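
The recovery path described above roughly corresponds to the following commands. This is a sketch rather than the exact sequence used; the database name exhxyy is taken from the alert log paths.

# Sketch only: run as root on each compute node once the storage cell is back online.
crsctl stop crs -f        # force-stop the clusterware stack
crsctl start crs          # start it again

# Then verify as the grid/oracle owner:
crsctl stat res -t                   # GRID and database resources back ONLINE
srvctl status database -d exhxyy     # database name taken from the alert log paths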

A deeper analysis of the problem is as follows:

1. The reboot of storage cell 1: based on the messages log, the most likely cause is the MegaRAID controller firmware bug described above, triggered by slow disk I/O, which reset the controller and rebooted the storage cell.

2. The dismount of the ASM disk group on the compute nodes while storage cell 1 was rebooting:
Cross-checking the Oracle documentation, this matches the known issue in which a short storage-subsystem outage causes the ASM instance to dismount the disk group, which then has to be mounted again manually (see MOS note: ASM Diskgroup With Normal Redundancy Dismounts After Failure of Only One Failure Group (Oracle recommends having a Quorum Disk or Third Failure Group) (Doc ID 2344251.1)). A sketch of the suggested remediation follows.
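
A minimal sketch of the remediation implied by that note: mount the dismounted disk group manually, then add a quorum failure group (or a third failure group) so the loss of a single failure group no longer forces a dismount. The quorum disk path below is illustrative only and not taken from this system; on Exadata such quorum devices are typically prepared with the quorumdiskmgr utility.

# Sketch only: run on a compute node as the Grid Infrastructure owner.
sqlplus -s / as sysasm <<'EOF'
-- 1) Mount the disk group that was dismounted after the cell outage.
ALTER DISKGROUP DATAC1 MOUNT;
-- 2) Per Doc ID 2344251.1, add a quorum failure group (or a third failure group);
--    the disk path below is illustrative only.
ALTER DISKGROUP DATAC1 ADD QUORUM FAILGROUP QFG1
  DISK '/dev/exadata_quorum/QD_DATAC1_NODE1';
EOF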