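Recently, a customer installed a new Oracle 19c RAC. While rebooting the hosts to test high availability, they found that the database cluster would not start.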

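We followed the order in which the clusterware starts its components and worked through the logs step by step. This showed that the CSSD process was failing at startup. The key error is CRS-1726: Process failed to run in real-time priority, after which the CSSD process terminated abnormally. The MOS note CRS Will Not Successfully Restart After Node Reboot (Doc ID 2720950.1) describes this issue. The fix is to add kernel.sched_rt_runtime_us=-1 to /etc/sysctl.conf and apply it (for example with sysctl -p). After that, restarting either the clusterware or the host brings the cluster back up normally.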
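It is worth confirming the setting on every node before the next reboot test. The following is a minimal Python sketch and is not taken from the MOS note. The helper names (parse_sysctl_conf, check) are hypothetical, and the parser is simplified. It only reads /etc/sysctl.conf and ignores files under /etc/sysctl.d/. The script checks two things: the value persisted in /etc/sysctl.conf, and the value the running kernel exposes in /proc/sys/kernel/sched_rt_runtime_us. On Linux, a value of -1 means real-time tasks are not throttled.

```python
"""Minimal sketch (assumes a Linux host): check that kernel.sched_rt_runtime_us=-1
is both persisted in /etc/sysctl.conf and active in the running kernel."""
from pathlib import Path
from typing import Dict, List, Optional

KEY = "kernel.sched_rt_runtime_us"
CONF_PATH = Path("/etc/sysctl.conf")
PROC_PATH = Path("/proc/sys/kernel/sched_rt_runtime_us")


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Hypothetical, simplified parser for 'key = value' lines.

    Blank lines and comments (# or ;) are skipped; later entries override
    earlier ones; 'kernel/x' style keys are normalized to 'kernel.x'.
    """
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().replace("/", ".")] = value.strip()
    return settings


def check(conf_text: str, runtime_value: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the setting looks correct."""
    problems: List[str] = []
    if parse_sysctl_conf(conf_text).get(KEY) != "-1":
        problems.append(f"{KEY}=-1 is not set in {CONF_PATH}")
    if runtime_value is not None and runtime_value.strip() != "-1":
        problems.append(f"running kernel value is {runtime_value.strip()}, expected -1")
    return problems


def _read(path: Path) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    issues = check(_read(CONF_PATH) or "", _read(PROC_PATH))
    print("OK" if not issues else "\n".join(issues))
```

Run the script on each RAC node as any user that can read /etc/sysctl.conf. If it prints anything other than OK, the node may hit CRS-1726 again on the next reboot.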

The CSSD error log is as follows:

2022-05-16 19:47:06.218 [OCSSD(30237)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 30237
2022-05-16 19:47:07.292 [OCSSD(30237)]CRS-1713: CSSD daemon is started in hub mode
2022-05-16 19:47:07.338 [OCSSD(30237)]CRS-1726: Process failed to run in real-time priority. Details at (:CLSN00143:) in /u01/app/grid/diag/crs/dbm0dbadm01/crs/trace/ocssd.trc.
2022-05-16 19:47:07.338 [OCSSD(30237)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in
2022-05-16 19:47:08.339 [OCSSD(30237)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 30237 experienced fatal signal or exception code 6.
2022-05-16T19:47:08.397403+08:00
Errors in file /u01/app/grid/diag/crs/dbm0dbadm01/crs/trace/ocssd.trc (incident=1):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/dbm0dbadm01/crs/incident/incdir_1/ocssd_i1.trc

2022-05-16 19:47:09.463 [CSSDMONITOR(39942)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 39942
2022-05-16 19:47:09.516 [CSSDMONITOR(39942)]CRS-1726: Process failed to run in real-time priority. Details at (:CLSN00143:) in /u01/app/grid/diag/crs/dbm0dbadm01/crs/trace/ohasd_cssdmonitor_root.trc.
2022-05-16 19:47:18.720 [CSSDAGENT(40501)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 40501
2022-05-16 19:47:18.776 [CSSDAGENT(40501)]CRS-1726: Process failed to run in real-time priority. Details at (:CLSN00143:) in /u01/app/grid/diag/crs/dbm0dbadm01/crs/trace/ohasd_cssdagent_root.trc.