In a test environment running 11.2.0.4 RAC, one of the nodes went down. After checking the logs, the following messages turned up in the octssd log:

2016-01-17 23:15:20.564: [    CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [3].

2016-01-17 23:15:20.564: [    CTSS][1175029504]ctsscomm_recv_cb4_2: Receive active version change msg. Old active version [186647552] New active version [186647552].

2016-01-17 23:15:20.564: [    CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].

2016-01-17 23:15:20.564: [    CTSS][1175029504]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].

2016-01-17 23:15:20.564: [    CTSS][1168725760]ctssslave_swm2_3: Received time sync message from master.

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm: sendtime{sec[1453043718], usec[550689]}, receivetime{sec[1453043720], usec[564960]}.

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm: The RTT of sync msg [2014271] is too large for time sync to be accurate. Recommends retry. Returns [17].

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm: Received from master (mode [0x8c] nodenum [1] hostname [jason1] )

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctsselect_monitor_steysync_mode: Failed in clsctssslave_sync_with_master [17]. Retries [0/3]. 

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm1_1: Waiting for last time sync process to finish. sync_state[6].

2016-01-17 23:15:20.565: [    CTSS][1175029504]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm1_2: Ready to initiate new time sync process.

2016-01-17 23:15:20.565: [    CTSS][1168725760]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].

2016-01-17 23:15:20.566: [    CTSS][1175029504]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].

2016-01-17 23:15:20.566: [    CTSS][1175029504]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].

2016-01-17 23:15:20.566: [    CTSS][1168725760]ctssslave_swm2_3: Received time sync message from master.

2016-01-17 23:15:20.566: [    CTSS][1168725760]ctssslave_swm: The magnitude [733548803120 usec] of the offset [733548803120 usec] is larger than [86400000000 usec] sec which is the CTSS limit.

2016-01-17 23:15:20.566: [    CTSS][1168725760]ctsselect_monitor_steysync_mode: Failed in clsctssslave_sync_with_master [12]: Time offset is too much to be corrected

2016-01-17 23:15:20.566: [    CTSS][1175029504]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler

2016-01-17 23:15:21.287: [    CTSS][1190360832]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xd0], offset[733548803 ms]}, length=[8].

2016-01-17 23:15:21.287: [    CTSS][1168725760]ctsselect_monitor_steysync_mode: CTSS daemon exiting [12].

2016-01-17 23:15:21.287: [    CTSS][1168725760]CTSS daemon aborting

2016-01-17 23:15:22.290: [    CTSS][1190360832]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xd0], offset[733548803 ms]}, length=[8].
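The decisive line in the log is the "offset ... is larger than ... CTSS limit" message. As a rough sketch of how to read it, the Python snippet below pulls the two microsecond values out of that line and converts them to days. The helper name `parse_offset` and the regular expression are my own illustration, written against the message text shown here; other Clusterware versions may word the message differently.

```python
import re

# Log message copied from the octssd log in this post.
LINE = (
    "2016-01-17 23:15:20.566: [    CTSS][1168725760]ctssslave_swm: "
    "The magnitude [733548803120 usec] of the offset [733548803120 usec] "
    "is larger than [86400000000 usec] sec which is the CTSS limit"
)

USEC_PER_DAY = 86400 * 1_000_000


def parse_offset(line):
    """Return (offset_usec, limit_usec) from a CTSS offset message, or None.

    Hypothetical helper: the pattern only matches the wording seen above.
    """
    m = re.search(
        r"of the offset \[(-?\d+) usec\] is larger than \[(\d+) usec\]", line
    )
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


offset_usec, limit_usec = parse_offset(LINE)
offset_days = offset_usec / USEC_PER_DAY
limit_days = limit_usec / USEC_PER_DAY
print(f"offset = {offset_days:.2f} days, limit = {limit_days:.2f} days")
```

So the clock on this node is off by roughly eight and a half days, while CTSS in this log refuses to correct anything beyond one day.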

Checking the time on the two servers gives:

jason1:~ # date

Sat Jan  9 11:37:18 CST 2016

jason2:~ # date

Sun Jan 17 23:23:12 CST 2016

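The clocks on the two servers are about 8 days apart (the log reports an offset of 733548803120 usec), while the largest offset the Oracle time synchronization service (CTSS) will correct is 86400000000 usec, that is, 1 day. The gap is far beyond that limit, so the CTSS daemon on one node aborted and the node was evicted from the cluster. For the same reason, when the node restarted and tried to rejoin the cluster, it failed again. Bringing the two servers' clocks back in line therefore fixes the node-down problem: first shut down the cluster, then set both nodes to the same, current time, and finally start the cluster again or reboot both servers. After that the problem was gone.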
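As a quick cross-check of the numbers above, here is a small Python sketch that takes the two `date` outputs from this post and computes the gap between them. The function name `parse_date_output` is hypothetical, and the parsing assumes the default English (C locale) output format of `date` with both hosts in the same time zone, so the zone token is simply dropped.

```python
from datetime import datetime, timedelta

# `date` output from the two nodes, as shown in this post.
JASON1 = "Sat Jan  9 11:37:18 CST 2016"
JASON2 = "Sun Jan 17 23:23:12 CST 2016"

# One day, the limit reported in the octssd log (86400000000 usec).
CTSS_LIMIT = timedelta(days=1)


def parse_date_output(text):
    """Parse default `date` output, ignoring the time zone field.

    Assumption: both hosts use the same zone, so dropping it does not
    change the difference between the two timestamps.
    """
    fields = text.split()
    without_zone = " ".join(fields[:4] + fields[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")


gap = abs(parse_date_output(JASON2) - parse_date_output(JASON1))
exceeds = gap > CTSS_LIMIT
print(f"gap = {gap} ({gap.total_seconds():.0f} s), exceeds limit: {exceeds}")
```

The two `date` commands were not run at the same instant, so the result is only approximate, but it lands within a few seconds of the 733548803 ms offset that octssd logged.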

Reference: http://blog.itpub.net/4227/viewspace-695164/