A chain of failures triggered by stopping the NFS service

1. df -h and ls -l hang

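Cause: node 1 (racj1) had mounted an NFS export from another machine, 232.100. Security hardening required the NFS service on that server to be shut down, and it was stopped directly, without first unmounting the NFS directory on racj1. That left commands on RAC node 1 hanging.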
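A minimal sketch of probing a path with a time limit, so that a check like this does not itself hang the way df -h did. The function name probe_path and the 5-second default are illustrative assumptions. It also assumes the blocked stat call can be ended with SIGKILL, which may not hold on every kernel or mount option.

```shell
# Probe a path with a time limit so the shell does not block on a dead NFS server.
# Prints "ok <path>" on success, "unresponsive or missing <path>" otherwise.
# $1: path to probe, $2: time limit in seconds (default 5, an arbitrary choice)
probe_path() {
    if timeout -s KILL "${2:-5}" stat -t -- "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "unresponsive or missing $1"
        return 1
    fi
}

# Harmless demo against the root directory; on the affected node the
# argument would be the NFS mount point instead.
probe_path /
```

A failure here only says that the path did not answer in time or does not exist; it does not say why.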

At this point, open another window on racj1 and check the NFS mount points with the mount command: the server's NFS directory is still mounted on racj1.
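The same check can be done without the mount command by reading the kernel's mount table. The sketch below assumes a table in /proc/mounts format; the function name list_nfs_mounts and the sample server address and export path are placeholders, not values from this incident.

```shell
# List NFS mounts from a mounts table in /proc/mounts format
# (fields: device mountpoint fstype options dump pass).
# $1: path to the table; defaults to /proc/mounts
list_nfs_mounts() {
    # Match the nfs and nfs4 filesystem types only (not nfsd).
    awk '$3 ~ /^nfs4?$/ { print $1, $2 }' "${1:-/proc/mounts}"
}

# Demo on a made-up table so the example runs anywhere.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/sda1 / ext4 rw 0 0
192.0.2.10:/export/share /mnt nfs rw,hard,addr=192.0.2.10 0 0
EOF
list_nfs_mounts "$sample"
rm -f "$sample"
```

Reading the table only opens a local file, so it should not block on the dead server the way a command that touches the mount point can.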

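A forced fuser -ck /mnt then dropped every connection to racj1. After reconnecting about 5 seconds later, df -h and similar commands ran normally. (The safe approach would have been to restart the NFS service on the server so that the client racj1 recovers first, and then umount the directory; that would have avoided the whole chain of failures that followed. I did not expect fuser -ck to cause this much trouble.)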

 

At this point the RAC clusterware on node 1 (racj1) turned out to be down, and all of the database ora_ processes were gone.


 

Starting CRS manually as root brought the cluster back to normal after about 1 minute:

/oracle/app/11.2.0/grid/bin/crsctl start crs
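Rather than waiting and guessing, the state of the stack can be checked with crsctl check crs. The sketch below is a small filter for that output; the function name crs_stack_online is made up, and the sample messages are written from memory of typical 11.2 output rather than captured during this incident.

```shell
# Succeed only when every CRS-nnnn line on stdin ends with "is online".
# Intended input: the output of `crsctl check crs`.
crs_stack_online() {
    awk '/^CRS-[0-9]+:/ { n++; if ($0 !~ /is online$/) bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Demo with sample text (an assumption, not output from this cluster).
crs_stack_online <<'EOF' && echo "stack online"
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
EOF

# On the node itself the call would be:
#   /oracle/app/11.2.0/grid/bin/crsctl check crs | crs_stack_online
```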


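The /ogg directory was then found not to be mounted on node 1, while it was still mounted on node 2. OGG sits on ACFS, the shared cluster file system.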

Every attempt to start it failed.

[grid@racj1 ~]$ asmcmd
ASMCMD> volinfo -a
Diskgroup Name: OGGDG
 
         Volume Name: OGGVOL
         Volume Device: /dev/asm/oggvol-141
         State: ENABLED
         Size (MB): 409600
         Resize Unit (MB): 32
         Redundancy: UNPROT
         Stripe Columns: 4
         Stripe Width (K): 128
         Usage: ACFS
         Mountpath: /ogg
Ran /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141
This also unmounted /ogg on node 2.
Tried to start it again, but the start failed.
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141
PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl stop filesystem -d /dev/asm/oggvol-141
[root@racj1 ~]# /oracle/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/oggvol-141
PRCR-1079 : Failed to start resource ora.oggdg.oggvol.acfs
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/oracle/app/11.2.0/grid/bin/acfssinglefsmount" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj1' failed
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-2674: Start of 'ora.oggdg.oggvol.acfs' on 'racj2' failed
[root@racj1 ~]# more

 


The log did contain the key error; it just went unnoticed at the time:
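The CRS-5016 messages already name the agent log to read. As a sketch, the paths can be pulled out of srvctl's error output so that none is overlooked; the function name crs5016_logs is an assumption, and the two sample lines are copied from the failed start above.

```shell
# Print the distinct log files named in CRS-5016 messages read from stdin.
crs5016_logs() {
    grep 'CRS-5016' | sed -n 's/.* in "\([^"]*\)"$/\1/p' | sort -u
}

crs5016_logs <<'EOF'
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj1/agent/crsd/orarootagent_root//orarootagent_root.log"
CRS-5016: Process "/sbin/acfsutil" spawned by agent "/oracle/app/11.2.0/grid/bin/orarootagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/racj2/agent/crsd/orarootagent_root//orarootagent_root.log"
EOF
```

Note that the second path lives on racj2, so that file has to be read on the other node.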


 

[grid@racj1 ~]$ srvctl stop filesystem -d /dev/asm/oggvol-141
[grid@racj1 ~]$ acfsutil registry -f -a /dev/asm/oggvol-141 /ogg
acfsutil registry: CLSU-00100: Operating System function: open64 failed with error data: 13
acfsutil registry: CLSU-00101: Operating System error message: Permission denied
acfsutil registry: CLSU-00103: error location: OOF_1
acfsutil registry: CLSU-00104: additional error information: open64 (/dev/asm/oggvol-141)
acfsutil registry: ACFS-03141: unable to open device /dev/asm/oggvol-141
At this point a permissions problem was suspected, and comparing nodes 1 and 2 confirmed it:
The permissions on /dev/asm/oggvol-141 on racj1 were wrong.
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brw------- 1 root root 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# chown root:asmadmin /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brw------- 1 root asmadmin 251, 72193 Apr 25 09:18 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# chmod 770 /dev/asm/oggvol-141
[root@racj1 orarootagent_root]# ls -l /dev/asm/oggvol-141
brwxrwx--- 1 root asmadmin
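The comparison between the nodes can be made mechanical. The sketch below checks a "mode owner group" string against an expected value; the default is simply what the device shows after the fix above, which is an assumption and should be confirmed against a healthy node. The function name voldev_perm_ok is made up.

```shell
# Compare a "mode owner group" string against the expected value.
# $1: string in the format produced by: stat -c '%A %U %G' <device>
# $2: expected string (defaults to the post-fix state shown above)
voldev_perm_ok() {
    [ "$1" = "${2:-brwxrwx--- root asmadmin}" ]
}

# Demo with the two states seen in this incident.
for perms in 'brw------- root root' 'brwxrwx--- root asmadmin'; do
    if voldev_perm_ok "$perms"; then
        echo "OK: $perms"
    else
        echo "MISMATCH: $perms"
    fi
done

# On a node the call would be:
#   voldev_perm_ok "$(stat -c '%A %U %G' /dev/asm/oggvol-141)"
```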

crsctl status resource -t showed that ora.oggdg.oggvol.acfs was offline.


The start attempt failed:


Restarting the ACFS resource also failed:

crsctl stop resource ora.oggdg.oggvol.acfs
crsctl start resource ora.oggdg.oggvol.acfs


Mounting manually as root reported an error:

mount -t acfs -rw /dev/asm/oggvol-141 /ogg
[root@racj1 orarootagent_root]# mount -t acfs -rw /dev/asm/oggvol-141 /ogg
mount: wrong fs type, bad option, bad superblock on /dev/asm/oggvol-141,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
dmesg | tail  or so
 
mount.acfs: CLSU-00100: Operating System function: mount failed with error data: 22
mount.acfs: CLSU-00101: Operating System error message: Invalid argument
mount.acfs: CLSU-00103: error location: MOUNT_3
mount.acfs: ACFS-02126: Volume /dev/asm/oggvol-141 cannot be mounted.
 
dmesg showed the key message:
ACFSK-0021: FSCK-NEEDED set for volume /dev/asm/oggvol-141 . Internal ACFS Location 838 .
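Since this message was easy to miss, a small filter over the kernel log can surface it. This is a sketch that matches only the exact wording shown above; the function name fsck_needed_volumes is an assumption.

```shell
# Print each volume device flagged FSCK-NEEDED in kernel log text on stdin.
fsck_needed_volumes() {
    sed -n 's/.*ACFSK-0021: FSCK-NEEDED set for volume \([^ ]*\).*/\1/p' | sort -u
}

# Demo with the message from this incident; prints /dev/asm/oggvol-141
echo 'ACFSK-0021: FSCK-NEEDED set for volume /dev/asm/oggvol-141 . Internal ACFS Location 838 .' | fsck_needed_volumes

# On a node the call would be:
#   dmesg | fsck_needed_volumes
```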


Following that message, fsck ran successfully:

[root@racj1 ~]# /sbin/fsck -a -v -y -t acfs /dev/asm/oggvol-141
[root@racj1 ~]# su - grid
[grid@racj1 ~]$ crsctl start resource ora.oggdg.oggvol.acfs


With the directory mounted again, the OGG startup could continue.

2. OGG fails to start because an archived log is missing:


view report gdcq shows the error:

[/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::ExecMain()+0x60) [0x752c80]]
                          : [/ogg/12c/extract(ggs::gglib::MultiThreading::Thread::RunThread(ggs::gglib::MultiThreading::Thread::ThreadArgs*)+0x14d) [0x753d5d]]
                          : [/ogg/12c/extract(ggs::gglib::MultiThreading::MainThread::Run(int, char**)+0xb1) [0x753e41]]
                          : [/ogg/12c/extract(main+0x3b) [0x6eff1b]]
                          : [/lib64/libc.so.6(__libc_start_main+0xfd) [0x3396a1ed1d]]
                          : [/ogg/12c/extract() [0x69aed1]]
 
2019-04-25 10:12:39  ERROR   OGG-00446  Opening file +ARCHDG/2_5183_986573398.dbf in DBLOGREADER mode: (308) ORA-00308: cannot open archived log '+ARCHDG/2_5183_986573398.dbf'
ORA-17503: ksfdopn:2 Failed to open file +ARCHDG/2_5183_986573398.dbf
ORA-15173: entry '2_5183_986573398.dbf' does not exist in directory '/'
Not able to establish initial position for sequence 5183, rba 1626514448.
 
2019-04-25 10:12:39  ERROR   OGG-01668  PROCESS ABENDING.
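The line that matters for the restore is the one naming the sequence the extract could not position on. As a sketch, it can be pulled out of the report text like this; the function name ogg_missing_seq is an assumption, and the pattern matches only the wording seen in this report.

```shell
# Print the sequence and RBA that the extract failed to position on,
# reading OGG report text from stdin.
ogg_missing_seq() {
    sed -n 's/.*Not able to establish initial position for sequence \([0-9][0-9]*\), rba \([0-9][0-9]*\).*/seq=\1 rba=\2/p'
}

# Demo with the line from the report above; prints seq=5183 rba=1626514448
echo 'Not able to establish initial position for sequence 5183, rba 1626514448.' | ogg_missing_seq
```

The thread number is not in that line. Here it was read from the archived log name 2_5183_986573398.dbf, on the assumption that the name follows a thread_sequence_resetlogs pattern.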

 

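The current policy backs up archived logs to the tape library every 4 hours and then deletes them, so the archived log that OGG needed for recovery had already been removed.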

2.1 Check the current online and archived logs:

[root@racj1 ~]# su - grid
[grid@racj1 ~]$ asmcmd
ASMCMD> ls     
ARCHDG/
CRSDG/
DATADG/
OGGDG/
ASMCMD> cd archdg
ASMCMD> ls
GDDB/
ASMCMD> cd gddb
ASMCMD> ls
ARCHIVELOG/
ASMCMD> cd archivelog
ASMCMD> ls
2019_04_25/
ASMCMD> cd 2019*
ASMCMD> ls
thread_1_seq_7409.702.1006510689
ASMCMD> ls
thread_1_seq_7409.702.1006510689
 
SQL> set line 132 wrap off
SQL> select * from v$Log;
truncating (as requested) before column NEXT_CHANGE#
 
 
    GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE    MEMBERS ARC STATUS           FIRST_CHANGE# FIRST_TIME          NEXT_TIME
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- -----------
         1          1       7409 2147483648        512          1 YES ACTIVE              1.5742E+13 2019-04-25 09:56:19 2019-04-25
         2          1       7407 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:13:43 2019-04-25
         3          1       7410 2147483648        512          1 NO  CURRENT             1.5742E+13 2019-04-25 10:18:08
         4          1       7408 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:19:05 2019-04-25
         5          1       7406 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:37:07 2019-04-25
         6          2       5193 2147483648        512          1 NO  CURRENT             1.5742E+13 2019-04-25 09:56:17
         7          2       5189 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:37:04 2019-04-25
         8          2       5190 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 08:52:05 2019-04-25
         9          2       5191 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:17:26 2019-04-25
        10          2       5192 2147483648        512          1 YES INACTIVE            1.5742E+13 2019-04-25 09:47:27 2019-04-25
 
10 rows selected.

2.2 Log in to RAC node 2, check the NBU backups, and plan the restore of the missing archived logs

 

/usr/openv/netbackup/bin/bplist -S 'nbujxq' -C 'racj2' -t 4 -R -l /
 
 
-rw-rw---- oracle    asmadmin     5052160K 4月 25 09:56 /al_2879_1_1006509383
-rw-rw---- oracle    asmadmin     8343552K 4月 25 09:56 /al_2878_1_1006509383
-rw-rw---- oracle    asmadmin     7269376K 4月 25 09:56 /al_2877_1_1006509383
-rw-rw---- oracle    asmadmin     8310016K 4月 25 09:56 /al_2876_1_1006509383
 
RUN {
allocate channel D1 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
allocate channel D2 type 'sbt_tape' parms 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64';
send 'NB_ORA_SERV=nbujxq,NB_ORA_CLIENT=racj2';
restore archivelog from logseq 5183 until logseq 5193 thread 2;
restore archivelog from logseq 7390 until logseq 7408 thread 1;
RELEASE CHANNEL D1;
RELEASE CHANNEL D2;
}
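The two restore lines can be generated from the sequence ranges, which makes it harder to mistype a bound. The sketch below uses a hypothetical helper, restore_range, that only prints text in the same form as the script above and rejects a reversed range. The demo ends thread 2 at 5192, because the v$log output shows 5193 as the CURRENT log of thread 2, which is therefore not archived yet.

```shell
# Print one RMAN restore line for a thread and an inclusive sequence range.
# $1: thread, $2: first sequence, $3: last sequence
restore_range() {
    if [ "$2" -gt "$3" ]; then
        echo "restore_range: first sequence $2 is greater than last sequence $3" >&2
        return 1
    fi
    printf 'restore archivelog from logseq %s until logseq %s thread %s;\n' "$2" "$3" "$1"
}

# Ranges taken from this incident (thread 2 ends at the last archived log).
restore_range 2 5183 5192
restore_range 1 7390 7408
```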

 


2.3 Check the restore:


2.4 Start the OGG extract


3. Recommendations:

3.1 Before shutting down an NFS service, run showmount -a to see which machines have mounted its directories, and umount them first.

3.2 When something goes wrong, stay calm and read the logs carefully, including the operating system logs.

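3.3 On servers running OGG, do not back up and delete archived logs too quickly. They should generally be kept for 1 day before being backed up and deleted, so that a missing log does not cost a lengthy restore.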
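For recommendation 3.1, the output of showmount -a can be reduced to a list of client hosts to visit before the service is stopped. This is a sketch: the function name nfs_clients and the server name in the sample header are made up, and the parsing assumes the usual one-line header followed by client:/directory lines. The list comes from the server's own records and may be stale or incomplete, so it is worth checking the known clients directly as well.

```shell
# Print the distinct client hosts from `showmount -a` output on stdin.
# Assumes a one-line header followed by lines of the form client:/directory.
nfs_clients() {
    awk -F: 'NR > 1 && NF >= 2 { print $1 }' | sort -u
}

# Demo with made-up output (the header's server name is a placeholder).
nfs_clients <<'EOF'
All mount points on nfs-server:
racj1:/export/share
racj2:/export/share
racj1:/export/other
EOF

# On the NFS server the call would be:
#   showmount -a | nfs_clients
```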