数据库:oracle 11.2.0.4

系统:CentOS 7.9

环境:rac (2节点)

实验目标:模拟not all endpoints registered异常,以及提供解决方案.

1、数据库信息

1.1、IP 信息

[root@hisdb1 bin]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.133.160 netmask 255.255.255.0 broadcast 192.168.133.255
inet6 fe80::cb50:11ab:ea13:255c prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:3d:aa:80 txqueuelen 1000 (Ethernet)
RX packets 4823 bytes 576628 (563.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3797 bytes 672394 (656.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens33:1:flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.133.162 netmask 255.255.255.0 broadcast 192.168.133.255
ether 00:0c:29:3d:aa:80 txqueuelen 1000 (Ethernet)

ens37: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.11.125 netmask 255.255.255.0 broadcast 192.168.11.255
inet6 fe80::4ff:774a:4d4c:50d3 prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:3d:aa:8a txqueuelen 1000 (Ethernet)
RX packets 468507 bytes 343799561 (327.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 401829 bytes 258994097 (246.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens37:1:flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.59.33 netmask 255.255.0.0 broadcast 169.254.255.255
ether 00:0c:29:3d:aa:8a txqueuelen 1000 (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 80599 bytes 46592307 (44.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 80599 bytes 46592307 (44.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:90:46:e7 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

说明:192.168.133.162该vip后面作为测试IP,特别注意.

[oracle@hisdb1
~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.133.160 hisdb1
#Public IP (ens33)
192.168.133.160 hisdb1
192.168.133.161 hisdb2

#Private IP (ens37)
192.168.11.125 hisdb1-priv
192.168.11.126 hisdb2-priv

#Virtual IP
192.168.133.162 hisdb1-vip
192.168.133.163 hisdb2-vip

#Scan IP
192.168.133.164 hisdb-scan

1.2、登陆验证

[oracle@hisdb1 ~]$ sqlplus sys/oracle_4U@192.168.133.162:1521/pacs as sysdba
[oracle@hisdb2 ~]$ sqlplus sys/oracle_4U@192.168.133.162:1521/pacs as sysdba

说明:两个节点使用hisdb1-vip均能登陆数据库.

1.3、集群状态

[grid@hisdb1:/home/grid]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.LISTENER.lsnr
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.OCR.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.asm
ONLINE ONLINE hisdb1 Started
ONLINE ONLINE hisdb2 Started
ora.gsd
OFFLINE OFFLINE hisdb1
OFFLINE OFFLINE hisdb2
ora.net1.network
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.ons
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE hisdb2
ora.cvu
1 ONLINE ONLINE hisdb1
ora.hisdb1.vip
1 ONLINE ONLINE hisdb1
ora.hisdb2.vip
1 ONLINE ONLINE hisdb2
ora.oc4j
1 ONLINE ONLINE hisdb1
ora.orcl.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.pacs.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.scan1.vip
1 ONLINE ONLINE hisdb2

说明:此处注意scan ip在hisdb2上.

1.4、vip状态

查看vip 192.168.133.162状态

[root@hisdb1 bin]# ./srvctl status vip -n hisdb1
VIP hisdb1-vip is enabled
VIP hisdb1-vip is running on node: hisdb1
[root@hisdb1 bin]# pwd
/u01/app/11.2.0/grid/bin

2、模拟异常

2.1、关闭vip

现在关闭vip 192.168.133.162.

[root@hisdb1 bin]# ./srvctl stop vip -n hisdb1 -f
[root@hisdb1 bin]# ./srvctl status vip -n hisdb1
VIP hisdb1-vip is enabled
VIP hisdb1-vip is not running

说明:关闭hisdb1的vip后,ifconfig不会显示ens33:1信息.

2.2、集群状态

关闭hisdb1-vip后查看集群状态.

[grid@hisdb1:/home/grid]$
crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.LISTENER.lsnr
OFFLINE OFFLINE hisdb1
ONLINE ONLINE hisdb2
ora.OCR.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.asm
ONLINE ONLINE hisdb1 Started
ONLINE ONLINE hisdb2 Started
ora.gsd
OFFLINE OFFLINE hisdb1
OFFLINE OFFLINE hisdb2
ora.net1.network
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.ons
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE hisdb2
ora.cvu
1 ONLINE ONLINE hisdb1
ora.hisdb1.vip
1 OFFLINE OFFLINE

ora.hisdb2.vip
1 ONLINE ONLINE hisdb2
ora.oc4j
1 ONLINE ONLINE hisdb1
ora.orcl.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.pacs.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.scan1.vip
1 ONLINE ONLINE hisdb2

说明:此时节点1的监听状态以及vip变为OFFLINE.而且通过节点1的公有IP、VIP均不能登陆数据库. 通过节点2的公有ip、vip、scan ip连接无异常.

[oracle@hisdb1 ~]$ sqlplus sys/oracle_4U@192.168.133.162:1521/pacs as sysdba
报错:
ORA-12543: TNS:destination host unreachable
ORA-01017: invalid username/password; logon denied
SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus
[oracle@hisdb1 ~]$ sqlplus sys/oracle_4U@192.168.133.160:1521/pacs as sysdba
报错:ORA-12541:TNS:no listener

2.3、解决方案

2.3.1、试错方案

如上发现监听状态为off,rac环境用oracle用户启动监听.说明:scan ip在节点2,才会出现以下情况.

scan ip在节点1的情形笔者测试后未保存数据,不过你也可以自己测试后在评论区留言.

[oracle@hisdb1 ~]$ lsnrctl start

LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 22-JUL-2022 13:24:22

Copyright (c) 1991, 2013, Oracle. All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production
Start Date 22-JUL-2022 13:24:05
Uptime 0 days 0 hr. 0 min. 17 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener
Log File /u01/app/oracle/diag/tnslsnr/hisdb1/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=hisdb1)(PORT=1521)))
The listener supports no services
The command completed successfully

数分钟后发现无服务,即使alter system register重新注册监听,依然无服务.

grid用户查看监听情况如下,监听出现异常.

[grid@hisdb1 ~]$ lsnrctl status

LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 22-JUL-2022 13:26:00

Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
TNS-12541:
TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Linux Error: 2: No such file or directory

此时用grid用户启动监听.

[grid@hisdb1 ~]$ lsnrctl start

LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 22-JUL-2022 13:26:25

Copyright (c) 1991, 2013, Oracle. All rights reserved.

Starting /u01/app/11.2.0/grid/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 11.2.0.4.0 - Production
System parameter file is /u01/app/11.2.0/grid/network/admin/listener.ora
Log messages written to /u01/app/grid/diag/tnslsnr/hisdb1/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production
Start Date 22-JUL-2022 13:26:25
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener
Parameter File /u01/app/11.2.0/grid/network/admin/listener.ora
Listener
Log File /u01/app/grid/diag/tnslsnr/hisdb1/listener/alert/log.xml
Listening Endpoints Summary...

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
Services Summary...
Service "orcl" has 1 instance(s).
Instance "orcl", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully

查看集群状态,就发现Not All Endpoints Registered报错,如下所示,此报错成功模拟出来.

该报错也是笔者在一次巡检过程中遇到,想必客户当时就是用oracle用户启动监听后发现状态不对,然后再用grid用户启动监听,查看监听状态如上就以为恢复正常,此处笔者将问题重现后并特别测试使用节点1的公有IP和VIP依然能连接数据库,所以这就导致让客户认为数据库监听已恢复正常.

[grid@hisdb1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.LISTENER.lsnr
ONLINE INTERMEDIATE hisdb1 Not All Endpoints R
egistered
ONLINE ONLINE hisdb2
ora.OCR.dg
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.asm
ONLINE ONLINE hisdb1 Started
ONLINE ONLINE hisdb2 Started
ora.gsd
OFFLINE OFFLINE hisdb1
OFFLINE OFFLINE hisdb2
ora.net1.network
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
ora.ons
ONLINE ONLINE hisdb1
ONLINE ONLINE hisdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE hisdb2
ora.cvu
1 ONLINE ONLINE hisdb1
ora.hisdb1.vip
1 ONLINE ONLINE hisdb1
ora.hisdb2.vip
1 ONLINE ONLINE hisdb2
ora.oc4j
1 ONLINE ONLINE hisdb1
ora.orcl.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.pacs.db
1 ONLINE ONLINE hisdb1 Open
2 ONLINE ONLINE hisdb2 Open
ora.scan1.vip
1 ONLINE ONLINE hisdb2

该试错方案的解决办法:

[grid@hisdb1 ~]$ ps -ef|grep tns
root 32 2 0 10:28 ? 00:00:00 [netns]
oracle 10097 1 0 13:24 ? 00:00:00 /u01/app/oracle/product/11.2.0/db_1/bin/tnslsnr LISTENER -inherit
grid 10230 1 0 13:26 ? 00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
grid 10297 10169 0 13:27 pts/0 00:00:00 grep --color=auto tns
[oracle@hisdb2:/home/oracle]$ ps -ef |grep tns
root 32 2 0 10:28 ? 00:00:00 [netns]
grid 3386 1 0 10:30 ? 00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
grid 3432 1 0 10:30 ? 00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
oracle 8445 5023 0 13:28 pts/0 00:00:00 grep --color=auto tns

说明:发现hisdb1存在两个监听,数据库的ORACLE_HOME目录下也有一个,导致把端口1521占住,kill进程10097,此后监听恢复正常.

2.3.2、正确方案

如前面所看到的,监听没有起来,此时应在hisdb1用grid用户启动监听,就不会出现以上异常,而且vip也随着监听开启.

[grid@hisdb1:/home/grid]$ lsnrctl start
[root@hisdb1 bin]# ./srvctl status vip -n hisdb1
VIP hisdb1-vip is enabled
VIP hisdb1-vip is running on node: hisdb1