The Cluvfy Tool
Cluvfy (cluster verify) is the command-line interface to the Cluster Verification Utility (CVU), a checking tool shipped with the Oracle clusterware. Its job is to examine every stage and every component of a cluster deployment and verify that Oracle's requirements are met. Cluvfy covers a very wide range of checks, including OS hardware, kernel parameter settings, user resource limits, network settings, NTP settings, and the health of RAC components. Cluvfy never modifies the system configuration while checking, so running it has no impact on the system. Cluvfy is also integrated into many RAC configuration tools; for example, OUI invokes cluvfy at step 14 of the GI installation (the prerequisite check before the install runs), and the final step of the GI installation invokes cluvfy again for a post-installation check. Cluvfy's checks can be viewed from two angles: stages and components.
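Cluvfy can also be run standalone at any time. One detail worth noting: before GI is installed there is no cluvfy binary under the Grid home yet, so Oracle ships a wrapper script, runcluvfy.sh, in the root of the unzipped GI installation media. A minimal sketch (the staging path /stage/grid is a hypothetical path used only for illustration):
[grid@RAC1 ~]$ cd /stage/grid
[grid@RAC1 grid]$ ./runcluvfy.sh stage -pre crsinst -n RAC1,RAC2 -verbose
Once GI is installed, the same checks can be run with the cluvfy binary under the Grid home, as the examples below do.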
1. Stages
Cluvfy divides the whole process of deploying a RAC system into several stages. For example, before installing GI you must finish preparing the OS, the network, and the storage; once that preparation is done, the GI installation can begin, so that point in the process can be thought of as the pre-GI-installation stage.
Before entering the next stage, or after a given stage completes, cluvfy offers a check tailored to that stage, so the user can confirm that the system is ready to move on, or that the work of the current stage is really finished. The following command lists all of the stage checks cluvfy supports. (Stage checks are run both before and after the GI installation.)
[grid@RAC1 ~]$ cluvfy stage -list
USAGE:
cluvfy stage {-pre|-post} <stage-name> <stage-specific options> [-verbose]
Valid Stages are:
-pre cfs : pre-check for CFS setup
-pre crsinst : pre-check for CRS installation
-pre acfscfg : pre-check for ACFS Configuration.
-pre dbinst : pre-check for database installation
-pre dbcfg : pre-check for database configuration
-pre hacfg : pre-check for HA configuration
-pre nodeadd : pre-check for node addition.
-post hwos : post-check for hardware and operating system
-post cfs : post-check for CFS setup
-post crsinst : post-check for CRS installation
-post acfscfg : post-check for ACFS Configuration.
-post hacfg : post-check for HA configuration
-post nodeadd : post-check for node addition.
-post nodedel : post-check for node deletion.
The following cluvfy command runs the system checks required before installing the clusterware.
[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose
Where:
-n specifies the list of nodes to check. User equivalence (passwordless SSH) must already be configured between all of the listed nodes.
-r specifies the software release to be installed. The supported releases are 10gR1, 10gR2, 11gR1, and 11gR2.
-verbose prints the details of each check.
[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose
Performing pre-checks for cluster services setup
Checking node reachability...
Checking hosts config file...
Node Name Status
------------------------------------ ------------------------
RAC2 passed
RAC1 passed
Verification of the hosts config file successful
As the output shows, cluvfy verified the /etc/hosts file on both nodes.
Check: Node connectivity for interface "eth0"
Source Destination Connected?
------------------------------ ------------------------------ ----------------
RAC2[192.168.56.101] RAC1[192.168.56.100] yes
RAC2[192.168.56.101] RAC1[192.168.56.103] yes
RAC2[192.168.56.101] RAC1[192.168.56.12] yes
RAC2[192.168.56.101] RAC1[192.168.56.11] yes
RAC1[192.168.56.100] RAC1[192.168.56.103] yes
RAC1[192.168.56.100] RAC1[192.168.56.12] yes
RAC1[192.168.56.100] RAC1[192.168.56.11] yes
RAC1[192.168.56.103] RAC1[192.168.56.12] yes
RAC1[192.168.56.103] RAC1[192.168.56.11] yes
RAC1[192.168.56.12] RAC1[192.168.56.11] yes
Result: Node connectivity passed for interface "eth0"
Check: TCP connectivity of subnet "192.168.56.0"
Source Destination Connected?
------------------------------ ------------------------------ ----------------
RAC1:192.168.56.100 RAC2:192.168.56.101 passed
RAC1:192.168.56.100 RAC1:192.168.56.103 passed
RAC1:192.168.56.100 RAC1:192.168.56.12 passed
RAC1:192.168.56.100 RAC1:192.168.56.11 passed
Result: TCP connectivity check passed for subnet "192.168.56.0"
Check: Node connectivity for interface "eth1"
Source Destination Connected?
------------------------------ ------------------------------ ----------------
RAC2[10.10.10.2] RAC1[10.10.10.1] yes
Result: Node connectivity passed for interface "eth1"
Check: TCP connectivity of subnet "10.10.10.0"
Source Destination Connected?
------------------------------ ------------------------------ ----------------
RAC1:10.10.10.1 RAC2:10.10.10.2 passed
Result: TCP connectivity check passed for subnet "10.10.10.0"
Here cluvfy verified connectivity between the two nodes on both the public and the private network.
Checking ASMLib configuration.
Node Name Status
------------------------------------ ------------------------
RAC2 passed
RAC1 passed
Result: Check for ASMLib configuration passed.
Cluvfy also verified the ASMLib configuration on both nodes.
Cluvfy performs many more verifications than the ones shown; they are not all listed here.
When deploying Oracle RAC, it is recommended to run at least the following cluvfy verifications (example invocations are sketched after this list):
pre crsinst: before installing the clusterware (GI)
post crsinst: after installing the clusterware
pre dbinst: before installing the database software
pre dbcfg: before creating a database
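As a sketch, those four verifications might look like this for the two-node cluster used in this chapter (node names come from this environment; the -d option of the dbcfg check must point at the database home, shown here as a hypothetical path):
[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose
[grid@RAC1 ~]$ cluvfy stage -post crsinst -n RAC1,RAC2 -verbose
[oracle@RAC1 ~]$ cluvfy stage -pre dbinst -n RAC1,RAC2 -r 11gR2 -verbose
[oracle@RAC1 ~]$ cluvfy stage -pre dbcfg -n RAC1,RAC2 -d /u01/app/oracle/product/11.2.0/db_1 -verbose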
2. Components
Oracle RAC is made up of many components. When the user wants to run a health check against a single component, or just a few of them, component checks are the tool for the job; they make cluvfy considerably more complete and flexible. Cluvfy supports too many component checks to walk through one by one; the following command lists the complete set.
[grid@RAC1 ~]$ cluvfy comp -list
USAGE:
cluvfy comp <component-name> <component-specific options> [-verbose]
Valid Components are:
nodereach : checks reachability between nodes
nodecon : checks node connectivity
cfs : checks CFS integrity
ssa : checks shared storage accessibility
space : checks space availability
sys : checks minimum system requirements
clu : checks cluster integrity
clumgr : checks cluster manager integrity
ocr : checks OCR integrity
olr : checks OLR integrity
ha : checks HA integrity
freespace : checks free space in CRS Home
crs : checks CRS integrity
nodeapp : checks node applications existence
admprv : checks administrative privileges
peer : compares properties with peers
software : checks software distribution
acfs : checks ACFS integrity
asm : checks ASM integrity
gpnp : checks GPnP integrity
gns : checks GNS integrity
scan : checks SCAN configuration
ohasd : checks OHASD integrity
clocksync : checks Clock Synchronization
vdisk : checks Voting Disk configuration and UDEV settings
healthcheck : checks mandatory requirements and/or best practice recommendations
dhcp : checks DHCP configuration
dns : checks DNS configuration
The following command uses cluvfy to check the cluster's ASM component.
[grid@RAC1 ~]$ cluvfy comp asm -n RAC1,RAC2 -verbose
Verifying ASM Integrity
Task ASM Integrity check started...
Starting check to see if ASM is running on all cluster nodes...
ASM Running check passed. ASM is running on all specified nodes
Starting Disk Groups check to see if at least one Disk Group configured...
Disk Group Check passed. At least one Disk Group configured
Task ASM Integrity check passed...
Verification of ASM Integrity was successful.
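A few other component checks that come up frequently in day-to-day administration are ocr, clocksync, and scan. A minimal sketch, with the same options as the ASM example above:
[grid@RAC1 ~]$ cluvfy comp ocr -n RAC1,RAC2 -verbose
[grid@RAC1 ~]$ cluvfy comp clocksync -n RAC1,RAC2 -verbose
[grid@RAC1 ~]$ cluvfy comp scan -verbose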
3. Cluvfy's Fixup Feature
Starting with 11gR2, cluvfy offers a fixup feature, exposed through the -fixup option. The fixup feature records the problems found during a check and generates a shell script (runfixup.sh) containing the corresponding fixes, saving it under a given path.
To resolve the problems found by the check, the user then only needs to run the runfixup.sh script as root. Of course, some problems are too complex to be solved by a shell script, so the fixup feature can only resolve a subset of the problems, namely the "fixable" ones. The following shows a simple use of the -fixup option.
[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose -fixup
Performing pre-checks for cluster services setup
Checking node reachability...
Check: Time zone consistency
Result: Time zone consistency check passed
Fixup information has been generated for following node(s):
RAC2,RAC1
Please run the following script on each node as "root" user to execute the fixups:
'/tmp/CVU_11.2.0.4.0_grid/runfixup.sh'
Pre-check for cluster services setup was unsuccessful on all the nodes.
When cluvfy detects problems during the check, it generates the script needed to fix them and prompts you to run that script as root on every node of the cluster ('/tmp/CVU_11.2.0.4.0_grid/runfixup.sh' must be run on all nodes to repair the detected problems).
[root@RAC2 ~]# /tmp/CVU_11.2.0.4.0_grid/runfixup.sh
Response file being used is :/tmp/CVU_11.2.0.4.0_grid/fixup.response
Enable file being used is :/tmp/CVU_11.2.0.4.0_grid/fixup.enable
Log file location: /tmp/CVU_11.2.0.4.0_grid/orarun.log
uid=600(grid) gid=600(oinstall) groups=600(oinstall),601(asmadmin),602(asmdba),603(asmoper)
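After runfixup.sh has been executed on every node, a sensible next step is to re-run the same pre-check (without -fixup) and confirm that the fixable problems no longer appear:
[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose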
What follows is an excerpt from another blog, a case from a production environment, showing cluvfy checking the environment of a running RAC. If the configuration of the running environment does not meet the requirements, it is easy to imagine that the RAC will run into problems frequently:
Was there a problem with the RAC environment itself? My next step, then, was to check the RAC environment with Oracle's cluvfy tool, and it very quickly surfaced a more serious problem.
grid@ffpdb01:-bash:~$cluvfy comp sys -n all -p crs -verbose
Verifying system requirement
Check: Total memory
Node Name     Available                 Required                  Status
------------  ------------------------  ------------------------  ----------
ffpdb02       96GB (1.00663296E8KB)     2GB (2097152.0KB)         passed
ffpdb01       96GB (1.00663296E8KB)     2GB (2097152.0KB)         passed
Result: Total memory check passed
... ...
Check: Hard limits for "maximum open file descriptors"
Node Name     Type          Available     Required      Status
------------  ------------  ------------  ------------  ------------
ffpdb02       hard          8192          65536         failed
ffpdb01       hard          8192          65536         failed
Result: Hard limits check failed for "maximum open file descriptors"
... ...
Check: Kernel parameter for "tcp_smallest_anon_port"
Node Name     Current                   Required                  Status
------------  ------------------------  ------------------------  ----------
ffpdb02       32768                     9000                      failed (ignorable)
ffpdb01       32768                     9000                      failed (ignorable)
Result: Kernel parameter check failed for "tcp_smallest_anon_port"
Check: Kernel parameter for "tcp_largest_anon_port"
Node Name     Current                   Required                  Status
------------  ------------------------  ------------------------  ----------
ffpdb02       65535                     65500                     failed (ignorable)
ffpdb01       65535                     65500                     failed (ignorable)
Result: Kernel parameter check failed for "tcp_largest_anon_port"
Check: Kernel parameter for "udp_smallest_anon_port"
Node Name     Current                   Required                  Status
------------  ------------------------  ------------------------  ----------
ffpdb02       32768                     9000                      failed (ignorable)
ffpdb01       32768                     9000                      failed (ignorable)
Result: Kernel parameter check failed for "udp_smallest_anon_port"
Check: Kernel parameter for "udp_largest_anon_port"
Node Name     Current                   Required                  Status
------------  ------------------------  ------------------------  ----------
ffpdb02       65535                     65500                     failed (ignorable)
ffpdb01       65535                     65500                     failed (ignorable)
Result: Kernel parameter check failed for "udp_largest_anon_port"
... ...
Verification of system requirement was unsuccessful on all the specified nodes.
Good grief: the operating system kernel parameters and network parameters of this system did not meet the requirements for an Oracle RAC installation, which would seriously affect the normal operation of Oracle GI and RAC! This was very likely an even more important cause of the RAC outage. To be precise, the sudden surge of external application load, combined with the internal problems in the RAC environment shown above, jointly caused the outage.
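The tcp_/udp_*_anon_port tunables in the output suggest these nodes run Solaris, so a minimal sketch of the fixes might look like the following (a sketch under that assumption, not a verified procedure for this exact platform; ndd settings do not survive a reboot and would also have to be added to a startup script):
# as root, on each node: align the anonymous port ranges with the requirement
ndd -set /dev/tcp tcp_smallest_anon_port 9000
ndd -set /dev/tcp tcp_largest_anon_port 65500
ndd -set /dev/udp udp_smallest_anon_port 9000
ndd -set /dev/udp udp_largest_anon_port 65500
# raise the hard limit on open file descriptors (takes effect after a reboot)
echo "set rlim_fd_max=65536" >> /etc/system
After making the changes, re-running cluvfy comp sys -n all -p crs -verbose should show the previously failed checks passing.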