The cluvfy Tool

cluvfy (Cluster Verify), also known as CVU (Cluster Verification Utility), is a checking tool shipped together with Oracle's cluster management software. Its job is to examine each phase of a cluster deployment and each cluster component, and to verify that they meet Oracle's requirements. cluvfy offers a very broad range of checks, covering OS and hardware, kernel parameter settings, user resource limits, network settings, NTP settings, RAC component health, and more. cluvfy never modifies the system configuration while it runs, so it has no impact on the system. cluvfy is already integrated with many RAC configuration tools; for example, step 14 of the OUI GI installation (the prerequisite check before the installation proper) invokes cluvfy to check the system before the cluster is installed, as shown in the figure below, and the last step of the GI installation invokes cluvfy again for a post-installation check. cluvfy's checks are organized from two angles: stages and components.

[Figure: step 14 of the OUI GI installation, the prerequisite check performed by cluvfy]
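Note that before GI is installed there is no cluvfy binary on the cluster nodes yet, so the pre-installation checks are usually run through the runcluvfy.sh wrapper script found in the root directory of the GI installation media. A minimal sketch, assuming the media has been unpacked under /home/grid/grid (a hypothetical path):

[grid@RAC1 ~]$ cd /home/grid/grid

[grid@RAC1 grid]$ ./runcluvfy.sh stage -pre crsinst -n RAC1,RAC2 -verbose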

 

1. Stages

cluvfy divides the whole process of deploying a RAC system into a number of stages. For example, before installing GI, the OS, network, and storage must be prepared; once that preparation is finished, the GI installation itself can begin, so the preparation period can be regarded as the pre-GI-installation stage.

Before moving into the next stage, or after a given stage has finished, cluvfy provides a stage check tailored to the characteristics of that stage, so the user can confirm that the system is ready to move on, or that the work of the current stage is truly complete. The following command lists all the stage checks cluvfy supports (note that stage checks exist both before and after the GI installation).

[grid@RAC1 ~]$ cluvfy stage -list

USAGE:

cluvfy stage {-pre|-post} <stage-name> <stage-specific options>  [-verbose]

Valid Stages are:

      -pre cfs        : pre-check for CFS setup

      -pre crsinst    : pre-check for CRS installation

      -pre acfscfg    : pre-check for ACFS Configuration.

      -pre dbinst     : pre-check for database installation

      -pre dbcfg      : pre-check for database configuration

      -pre hacfg      : pre-check for HA configuration

      -pre nodeadd    : pre-check for node addition.

      -post hwos      : post-check for hardware and operating system

      -post cfs       : post-check for CFS setup

      -post crsinst   : post-check for CRS installation

      -post acfscfg   : post-check for ACFS Configuration.

      -post hacfg     : post-check for HA configuration

      -post nodeadd   : post-check for node addition.

      -post nodedel   : post-check for node deletion.
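For example, once the hardware and OS preparation is finished, the hwos post-check is typically the first one to run. A minimal sketch using this article's node names:

[grid@RAC1 ~]$ cluvfy stage -post hwos -n RAC1,RAC2 -verbose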

 

The following cluvfy command performs the system checks required before installing the cluster.

[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose

where:

-n specifies the list of nodes to check; user equivalence (passwordless SSH) must already be configured among all the listed nodes.

-r specifies the release of the software to be installed; the supported releases are 10gR1, 10gR2, 11gR1, and 11gR2.

-verbose prints the details of each check.

 

[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose

Performing pre-checks for cluster services setup

Checking node reachability...

Checking hosts config file...

  Node Name                            Status                  

  ------------------------------------  ------------------------

  RAC2                                  passed                  

  RAC1                                  passed  

Verification of the hosts config file successful

As the output shows, cluvfy verified the /etc/hosts file on both nodes.

 

Check: Node connectivity for interface "eth0"

  Source                          Destination                     Connected?      

  ------------------------------  ------------------------------  ----------------

  RAC2[192.168.56.101]            RAC1[192.168.56.100]            yes             

  RAC2[192.168.56.101]            RAC1[192.168.56.103]            yes             

  RAC2[192.168.56.101]            RAC1[192.168.56.12]             yes             

  RAC2[192.168.56.101]            RAC1[192.168.56.11]             yes             

  RAC1[192.168.56.100]            RAC1[192.168.56.103]            yes             

  RAC1[192.168.56.100]            RAC1[192.168.56.12]             yes             

  RAC1[192.168.56.100]            RAC1[192.168.56.11]             yes             

  RAC1[192.168.56.103]            RAC1[192.168.56.12]             yes             

  RAC1[192.168.56.103]            RAC1[192.168.56.11]             yes             

  RAC1[192.168.56.12]             RAC1[192.168.56.11]             yes             

Result: Node connectivity passed for interface "eth0"

 

Check: TCP connectivity of subnet "192.168.56.0"

  Source                          Destination                     Connected?      

  ------------------------------  ------------------------------  ----------------

  RAC1:192.168.56.100             RAC2:192.168.56.101             passed          

  RAC1:192.168.56.100             RAC1:192.168.56.103             passed          

  RAC1:192.168.56.100             RAC1:192.168.56.12              passed          

  RAC1:192.168.56.100             RAC1:192.168.56.11              passed          

Result: TCP connectivity check passed for subnet "192.168.56.0"

 

Check: Node connectivity for interface "eth1"

  Source                          Destination                     Connected?      

  ------------------------------  ------------------------------  ----------------

  RAC2[10.10.10.2]                RAC1[10.10.10.1]                yes             

Result: Node connectivity passed for interface "eth1"

 

Check: TCP connectivity of subnet "10.10.10.0"

  Source                          Destination                     Connected?      

  ------------------------------  ------------------------------  ----------------

  RAC1:10.10.10.1                 RAC2:10.10.10.2                 passed          

Result: TCP connectivity check passed for subnet "10.10.10.0"

As the output shows, cluvfy verified the connectivity between the two nodes on both the public and the private network.

 

Checking ASMLib configuration.

  Node Name                             Status                  

  ------------------------------------  ------------------------

  RAC2                                  passed                  

  RAC1                                  passed                  

Result: Check for ASMLib configuration passed.

cluvfy also verified the ASMLib configuration on both nodes.

 

cluvfy performs many more checks; they are not all listed here.

During an Oracle RAC deployment, it is recommended to run at least the following cluvfy checks (see the command sketch after this list):

pre crsinst: before installing the cluster management software (GI)

post crsinst: after installing the cluster management software

pre dbinst: before installing the database software

pre dbcfg: before creating the database
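A minimal command sketch of these four recommended checks, assuming the same two-node cluster as above (for pre dbcfg the -d option must point at the database home; the path shown here is hypothetical):

[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose

[grid@RAC1 ~]$ cluvfy stage -post crsinst -n RAC1,RAC2 -verbose

[oracle@RAC1 ~]$ cluvfy stage -pre dbinst -n RAC1,RAC2 -r 11gR2 -verbose

[oracle@RAC1 ~]$ cluvfy stage -pre dbcfg -n RAC1,RAC2 -d /u01/app/oracle/product/11.2.0/dbhome_1 -verbose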

 

 

2. Components

Oracle RAC is made up of many components. When the user wants a health check of a single component, or of just a few of them, component checks are the tool for the job; they make cluvfy considerably more complete and flexible. cluvfy supports a large number of component checks, too many to describe one by one; the following command prints the complete list.

[grid@RAC1 ~]$ cluvfy comp -list

USAGE:

cluvfy comp  <component-name> <component-specific options>  [-verbose]

 

Valid Components are:

      nodereach       : checks reachability between nodes

      nodecon         : checks node connectivity

      cfs             : checks CFS integrity

      ssa             : checks shared storage accessibility

      space           : checks space availability

      sys             : checks minimum system requirements

      clu             : checks cluster integrity

      clumgr          : checks cluster manager integrity

      ocr             : checks OCR integrity

      olr             : checks OLR integrity

      ha              : checks HA integrity

      freespace       : checks free space in CRS Home

      crs             : checks CRS integrity

      nodeapp         : checks node applications existence

      admprv          : checks administrative privileges

      peer            : compares properties with peers

      software        : checks software distribution

      acfs            : checks ACFS integrity

      asm             : checks ASM integrity

      gpnp            : checks GPnP integrity

      gns             : checks GNS integrity

      scan            : checks SCAN configuration

      ohasd           : checks OHASD integrity

      clocksync       : checks Clock Synchronization

      vdisk           : checks Voting Disk configuration and UDEV settings

      healthcheck     : checks mandatory requirements and/or best practice recommendations

      dhcp            : checks DHCP configuration

      dns             : checks DNS configuration

 

 

The following command has cluvfy check the cluster's ASM component.

[grid@RAC1 ~]$ cluvfy comp asm -n RAC1,RAC2 -verbose

Verifying ASM Integrity

Task ASM Integrity check started...

Starting check to see if ASM is running on all cluster nodes...

ASM Running check passed. ASM is running on all specified nodes

Starting Disk Groups check to see if at least one Disk Group configured...

Disk Group Check passed. At least one Disk Group configured

Task ASM Integrity check passed...

Verification of ASM Integrity was successful.
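Other component checks follow the same pattern. A minimal sketch of a few commonly used ones, using this article's node names (the exact options vary by release, so consult cluvfy comp <component> -help on your own system):

[grid@RAC1 ~]$ cluvfy comp ssa -n RAC1,RAC2 -verbose

[grid@RAC1 ~]$ cluvfy comp clocksync -n RAC1,RAC2 -verbose

[grid@RAC1 ~]$ cluvfy comp healthcheck -collect cluster -bestpractice -html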

 

 

3. cluvfy's Fixup Feature

Starting with 11gR2, cluvfy provides a fixup feature via the -fixup option. With fixup enabled, cluvfy records the problems found during a check, generates a shell script (runfixup.sh) containing the corresponding fixes, and saves it under a designated path.

To resolve the problems found, the user only has to run the runfixup.sh script as root. Of course, some problems are too complex to be solved by a shell script, so cluvfy's fixup can only handle a subset of issues, the so-called fixable problems. The following shows a simple use of the -fixup option.

[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose -fixup

Performing pre-checks for cluster services setup

Checking node reachability...

Check: Time zone consistency

Result: Time zone consistency check passed

Fixup information has been generated for following node(s):

RAC2,RAC1

Please run the following script on each node as "root" user to execute the fixups:

'/tmp/CVU_11.2.0.4.0_grid/runfixup.sh'

Pre-check for cluster services setup was unsuccessful on all the nodes.

 

When cluvfy detects problems during the check, it generates the script needed to resolve them and prompts you to run that script as the root user on every node of the cluster (here '/tmp/CVU_11.2.0.4.0_grid/runfixup.sh', which must be run on all nodes to fix the detected problems).

[root@RAC2 ~]# /tmp/CVU_11.2.0.4.0_grid/runfixup.sh

Response file being used is :/tmp/CVU_11.2.0.4.0_grid/fixup.response

Enable file being used is :/tmp/CVU_11.2.0.4.0_grid/fixup.enable

Log file location: /tmp/CVU_11.2.0.4.0_grid/orarun.log

uid=600(grid) gid=600(oinstall) groups=600(oinstall),601(asmadmin),602(asmdba),603(asmoper)
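After runfixup.sh has been executed as root on every node, it is good practice to repeat the same pre-check to confirm that the fixable problems are really gone:

[grid@RAC1 ~]$ cluvfy stage -pre crsinst -n RAC1,RAC2 -r 11gR2 -verbose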

 

 

The following is excerpted from another blog: a case from a production environment, where cluvfy was used to check the environment a running RAC depends on. If that environment does not meet the requirements, it is easy to see why the RAC would run into frequent problems:

Was there something wrong with the RAC environment itself? So I went on to check the RAC environment with Oracle's cluvfy tool, and very quickly uncovered more serious problems.

grid@ffpdb01:-bash:~$ cluvfy comp sys -n all -p crs -verbose

Verifying system requirement

Check: Total memory

Node Name     Available                 Required                  Status

------------  ------------------------  ------------------------  ----------

ffpdb02       96GB (1.00663296E8KB)     2GB (2097152.0KB)         passed

ffpdb01       96GB (1.00663296E8KB)     2GB (2097152.0KB)         passed

Result: Total memory check passed

 

… …

Check: Hard limits for "maximum open file descriptors"

Node Name         Type          Available     Required      Status

----------------  ------------  ------------  ------------  ----------------

ffpdb02           hard          8192          65536         failed

ffpdb01           hard          8192          65536         failed

Result: Hard limits check failed for "maximum open file descriptors"

 

… …

Check: Kernel parameter for "tcp_smallest_anon_port"

Node Name     Current                   Required                  Status

------------  ------------------------  ------------------------  ----------

ffpdb02       32768                     9000                      failed (ignorable)

ffpdb01       32768                     9000                      failed (ignorable)

Result: Kernel parameter check failed for "tcp_smallest_anon_port"

 

Check: Kernel parameter for "tcp_largest_anon_port"

Node Name     Current                   Required                  Status

------------  ------------------------  ------------------------  ----------

ffpdb02       65535                     65500                     failed (ignorable)

ffpdb01       65535                     65500                     failed (ignorable)

Result: Kernel parameter check failed for "tcp_largest_anon_port"

 

Check: Kernel parameter for "udp_smallest_anon_port"

Node Name     Current                   Required                  Status

------------  ------------------------  ------------------------  ----------

ffpdb02       32768                     9000                      failed (ignorable)

ffpdb01       32768                     9000                      failed (ignorable)

Result: Kernel parameter check failed for "udp_smallest_anon_port"

 

Check: Kernel parameter for "udp_largest_anon_port"

Node Name     Current                   Required                  Status

------------  ------------------------  ------------------------  ----------

ffpdb02       65535                     65500                     failed (ignorable)

ffpdb01       65535                     65500                     failed (ignorable)

Result: Kernel parameter check failed for "udp_largest_anon_port"

… …

Verification of system requirement was unsuccessful on all the specified nodes.

Good grief! The operating system kernel parameters and network parameters of this system did not meet the Oracle RAC installation requirements at all, which would seriously affect the normal operation of Oracle GI and RAC. This was very likely a more important cause of the RAC outage. To be precise, the sudden surge in external application load, combined with the internal problems of the RAC environment shown above, jointly caused the outage.
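Parameters such as tcp_smallest_anon_port indicate that these hosts run Solaris. As a hedged sketch of how findings like these are typically corrected on Solaris (the values come straight from the Required column above; always confirm the exact settings against the installation guide for your release):

# Raise the hard limit for open file descriptors: add this line to
# /etc/system, then reboot.
set rlim_fd_max=65536

# Adjust the anonymous (ephemeral) port ranges. These ndd settings take
# effect immediately but are lost on reboot unless added to a boot script.
ndd -set /dev/tcp tcp_smallest_anon_port 9000
ndd -set /dev/tcp tcp_largest_anon_port 65500
ndd -set /dev/udp udp_smallest_anon_port 9000
ndd -set /dev/udp udp_largest_anon_port 65500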