Objective: use corosync as the cluster messaging layer, pacemaker as the cluster resource manager (CRM), and pcs as the management interface to the CRM, in order to provide high availability for httpd.

Environment:
CentOS 6.9
Pacemaker 1.1.15
Corosync 1.4.7
pcs 0.9.155

Preparation:

  1. Configure passwordless SSH mutual trust between the two nodes (a sketch of steps 1 and 2 follows this list);
  2. Configure hostname resolution in /etc/hosts;
  3. Stop the firewall: service iptables stop
  4. Disable SELinux: setenforce 0
  5. Disable NetworkManager: chkconfig NetworkManager off and service NetworkManager stop
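A minimal sketch of steps 1 and 2. The node addresses 192.168.110.151/152 are assumptions for illustration; the notes only mention the VIP 192.168.110.150:

# On node1 (repeat in the opposite direction from node2):
ssh-keygen -t rsa             // accept the defaults, empty passphrase
ssh-copy-id root@node2        // copy the public key for passwordless login

# /etc/hosts on both nodes (IP addresses are placeholders):
192.168.110.151  node1
192.168.110.152  node2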
I. Software installation

corosync, pacemaker and pcs can be installed directly from the yum repositories:

yum install corosync pacemaker pcs -y

II. Start the pcsd service (on both nodes)

service pcsd start

[root@node1 ~]# service pcsd start
Starting pcsd:                                             [  OK  ]
[root@node1 ~]# ssh node2 "service pcsd start"
Starting pcsd:                                             [  OK  ]

III. Set a password for the hacluster account (on both nodes)

The hacluster user is created by the installation; give it a password so that pcs can authenticate against pcsd:

[root@node1 ~]# grep "hacluster" /etc/passwd
hacluster:x:496:493:heartbeat user:/var/lib/heartbeat/cores/hacluster:/sbin/nologin
[root@node2 ~]# passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@node2 ~]#
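If you prefer not to type the password interactively, a sketch of a non-interactive alternative (passwd --stdin is specific to RHEL/CentOS; the password string below is a placeholder you should replace):

echo 'RedHat123' | passwd --stdin hacluster                // on node1
ssh node2 "echo 'RedHat123' | passwd --stdin hacluster"    // same password on node2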

IV. Configure corosync: authenticate pcs against pcsd on the listed nodes, then generate the /etc/cluster/cluster.conf file.

[root@node1 ~]# pcs cluster auth node1 node2
Username: hacluster          // enter the account and password set above
Password:
node1: Authorized
Error: Unable to communicate with node2

[root@node1 ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@node1 ~]# getenforce
Disabled

[root@node1 ~]# pcs cluster auth node1 node2
Username: hacluster
Password:
node1: Authorized
node2: Authorized

pcsd (a ruby process) listens on TCP port 2224:

tcp        0      0 :::2224          :::*          LISTEN      2303/ruby
Note: the firewall must either be stopped or configured to allow TCP port 2224; otherwise pcs cluster auth fails as shown above.
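If you would rather keep iptables running than disable it entirely, a sketch of a rule that opens only the pcsd port (check it against your existing rule set):

iptables -I INPUT -p tcp --dport 2224 -j ACCEPT
service iptables save        // persist the rule on CentOS 6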

pcs cluster setup --name mycluster node1 node2 sets the cluster-wide parameters. It only succeeds once the basics are in place on both servers (pcsd running, NetworkManager stopped, and so on); it then generates the cluster.conf file.

[root@node1 corosync]# pcs cluster setup --name webcluster node1 node2 --force
Destroying cluster on nodes: node1, node2...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node1: Successfully destroyed cluster
node2: Successfully destroyed cluster

Sending cluster config files to the nodes...
node1: Updated cluster.conf...
node2: Updated cluster.conf...

Synchronizing pcsd certificates on nodes node1, node2...
node1: Success
node2: Success

Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node2: Success

[root@node1 corosync]# cat /etc/cluster/cluster.conf     // the generated configuration file
<cluster config_version="9" name="webcluster">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pcmk-method">
          <device name="pcmk-redirect" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <fence>
        <method name="pcmk-method">
          <device name="pcmk-redirect" port="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman broadcast="no" expected_votes="1" transport="udp" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk-redirect"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

V. Start the cluster

pcs cluster start --all          // make sure NetworkManager is stopped first

You can run pcs --debug cluster start --all to turn on debug output and inspect errors; in this setup the debug output is what pointed out that NetworkManager had to be stopped.
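The Daemon Status output below shows every daemon as active/disabled, i.e. not enabled at boot. If you do want the stack to come back automatically after a reboot, a possible follow-up (not part of the original test):

pcs cluster enable --all        // enable cman/pacemaker at boot on all nodes
chkconfig pcsd on               // pcsd itself is a plain SysV service on CentOS 6
ssh node2 "chkconfig pcsd on"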

VI. Check the cluster and node status:

pcs status
pcs status corosync
pcs status cluster

[root@node1 corosync]# pcs status
Cluster name: webcluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: cman
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Mon Apr 30 10:06:42 2018
Last change: Mon Apr 30 09:26:32 2018 by root via crmd on node2

2 nodes and 0 resources configured

Online: [ node1 node2 ]

No resources

Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

[root@node1 corosync]# pcs status corosync

Nodeid      Name
   1        node1
   2        node2

Check the configuration for errors; they all turn out to be STONITH-related:

crm_verify -L -V

Disable STONITH to make these errors go away:

pcs property set stonith-enabled=false        # clears these errors
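To confirm the property took effect, a quick check (not part of the original notes):

crm_verify -L -V            // should now return without STONITH errors
pcs property list           // stonith-enabled: false should appear among the set properties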

VII. Configure the services

  1. Create the VIP resource:

pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.110.150 cidr_netmask=24 op monitor interval=30s

Use pcs status to check whether the resource has started. Note: in this test the cidr_netmask had to match the netmask of the physical NIC, otherwise the resource would not start (see the check sketched below).
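If the VIP does not come up, a sketch of how to compare the netmask with the NIC and correct the resource in place (eth0 is an assumption; substitute your interface name):

ip addr show eth0                         // check the /NN prefix configured on the NIC
pcs resource show vip                     // inspect the ip and cidr_netmask parameters
pcs resource update vip cidr_netmask=24   // adjust the value if it does not match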

  2. Create the httpd resource. There are two ways to do this: ocf:heartbeat:apache or lsb:httpd. With the former, httpd had to be started manually on both servers, while with the latter the service is started by the pacemaker cluster itself (a sketch of the ocf:heartbeat:apache variant follows).

pcs resource create web lsb:httpd op monitor interval=20s

pcs status now shows the resource as started.
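For reference, a sketch of the ocf:heartbeat:apache variant, which was not used in these notes. It assumes the default config path and that the Apache server-status page (mod_status) is reachable, since the agent's monitor checks a status URL:

pcs resource create web ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf \
    statusurl="http://127.0.0.1/server-status" \
    op monitor interval=20s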

At the same time, you can run service httpd status on the active node to confirm the service is running, and ip addr to confirm that the node holds the VIP.
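A quick end-to-end check (the curl test is an addition of mine, not part of the original notes):

ip addr show | grep 192.168.110.150      // the VIP should appear on the active node
service httpd status                     // httpd should be running on the same node
curl -I http://192.168.110.150/          // the site should answer over the VIP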

VIII. Configure resource constraints

  1. Ordering constraint: vip must start before web.

pcs constraint order vip then web
  2. Location constraints: the resources should prefer node1. Set the preference of vip/web for node1 to 150 and for node2 to 50:

pcs constraint location web prefers node1=150
pcs constraint location vip prefers node1=150
pcs constraint location web prefers node2=50
pcs constraint location vip prefers node2=50

[root@node1 ~]# pcs constraint

Location Constraints:
  Resource: vip
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)
  Resource: web
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)

Note: if several resources that must run together to provide the service end up spread over different nodes, the cluster cannot serve clients. In this test the cluster only worked correctly after the node1 preference of both web and vip had been raised to 150; otherwise the two resources could land on different nodes and the service broke.

  3. Resource group: the group only fails over as a unit once the members' location preferences are set to the same values:

pcs resource group add mygroup vip web

[root@node2 ~]# pcs status groups
mygroup: vip web
[root@node1 ~]# pcs resource
 Resource Group: httpgroup
     vip        (ocf::heartbeat:IPaddr2):       Started node1
     web        (lsb:httpd):    Started node1

[root@node1 ~]# crm_simulate -sL

Current cluster status:
Online: [ node1 node2 ]

 Resource Group: httpgroup
     vip        (ocf::heartbeat:IPaddr2):       Started node1
     web        (lsb:httpd):    Started node1

Allocation scores:
group_color: httpgroup allocation score on node1: 0
group_color: httpgroup allocation score on node2: 0
group_color: vip allocation score on node1: 100
group_color: vip allocation score on node2: 50
group_color: web allocation score on node1: 100
group_color: web allocation score on node2: 50
native_color: web allocation score on node1: 200
native_color: web allocation score on node2: 100
native_color: vip allocation score on node1: 400
native_color: vip allocation score on node2: 150

The location preference can also be set for the whole resource group instead of for each member, for example:

pcs constraint location httpgroup prefers node2=100
pcs constraint location httpgroup prefers node1=200

[root@node1 ~]# pcs constraint
Location Constraints:
  Resource: httpgroup
    Enabled on: node1 (score:200)
    Enabled on: node2 (score:100)
  Resource: vip
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)
  Resource: web
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)
Ordering Constraints:
  start vip then start web (kind:Mandatory)

  4. Colocation constraint: keep vip and web running together, with a score of 100:

[root@node1 ~]# pcs constraint colocation add vip with web 100
[root@node1 ~]# pcs constraint show
Location Constraints:
  Resource: httpgroup
    Enabled on: node1 (score:200)
    Enabled on: node2 (score:100)
  Resource: vip
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)
  Resource: web
    Enabled on: node1 (score:100)
    Enabled on: node2 (score:50)
Ordering Constraints:
  start vip then start web (kind:Mandatory)
Colocation Constraints:
  vip with web (score:100)
Ticket Constraints:

IX. Switching the resources to the other node

pcs constraint location web prefers node1=100     // raising web's preference for node1 to 100 moves the resource from node2 back to node1; you can adjust httpgroup as a whole, or adjust the node2 preference of web and vip instead

May  1 09:43:02 node1 crmd[2965]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
May  1 09:43:02 node1 pengine[2964]: warning: Processing failed op monitor for web on node2: not running (7)
May  1 09:43:02 node1 pengine[2964]: notice: Move web#011(Started node2 -> node1)
May  1 09:43:02 node1 pengine[2964]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-57.bz2
May  1 09:43:02 node1 crmd[2965]: notice: Initiating stop operation web_stop_0 on node2 | action 6
May  1 09:43:02 node1 crmd[2965]: notice: Initiating start operation web_start_0 locally on node1 | action 7
May  1 09:43:03 node1 lrmd[2962]: notice: web_start_0:3682:stderr [ httpd: Could not reliably determine the server's fully qualified domain name, using node1.yang.com for ServerName ]
May  1 09:43:03 node1 crmd[2965]: notice: Result of start operation for web on node1: 0 (ok) | call=12 key=web_start_0 confirmed=true cib-update=42
May  1 09:43:03 node1 crmd[2965]: notice: Initiating monitor operation web_monitor_20000 locally on node1 | action 8
May  1 09:43:03 node1 crmd[2965]: notice: Transition 4 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-57.bz2): Complete
May  1 09:43:03 node1 crmd[2965]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
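Besides editing location scores, pcs also has dedicated commands for moving a resource and for inspecting or removing the constraints this creates; a sketch not used in the original test (the constraint ID is a placeholder taken from the pcs constraint --full output):

pcs resource move web node2               // move web (creates a temporary location constraint)
pcs constraint --full                     // list constraints together with their IDs
pcs constraint remove <constraint-id>     // delete the constraint created by the move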

X. Remaining issue

After the web resource was configured, one error kept showing up that was never resolved:

[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: cman
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Tue May  1 11:13:00 2018
Last change: Tue May  1 11:04:20 2018 by root via cibadmin on node1

2 nodes and 2 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Resource Group: httpgroup
     vip        (ocf::heartbeat:IPaddr2):       Started node1
     web        (lsb:httpd):    Started node1

Failed Actions:                                     // this looks monitor-related, but the cause was never pinned down
* web_monitor_20000 on node2 'not running' (7): call=11, status=complete, exitreason='none',
    last-rc-change='Tue May  1 09:41:09 2018', queued=0ms, exec=16ms

Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled
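The entry is a record of a past monitor failure on node2 (for example httpd having been stopped outside the cluster while the recurring monitor was still active there). If the resource is otherwise healthy, clearing the failure history should at least make the message disappear; a sketch:

pcs resource cleanup web          // clears the fail count and the Failed Actions entry
pcs status                        // the warning should be gone unless the monitor fails again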

Command reference: pcs cluster covers pcsd authentication, cluster-wide parameters, starting cluster nodes, removing nodes, and so on.

pcs cluster stop node1            // shut down the cluster stack on node1

[root@node2 ~]# pcs status
Cluster name: mycluster
Stack: cman
Current DC: node2 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Sat Apr 28 02:23:00 2018
Last change: Sat Apr 28 02:16:12 2018 by root via cibadmin on node2

2 nodes and 2 resources configured

Online: [ node2 ]
OFFLINE: [ node1 ]

Full list of resources:

 Resource Group: mygroup
     vip        (ocf::heartbeat:IPaddr2):       Started node2
     web        (lsb:httpd):    Started node2

Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

Because the cluster stack on node1 has been shut down, it cannot be brought back from node2; it has to be started on node1 itself:

[root@node1 ~]# pcs status
Error: cluster is not currently running on this node
[root@node1 ~]# pcs cluster start node1
node1: Starting Cluster...

pcs resource: resource-related commands, including creating, deleting, enabling and describing resources.
pcs constraint: commands for configuring resource constraints.
pcs status: commands for viewing resource and cluster status.
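A few examples of the pcs resource sub-commands mentioned above (a sketch; these were not run as part of these notes):

pcs resource disable web          // stop the resource but keep its definition
pcs resource enable web           // start it again
pcs resource describe lsb:httpd   // show the parameters a resource agent accepts
pcs resource delete web           // remove the resource from the cluster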