Introduction to High Availability

When people think of high availability, the relatively simple Keepalived may come to mind, or the older Heartbeat, and Corosync+Pacemaker is also commonly used. So what are the differences between them?

Starting with version 3, Heartbeat was split into several sub-projects: Heartbeat, cluster-glue, Resource Agents, and Pacemaker.

Heartbeat: only maintains membership information for the cluster nodes and handles the communication between them.

cluster-glue: acts as an intermediate layer that connects Heartbeat and the CRM (Pacemaker); it consists mainly of two parts, the LRM and STONITH.

Resource Agents: a collection of scripts used to start, stop, and monitor services. These scripts are invoked by the LRM to start, stop, and monitor the various resources.

Pacemaker: the resource manager split out of the original Heartbeat, acting as the control center of the whole HA stack; clients use Pacemaker to configure, manage, and monitor the entire cluster. It does not provide the underlying heartbeat messaging itself; to communicate with peer nodes it relies on a lower-level messaging layer (the stripped-down Heartbeat, or Corosync) to deliver its messages.

Introduction to Pacemaker

What is Pacemaker

Pacemaker is a cluster resource manager. It achieves maximum availability for cluster services (i.e. failover) by detecting and recovering from node- and resource-level failures, using the messaging and membership capabilities provided by your preferred cluster infrastructure (Corosync or Heartbeat).

Architecture

(figure: Pacemaker architecture)

Internal Components

(figure: Pacemaker internal components)

  • CIB (Cluster Information Base)
  • The CIB uses XML to represent the cluster's configuration and the current state of all resources in the cluster
  • The contents of the CIB are automatically kept in sync across the entire cluster
  • The PEngine uses it to compute the ideal state of the cluster and how to bring it about
  • The resulting list of instructions is fed to the DC (Designated Co-ordinator)
  • CRMd (Cluster Resource Management daemon)
  • Pacemaker centralizes all cluster decision-making by electing one of the CRMd instances to act as the master
  • Should the elected CRMd process, or the node it is on, fail, a new one is quickly established
  • DC (Designated Co-ordinator)
  • The DC carries out the PEngine's instructions in the required order
  • It passes them, via the cluster messaging infrastructure, to the LRMd (Local Resource Management daemon) on other nodes, or to its CRMd peers (which in turn relay them to their local LRMd)
  • The peer nodes report the results of all operations back to the DC; based on the expected and actual results, it either executes any actions that were waiting, or aborts processing and asks the PEngine to recompute the ideal cluster state from the unexpected results
  • PEngine (aka. PE, the Policy Engine)
  • STONITHd
  • In some situations it may be necessary to power off a node in order to protect shared data integrity or to fully recover a resource. For this, Pacemaker provides STONITHd.
  • STONITH is an acronym for Shoot-The-Other-Node-In-The-Head and is usually implemented with a remote power switch
  • N-to-N architecture
    (figure: N-to-N redundancy architecture)

Update:

  • In the HA stack, corosync sits at the messaging layer and is used to detect whether communication between the nodes is healthy, while pacemaker manages the cluster resources. When using corosync and pacemaker we usually manage both through a single unified tool, such as the older crmsh or the newer pcs.
  • The benefit of managing the cluster with crmsh or pcs is that we do not have to work directly against the configuration files; we manage the cluster nodes from the command line, which reduces the errors caused by hand-editing configuration files. Another benefit is the lower learning curve: instead of learning the corosync and pacemaker configuration details, we only need to learn how to use crmsh or pcs (see the short comparison below).
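
For example, the same day-to-day checks can be done with either tool. crmsh is not installed or used elsewhere in this walkthrough; its commands are shown here only for comparison and assume the crmsh package is available:

# with pcs
pcs status
pcs property list

# with crmsh
crm status
crm configure show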

Environment Preparation and Mutual Trust

Environment:

  • OS version:
[root@node0 corosync]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
  • IP addresses:
node0 192.168.0.70
node1 192.168.0.71
node2 192.168.0.72

Permanently disable the firewall (including its auto-start) and SELinux

[ALL] (run on every node)

systemctl stop firewalld.service
systemctl disable firewalld.service
systemctl status firewalld.service

setenforce 0
sed -i '/^SELINUX=/c\SELINUX=disabled' /etc/selinux/config

Configure mutual SSH trust

  • node0
ssh-keygen  -t   rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
  • node1
ssh-keygen  -t   rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node0
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2
  • node2
ssh-keygen  -t   rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node0
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
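
To confirm that passwordless SSH works in every direction, a quick check like the following can be run on each node (a simple illustrative loop, not part of the original steps):

for h in node0 node1 node2; do ssh root@$h hostname; done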

Install corosync and pacemaker/pcs

  • [ALL] Run the install command on every node
yum -y install corosync pacemaker  pcs resource-agents
  • [ALL] Start the pcsd service and enable it at boot
systemctl start pcsd.service
systemctl enable pcsd.service
  • [ALL] Set the hacluster password

The hacluster user is created when the packages are installed and is used by pcs to talk to the local pcsd process, so we need to set its password; use the same password on every node.

echo hacluster | passwd --stdin hacluster
  • [ONE] View the files installed by the pacemaker package (run on a single node)
[root@node0 ~]# rpm -ql pacemaker
  • [ONE] View the files installed by the corosync package
[root@node0 ~]# rpm -ql corosync
  • [ONE] Authenticate the nodes (run on one node only)

In this example it is run on node0.

[root@node0 corosync]# pcs  cluster auth node0 node1 node2 -u hacluster -p hacluster --force
  • [ONE] Generate the corosync configuration file (this can be run on any one node)
[root@node2 corosync]#
[root@node2 corosync]# pcs cluster setup --name cluster_test01 node0 node1 node2
Destroying cluster on nodes: node0, node1, node2...
node2: Stopping Cluster (pacemaker)...
node0: Stopping Cluster (pacemaker)...
node1: Stopping Cluster (pacemaker)...
node1: Successfully destroyed cluster
node0: Successfully destroyed cluster
node2: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node0', 'node1', 'node2'
node0: successful distribution of the file 'pacemaker_remote authkey'
node2: successful distribution of the file 'pacemaker_remote authkey'
node1: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node0: Succeeded
node1: Succeeded
node2: Succeeded

Synchronizing pcsd certificates on nodes node0, node1, node2...
node1: Success
node0: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node0: Success
node2: Success

[root@node0 corosync]# ll /etc/corosync/corosync.conf
-rw-r--r--. 1 root root 435 Jul 19 23:39 /etc/corosync/corosync.conf
[root@node0 corosync]#
[root@node1 corosync]# ll /etc/corosync/corosync.conf
-rw-r--r--. 1 root root 435 Jul 19 23:39 /etc/corosync/corosync.conf
[root@node2 corosync]# ll /etc/corosync/corosync.conf
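
For reference, a corosync.conf generated by pcs on CentOS 7 typically looks roughly like the following; the exact contents depend on the pcs version and options used, so treat this only as an illustrative sketch rather than the literal file from this cluster:

totem {
    version: 2
    cluster_name: cluster_test01
    secauth: off
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node0
        nodeid: 1
    }
    node {
        ring0_addr: node1
        nodeid: 2
    }
    node {
        ring0_addr: node2
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}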
  • Start nodes in the cluster
  • Start node1 only
[root@node0 corosync]# pcs cluster start node1
node1: Starting Cluster (corosync)...
node1: Starting Cluster (pacemaker)...
[root@node0 corosync]#
[root@node1 corosync]# ps -ef |grep coro
root 10691 1 8 23:42 ? 00:00:00 corosync
root 10716 9461 0 23:42 pts/1 00:00:00 grep --color=auto coro
[root@node1 corosync]# ps -ef |grep pace
root 10706 1 1 23:42 ? 00:00:00 /usr/sbin/pacemakerd -f
haclust+ 10707 10706 1 23:42 ? 00:00:00 /usr/libexec/pacemaker/cib
root 10708 10706 0 23:42 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 10709 10706 0 23:42 ? 00:00:00 /usr/libexec/pacemaker/lrmd
haclust+ 10710 10706 0 23:42 ? 00:00:00 /usr/libexec/pacemaker/attrd
haclust+ 10711 10706 0 23:42 ? 00:00:00 /usr/libexec/pacemaker/pengine
haclust+ 10712 10706 0 23:42 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 10718 9461 0 23:42 pts/1 00:00:00 grep --color=auto pace
[root@node1 corosync]#
  • Check the node status
[root@node0 corosync]# pcs cluster status
Error: cluster is not currently running on this node

[root@node1 corosync]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: node1 (version 1.1.21-4.el7-f14e36fd43)
  • Start all nodes
[root@node0 corosync]# pcs cluster start --all
node1: Starting Cluster (corosync)...
node0: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node0: Starting Cluster (pacemaker)...
node1: Starting Cluster (pacemaker)...
node2: Starting Cluster (pacemaker)...

[root@node0 corosync]# pcs status
Cluster name: cluster_test01

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Sun Jul 19 23:55:09 2020
Last change: Sun Jul 19 23:47:47 2020 by hacluster via crmd on node1

3 nodes configured
0 resources configured

Online: [ node0 node2 ]
OFFLINE: [ node1 ]
  • Resolve the warning
WARNINGS:
No stonith devices and stonith-enabled is not false
[root@node0 corosync]# pcs property set stonith-enabled=false
[root@node0 corosync]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Sun Jul 19 23:59:13 2020
Last change: Sun Jul 19 23:57:13 2020 by root via cibadmin on node0

3 nodes configured
0 resources configured

Online: [ node0 node1 node2 ]
  • Check corosync status
[root@node0 corosync]# pcs status corosync

Membership information
----------------------
Nodeid Votes Name
1 1 node0 (local)
  • Check the pacemaker processes
[root@node0 corosync]# ps axf |grep pacemaker
5003 pts/2 S+ 0:00 \_ grep --color=auto pacemaker
4792 ? Ss 0:00 /usr/sbin/pacemakerd -f
4793 ? Ss 0:00 \_ /usr/libexec/pacemaker/cib
4794 ? Ss 0:00 \_ /usr/libexec/pacemaker/stonithd
4795 ? Ss 0:00 \_ /usr/libexec/pacemaker/lrmd
4796 ? Ss 0:00 \_ /usr/libexec/pacemaker/attrd
4797 ? Ss 0:00 \_ /usr/libexec/pacemaker/pengine
4798 ? Ss 0:00 \_ /usr/libexec/pacemaker/crmd
  • Verify the configuration
[root@node0 corosync]# pcs property set stonith-enabled=true
[root@node0 corosync]# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
[root@node0 corosync]# pcs property set stonith-enabled=false
[root@node0 corosync]#
[root@node0 corosync]# crm_verify -L -V
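
Disabling STONITH is fine for a lab environment, but for production clusters with shared data a fence device should normally be configured instead. A rough sketch only; the fence agent, BMC address, and credentials below are placeholders for illustration:

pcs stonith list                                  # list available fence agents
pcs stonith describe fence_ipmilan                # show the parameters a given agent accepts
pcs stonith create fence_node1 fence_ipmilan pcmk_host_list=node1 ipaddr=192.168.0.201 login=admin passwd=admin lanplus=1 op monitor interval=60s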
  • Create the VIP resource
[root@node0 corosync]# pcs resource create VIP ocf:heartbeat:IPaddr2 ip=192.168.0.75 cidr_netmask=32 op monitor interval=30s
[root@node0 corosync]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 00:22:48 2020
Last change: Mon Jul 20 00:22:38 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@node0 corosync]#
[root@node0 corosync]#
[root@node0 corosync]# ip ad list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:37:1b:18 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.70/24 brd 192.168.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 192.168.0.75/32 brd 192.168.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4427:bd05:1cf9:1f4f/64 scope link tentative noprefixroute dadfailed
valid_lft forever preferred_lft forever
inet6 fe80::19de:291a:ae81:cfd7/64 scope link

View the resource types Pacemaker supports by default

  • List the resource standards (classes)
[root@node0 /]# pcs resource standards
lsb
ocf
service
  • List the available OCF resource providers
[root@node0 /]# pcs resource providers
  • List the agents available under a given standard, for example the scripts under ocf:heartbeat
[root@node0 /]# pcs resource agents ocf:heartbeat
  • Put a node into standby
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 02:07:41 2020
Last change: Mon Jul 20 00:22:38 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@node0 /]# pcs cluster standby node2
[root@node0 /]#
[root@node0 /]#
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 02:07:53 2020
Last change: Mon Jul 20 02:07:50 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Node node2: standby
Online: [ node0 node1 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@node0 /]# pcs cluster unstandby node2
[root@node0 /]#
[root@node0 /]#
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 02:08:04 2020
Last change: Mon Jul 20 02:08:02 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@node0 /]#
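
Besides standby/unstandby, an individual resource can also be moved to a specific node and released again; the target node below is just an example:

pcs resource move VIP node1    # adds a temporary location constraint and moves VIP to node1
pcs status
pcs resource clear VIP         # removes the constraint created by "move"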
  • Restart a resource
[root@node0 /]# pcs resource restart  VIP
  • Clean up cluster failure records
[root@node0 /]# pcs resource cleanup
  • Ignore loss of quorum (when quorum cannot be reached, keep the resources running)
[root@node0 /]# pcs property set no-quorum-policy=ignore
[root@node0 /]# pcs property list
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: cluster_test01
dc-version: 1.1.21-4.el7-f14e36fd43
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
[root@node0 /]# pcs property show
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: cluster_test01
dc-version: 1.1.21-4.el7-f14e36fd43
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
  • Configure the cluster to start automatically at boot
  • Before the change
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 02:19:22 2020
Last change: Mon Jul 20 02:17:39 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
  • Apply the setting
[root@node0 /]#
[root@node0 /]# pcs cluster enable --all
  • After the change
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 02:22:06 2020
Last change: Mon Jul 20 02:17:39 2020 by root via cibadmin on node0

3 nodes configured
1 resource configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node0 /]#
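
The change can also be confirmed directly with systemd on any node:

systemctl is-enabled corosync pacemaker pcsd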

Example 1:

High availability for the httpd service on CentOS 7 with Corosync + Pacemaker + pcs

Install and start httpd

[ALL] Install on every node

yum -y install httpd
service httpd start

[root@node0 /]# service httpd start
Redirecting to /bin/systemctl start httpd.service
[root@node0 /]#
[root@node0 /]# service httpd status
Redirecting to /bin/systemctl status httpd.service
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2020-07-20 02:27:21 CST; 1min 16s ago
Docs: man:httpd(8)
man:apachectl(8)
Main PID: 19333 (httpd)
Status: "Total requests: 10; Current requests/sec: 0; Current traffic: 0 B/sec"
CGroup: /system.slice/httpd.service
├─19333 /usr/sbin/httpd -DFOREGROUND
├─19334 /usr/sbin/httpd -DFOREGROUND
├─19335 /usr/sbin/httpd -DFOREGROUND
├─19337 /usr/sbin/httpd -DFOREGROUND
├─19338 /usr/sbin/httpd -DFOREGROUND
├─19387 /usr/sbin/httpd -DFOREGROUND
├─19388 /usr/sbin/httpd -DFOREGROUND
├─19389 /usr/sbin/httpd -DFOREGROUND
├─19390 /usr/sbin/httpd -DFOREGROUND
├─19391 /usr/sbin/httpd -DFOREGROUND
└─19392 /usr/sbin/httpd -DFOREGROUND

Jul 20 02:27:21 node0 systemd[1]: Starting The Apache HTTP Server...
Jul 20 02:27:21 node0 httpd[19333]: AH00558: httpd: Could not reliably determine the server's fully qualif...ssage
Jul 20 02:27:21 node0 systemd[1]: Started The Apache HTTP Server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@node0 /]#

Verify that the HTTP service works

(figure: browser test showing the default Apache page is reachable)

[ALL] Enable the Apache server-status page

vim /etc/httpd/conf.d/status.conf
<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from all
</Location>
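
While httpd is still running, you can quickly confirm that the status page responds locally, for example:

curl -s http://localhost/server-status | head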

[ALL] Stop the httpd service. When the httpd resource is added, the cluster will start the service itself; if httpd is left running, adding the resource will fail.

systemctl stop httpd
systemctl status httpd

Add the WebSite resource

Note: this time the command is run on node1.

[root@node1 corosync]# pcs resource create WebSite ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://localhost/server-status" op monitor interval=30s
[root@node1 corosync]#
[root@node1 corosync]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 04:33:32 2020
Last change: Mon Jul 20 04:33:25 2020 by root via cibadmin on node1

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
WebSite (ocf::heartbeat:apache): Started node1

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 corosync]#

This creates an httpd cluster resource named WebSite, currently running on node1. The status URL is http://localhost/server-status, checked every 30s. But there is a new problem: the virtual IP is on node0 while the httpd resource is on node1, so clients cannot reach the site through the VIP. Also, if the VIP is not running on any node, WebSite should not run either.

Set the default resource operation timeout

[root@node1 corosync]# pcs resource op defaults timeout=120s
Warning: Defaults do not apply to resources which override them with their own defined values
[root@node1 corosync]# pcs resource op defaults
timeout=120s
[root@node1 corosync]#

Bind the service resource and the VIP resource so they always stay on the same node

[root@node1 corosync]#  pcs constraint colocation add WebSite with VIP INFINITY
[root@node1 corosync]#
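
Optionally, an order constraint can also be added so that the VIP is always brought up before Apache starts; it is not required for the colocation above, just a common companion setting:

pcs constraint order VIP then WebSite
pcs constraint show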

(figure: cluster status after adding the colocation constraint)

Test access from a browser

(figure: browser access test through the VIP)

Example 2

Business-level high availability on CentOS 7 with Corosync + Pacemaker + pcs + HAProxy

  • 1. Delete the existing WebSite resource [ONE]
[root@node1 corosync]# pcs resource delete WebSite
Attempting to stop: WebSite... Stopped
[root@node1 corosync]#
  • 2. Install the haproxy service [ALL]
yum -y install haproxy
  • 3. Configure httpd to listen on each node's own NIC IP on port 80 [ALL]

Change Listen 80 to Listen <NIC IP>:80

  • node0
grep -w 80 /etc/httpd/conf/httpd.conf
sed -i "/Listen[[:blank:]]80/c\ Listen 192.168.0.70:80" /etc/httpd/conf/httpd.conf
systemctl restart httpd
grep -w 80 /etc/httpd/conf/httpd.conf
  • node1
grep -w 80 /etc/httpd/conf/httpd.conf
sed -i "/Listen[[:blank:]]80/c\ Listen 192.168.0.71:80" /etc/httpd/conf/httpd.conf
systemctl restart httpd
grep -w 80 /etc/httpd/conf/httpd.conf
  • node2
grep -w 80 /etc/httpd/conf/httpd.conf
sed -i "/Listen[[:blank:]]80/c\ Listen 192.168.0.72:80" /etc/httpd/conf/httpd.conf
systemctl restart httpd
grep -w 80 /etc/httpd/conf/httpd.conf
  • 4. Configure haproxy [ALL]
vim /etc/haproxy/haproxy.cfg
Append the following:

#---------------------------------------------------------------------
# listen httpd server
#---------------------------------------------------------------------
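
The original listen section is not reproduced here. A minimal sketch of what it might look like, assuming HAProxy listens on the VIP 192.168.0.75 and balances across the three Apache instances (the section name and balance settings are illustrative):

listen httpd_cluster
    bind 192.168.0.75:80
    mode http
    balance roundrobin
    server node0 192.168.0.70:80 check
    server node1 192.168.0.71:80 check
    server node2 192.168.0.72:80 check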
  • 5. Create the haproxy resource
[root@node0 /]# pcs resource create haproxy systemd:haproxy op monitor interval="5s"
[root@node0 /]#
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 04:57:31 2020
Last change: Mon Jul 20 04:57:27 2020 by root via cibadmin on node0

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): FAILED node1

Failed Resource Actions:
* haproxy_start_0 on node1 'unknown error' (1): call=37, status=complete, exitreason='',
last-rc-change='Mon Jul 20 04:57:28 2020', queued=0ms, exec=2276ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node0 /]#
[root@node0 /]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 04:58:39 2020
Last change: Mon Jul 20 04:57:27 2020 by root via cibadmin on node0

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): Stopped

Failed Resource Actions:
* haproxy_start_0 on node2 'unknown error' (1): call=27, status=complete, exitreason='',
last-rc-change='Mon Jul 20 04:57:33 2020', queued=0ms, exec=2242ms
* haproxy_start_0 on node0 'unknown error' (1): call=51, status=complete, exitreason='',
last-rc-change='Mon Jul 20 04:57:37 2020', queued=0ms, exec=2252ms
* haproxy_start_0 on node1 'unknown error' (1): call=37, status=complete, exitreason='',
last-rc-change='Mon Jul 20 04:57:28 2020', queued=0ms, exec=2276ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node0 /]#

The resource has been created and started, but there are errors. This is because on the other nodes the haproxy configuration binds to the virtual IP, and the VIP is not present on those nodes, so haproxy fails to start there.
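
As in Example 1, the haproxy resource can also be colocated with the VIP so that it only ever runs on the node currently holding 192.168.0.75. This step is not part of the original run, but it avoids the start failures shown above:

pcs constraint colocation add haproxy with VIP INFINITY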

  • Clear the cluster errors
[root@node0 ~]# pcs resource cleanup
Cleaned up all resources on all nodes
[root@node0 ~]#
  • Restart the haproxy resource
[root@node0 ~]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 05:46:40 2020
Last change: Mon Jul 20 05:46:34 2020 by root via crm_resource on node0

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): Started node0

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
  • Stop node0 to simulate a node0 failure
[root@node0 ~]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 05:46:40 2020
Last change: Mon Jul 20 05:46:34 2020 by root via crm_resource on node0

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): Started node0

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled



[root@node1 ~]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node0 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 05:47:28 2020
Last change: Mon Jul 20 05:46:34 2020 by root via crm_resource on node0

3 nodes configured
2 resources configured

Online: [ node0 node1 node2 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): Stopping node0

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@node1 ~]# pcs status
Cluster name: cluster_test01
Stack: corosync
Current DC: node2 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Jul 20 05:47:35 2020
Last change: Mon Jul 20 05:46:34 2020 by root via crm_resource on node0

3 nodes configured
2 resources configured

Online: [ node1 node2 ]
OFFLINE: [ node0 ]

Full list of resources:

VIP (ocf::heartbeat:IPaddr2): Started node1
haproxy (systemd:haproxy): Started node1

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
  • Access the web service
    (figure: browser access through the VIP after failover)

(figure: second browser test after failover)

Related Configuration and Log Files

  • Configuration files
/etc/corosync/corosync.conf -  membership and quorum configuration 
/var/lib/pacemaker/crm/cib.xml - cluster node and resource configuration.
  • Log files
/var/log/pacemaker.log

/var/log/cluster/corosync.log

/var/log/pcsd/pcsd.log

/var/log/messages (look for pengine, crmd, ...)

Related Concepts

  • Primitive (single resource): a single resource managed by the cluster; it is only ever started once in the cluster (for example a VIP address)
  • Clone (cloned resource): a resource that should run on several nodes at the same time; in the master/slave (multi-state) variant one instance is the master and the others are slaves, with the cluster managing the master/slave state
  • Group (resource group): a convenient way to keep resources together; resources in a group always stay on the same node and are started in the order in which they are listed in the group
  • Resource classes:
  • OCF (Open Cluster Framework): an extension of the LSB conventions for init scripts, and the preferred resource class for use in the cluster
  • LSB (Linux Standard Base): standard Linux init scripts
  • Systemd: systemd units
  • Fencing: fencing-related resources
  • Service: mixed clusters where nodes use systemd, upstart, and lsb commands
  • Nagios: Nagios plugins
  • Constraints: a constraint is a set of rules that defines how resources (or groups) are placed and started.
  • Constraint types:
  • location: a location constraint defines on which server a resource should run (or, with a negative score, should never run)
  • colocation: a colocation constraint defines which resources must be started together, or must never be started together
  • order:
  • an order constraint is used to define a specific start order
  • order constraints are implicit within a resource group
  • explicit order constraints can be more convenient because they can be defined between different kinds of resources; for example, a resource group can be required to start only after a particular single resource has started
  • Scores: used to decide the priority of a constraint
  • Scores range from -1,000,000 (-INFINITY = may never happen) to 1,000,000 (INFINITY = must happen)
  • To express that you never want an action to be carried out, use a negative score; any score lower than 0 will ban the resource from that node
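
Using the resources from this article, the three constraint types map onto pcs commands roughly as follows (the scores and node choice are illustrative):

# location: prefer to run VIP on node0 with a score of 100
pcs constraint location VIP prefers node0=100

# colocation: always keep WebSite on the same node as VIP
pcs constraint colocation add WebSite with VIP INFINITY

# order: start VIP before starting WebSite
pcs constraint order VIP then WebSite

# review all configured constraints
pcs constraint show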

Common Cluster Operations

  • Check cluster status
# pcs status
# pcs config

# pcs cluster status
# pcs quorum status
# pcs resource show
# crm_verify -L -V

# crm_mon
  • Destroy the cluster
# pcs cluster destroy <cluster_name>
  • Start/stop the whole cluster
# pcs cluster start --all
# pcs cluster stop --all
  • Start/stop a single node
# pcs cluster start <node>
# pcs cluster stop <node>
  • Forcefully stop the cluster services on the local node
# pcs cluster kill
  • Put a node into standby
# pcs cluster standby <node1>
  • Take a node out of standby
# pcs cluster unstandby <node1>
  • Set a cluster property
# pcs property set <property>=<value>
  • Disable fencing
# pcs property set stonith-enabled=false

Detailed Operations

  • [show [] | --full | --groups | --hide-inactive]
    Show all currently configured resources or if a resource is specified
    show the options for the configured resource. If --full is specified,
    all configured resource options will be displayed. If --groups is
    specified, only show groups (and their resources). If --hide-inactive
    is specified, only show active resources.
[root@node0 ~]# pcs resource show
VIP (ocf::heartbeat:IPaddr2): Started node0
haproxy (systemd:haproxy): Started node0
  • List all resources that can be created and filter the output
[root@node1 ~]#  pcs resource list |grep Ipadd -i
ocf:heartbeat:IPaddr - Manages virtual IPv4 and IPv6 addresses (Linux specific
ocf:heartbeat:IPaddr2 - Manages virtual IPv4 and IPv6 addresses (Linux specific
  • Describe a specific resource agent
[root@node1 ~]#  pcs resource describe IPaddr2
Assumed agent name 'ocf:heartbeat:IPaddr2' (deduced from 'IPaddr2')
ocf:heartbeat:IPaddr2 - Manages virtual IPv4 and IPv6 addresses (Linux specific version)

This Linux-specific resource manages IP alias IP addresses.
It can add an IP alias, or remove one.
In addition, it can implement Cluster Alias IP functionality
if invoked as a clone resource.

If used as a clone, you should explicitly set clone-node-max >= 2,
and/or clone-max < number of nodes. In case of node failure,
clone instances need to be re-allocated on surviving nodes.
This would not be possible if there is already an instance on those nodes,
and clone-node-max=1 (which is the default).

Resource options:
ip (required) (unique): The IPv4 (dotted quad notation) or IPv6 address (colon hexadecimal notation) example IPv4 "192.168.1.1". example IPv6
"2001:db8:DC28:0:0:FC57:D4C8:1FFF".
nic: The base network interface on which the IP address will be brought online. If left empty, the script will try and determine this from the
routing table. Do NOT specify an alias interface in the form eth0:1 or anything here; rather, specify the base interface only. If you want a
label, see the iflabel parameter. Prerequisite: There must be at least one static IP address, which is not managed by the cluster, assigned
to the network interface. If you can not assign any static IP address on the interface, modify this kernel parameter: sysctl -w
net.ipv4.conf.all.promote_secondaries=1 # (or per device)
cidr_netmask: The netmask for the interface in CIDR format (e.g., 24 and not 255.255.255.0) If unspecified, the script will also try to determine
this from the routing table.
broadcast: Broadcast address associated with the IP. It is possible to use the special symbols '+' and '-' instead of the broadcast address. In
this case, the broadcast address is derived by setting/resetting the host bits of the interface prefix.
iflabel: You can specify an additional label for your IP address here. This label is appended to your interface name. The kernel allows
alphanumeric labels up to a maximum length of 15 characters including the interface name and colon (e.g. eth0:foobar1234) A label can be
specified in nic parameter but it is deprecated. If a label is specified in nic name, this parameter has no effect.
lvs_support: Enable support for LVS Direct Routing configurations. In case a IP address is stopped, only move it to the loopback device to allow
the local node to continue to service requests, but no longer advertise it on the network. Notes for IPv6: It is not necessary to
enable this option on IPv6. Instead, enable 'lvs_ipv6_addrlabel' option for LVS-DR usage on IPv6.
lvs_ipv6_addrlabel: Enable adding IPv6 address label so IPv6 traffic originating from the address's interface does not use this address as the
source. This is necessary for LVS-DR health checks to realservers to work. Without it, the most recently added IPv6 address
(probably the address added by IPaddr2) will be used as the source address for IPv6 traffic from that interface and since that
address exists on loopback on the realservers, the realserver response to pings/connections will never leave its loopback. See
RFC3484 for the detail of the source address selection. See also 'lvs_ipv6_addrlabel_value' parameter.
lvs_ipv6_addrlabel_value: Specify IPv6 address label value used when 'lvs_ipv6_addrlabel' is enabled. The value should be an unused label in the
policy table which is shown by 'ip addrlabel list' command. You would rarely need to change this parameter.
mac: Set the interface MAC address explicitly. Currently only used in case of the Cluster IP Alias. Leave empty to chose automatically.
clusterip_hash: Specify the hashing algorithm used for the Cluster IP functionality.
unique_clone_address: If true, add the clone ID to the supplied value of IP to create a unique address to manage
arp_interval: Specify the interval between unsolicited ARP packets in milliseconds. This parameter is deprecated and used for the backward
compatibility only. It is effective only for the send_arp binary which is built with libnet, and send_ua for IPv6. It has no effect
for other arp_sender.
arp_count: Number of unsolicited ARP packets to send at resource initialization.
arp_count_refresh: Number of unsolicited ARP packets to send during resource monitoring. Doing so helps mitigate issues of stuck ARP caches
resulting from split-brain situations.
arp_bg: Whether or not to send the ARP packets in the background.
arp_sender: The program to send ARP packets with on start. Available options are: - send_arp: default - ipoibarping: default for infiniband
interfaces if ipoibarping is available - iputils_arping: use arping in iputils package - libnet_arping: use another variant of arping
based on libnet
send_arp_opts: Extra options to pass to the arp_sender program. Available options are vary depending on which arp_sender is used. A typical use
case is specifying '-A' for iputils_arping to use ARP REPLY instead of ARP REQUEST as Gratuitous ARPs.
flush_routes: Flush the routing table on stop. This is for applications which use the cluster IP address and which run on the same physical host
that the IP address lives on. The Linux kernel may force that application to take a shortcut to the local loopback interface,
instead of the interface the address is really bound to. Under those circumstances, an application may, somewhat unexpectedly,
continue to use connections for some time even after the IP address is deconfigured. Set this parameter in order to immediately
disable said shortcut when the IP address goes away.
run_arping: Whether or not to run arping for IPv4 collision detection check.
preferred_lft: For IPv6, set the preferred lifetime of the IP address. This can be used to ensure that the created IP address will not be used as
a source address for routing. Expects a value as specified in section 5.5.4 of RFC 4862.
monitor_retries: Set number of retries to find interface in monitor-action. ONLY INCREASE IF THE AGENT HAS ISSUES FINDING YOUR NIC DURING THE
MONITOR-ACTION. A HIGHER SETTING MAY LEAD TO DELAYS IN DETECTING A FAILURE.

Default operations:
start: interval=0s timeout=20s
stop: interval=0s timeout=20s
monitor: interval=10s timeout=20s

References:
https://www.freenetst.it/tech/rh7cluster/ ("Red Hat Enterprise Linux 7 High Availability Add-On Reference")