pacemaker集群主备切换过程时间过长 pg集群主备切换

转载
mob6454cc73e9a6 2024-07-28 14:21:04
基于PGPool的双机集群如下图所示：pg主节点和备节点实现流复制热备，pgpool1，pgpool2作为中间件，将主备pg节点加入集群，实现读写分离，负载均衡和HA故障自动切换。两pgpool节点可以委托一个虚拟ip节点作为应用程序访问的地址，两节点之间通过watchdog进行监控，当pgpool1宕机时，pgpool2会自动接管虚拟ip继续对外提供不间断服务。
 
1.	主机规划
192.168.20.201 redis01
192.168.20.202 redis02
192.168.20.203 vip
2.	配置主机ssh互信
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub   redis02

3.	安装pgpool
wget http://www.pgpool.net/mediawiki/images/pgpool-II-4.1.1.tar.gz
[postgres@redis01 ~]$ tar -zxf pgpool-II-4.1.0.tar.gz
./configure  --prefix=/u01/pgpool --with-pgsql=/u01/pgsql
make && make install
4.	配置pgpool
4.1配置环境变量
pgpool装在了postgres账户下，在该账户中添加环境变量，master,slave节点都执行。
[postgres@redis01 ~]$ cat .bash_profile 
# .bash_profile
export PGHOME=/u01/pgsql  
export PGDATA=/u01/pgsql/data        
export PGPOOLHOME=/u01/pgpool
export PGPASSFILE=/u01/pgsql/data/.pgpass
export PATH=$PATH:$HOME/bin:$PGHOME/bin:$PGPOOLHOME/bin
export LD_LIBRARY_PATH=$PGHOME/lib:$PGPOOLHOME/lib
4.2配置pool_hba.conf
pool_hba.conf是对登录用户进行验证的，要和pg的pg_hba.conf保持一致，要么都是trust，要么都是md5验证方式，这里采用了md5验证方式如下设置：
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
host    postgres        postgres        192.168.20.0/24       md5
host    postgres        rep             192.168.20.0/24       md5
host    all         all         0.0.0.0/0             md5

# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     rep         192.168.20.0/24       md5
host    replication     all             ::1/128                 trust
4.3配置pcp.conf
pcp.conf配置用于pgpool自己登陆管理使用的，一些操作pgpool的工具会要求提供密码等，配置如下：
pg_md5 posgres
[postgres@redis01 etc]$ cat pcp.conf
# USERID:MD5PASSWD
postgres:e8a48653851e28c69d0506508fb27fc5
#pcp.conf是pgpool管理器自己的用户名和密码，用于管理集群
pg_md5 -p -m -u postgres pool_passwd
#数据库登录用户是postgres,这里输入登录密码，不能出错
#输入密码后，在pgpool/etc目录下会生成一个pool_passwd文件
4.4配置系统命令权限
配置 ifconfig, arping 执行权限 ，执行failover_stream.sh需要用到，可以让其他普通用户执行。
chmod u+s /sbin/ip
chmod u+s /usr/sbin/arping
4.5配置pgpool.conf
后台连接设置
# - Backend Connection Settings -
backend_hostname0 = 'redis01'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/u01/pgsql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'server1'

backend_hostname1 = 'redis02'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/u01/pgsql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'server2'
负载均衡
load_balance_mode = on
postgres流复制
master_slave_mode = on
master_slave_sub_mode = 'stream'
健康检查设置
#------------------------------------------------------------------------------
# HEALTH CHECK GLOBAL PARAMETERS
#------------------------------------------------------------------------------
health_check_period = 10
health_check_timeout = 20
health_check_user = 'postgres'
health_check_password = 'postgres'
health_check_database = ''
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000
灾备切换设置
failover_command = '/u01/pgpool/failover_stream.sh %H %h'
                                   # Executes this command at failover
                                   # Special values:
                                   #   %d = failed node id
                                   #   %h = failed node host name
                                   #   %p = failed node port number
                                   #   %D = failed node database cluster path
                                   #   %m = new master node id
                                   #   %H = new master node hostname
                                   #   %M = old master node id
                                   #   %P = old primary node id
                                   #   %r = new master port number
                                   #   %R = new master database cluster path
                                   #   %N = old primary node hostname
                                   #   %S = old primary node port number
                                   #   %% = '%' character

开启看门狗
#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------

# - Enabling -
use_watchdog = on
# -Connection to up stream servers -
trusted_servers = '192.168.20.201,192.168.20.202'
ping_path = '/bin'
# - Watchdog communication Settings -
wd_hostname = 'redis01'
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'

配置vip
# - Virtual IP control Setting -

delegate_IP = '192.168.20.205'
if_cmd_path = '/sbin'
if_up_cmd = ' /sbin/ip addr add $_IP_$/24 dev eth2 label eth2:0'
if_down_cmd = '/sbin/ip addr del $_IP_$/24 dev eth2'
arping_path = '/usr/sbin'
arping_cmd = '/usr/sbin/arping -U $_IP_$ -w 1 -I eth2'

其他pgpool连接设置
# - Other pgpool Connection Settings -
other_pgpool_hostname0 = 'redis02'
other_pgpool_port0 =9999 
other_wd_port0 = 9000
4.6配置failover_stream.sh脚本
[postgres@redis01 pgpool]$ cat failover_stream.sh 
#! /bin/sh 
# Failover command for streaming replication. 
# Arguments: $1: new master hostname. 

new_master=$1 
old_master=$2
RECOVERYCONF=$PGDATA/recovery.conf
touch_command="touch $RECOVERYCONF"
trigger_command="$PGHOME/bin/pg_ctl promote -D $PGDATA" 
delrecovery_command="rm -f  $PGDATA/recovery.done && rm -f  $PGDATA/recovery.conf"
echo1_cmd="echo -e  \" primary_conninfo = 'host=$new_master port=5432  user=rep password=rep passfile=''/u01/pgsql/data/.pgpass''' \" >> $PGDATA/recovery.conf"
echo2_cmd="echo -e \" recovery_target_timeline = 'latest' \"  >> $PGDATA/recovery.conf"
echo3_cmd="echo -e \" standby_mode = 'on' \"  >> $PGDATA/recovery.conf"



# Prompte standby database. 
/usr/bin/ssh -T $new_master $trigger_command 
# create recovery.conf 
/usr/bin/ssh -T $old_master $delrecovery_command
/usr/bin/ssh -T $old_master  $touch_command
/usr/bin/ssh -T $old_master  $echo3_cmd
/usr/bin/ssh -T $old_master  $echo1_cmd
/usr/bin/ssh -T $old_master  $echo2_cmd
exit 0;
检测到postgresql down机以后把standby转成primary模式可读写，在down机的主机生成recovery.conf，down机的postgresql启动起来以后自动转成standby。




4.7启动pgpool
启动pgpool以前先启动postgresql, postgresql已配置好流复制
启动postgressql
pg_ctl start
查看流复制
[postgres@redis01 etc]$ psql -c 'SELECT client_addr,application_name,sync_state FROM pg_stat_replication;'
  client_addr   | application_name | sync_state 
----------------+------------------+------------
 192.168.20.202 | walreceiver      | sync
启动pgpool
pgpool -n  -D  > /u01/pgpool/log/pgpool.log 2>&1 &
查看pgpool节点
[postgres@redis01 etc]$ psql -h 192.168.20.205 -p 9999 -c 'show pool_nodes'
node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | redis01  | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2020-02-23 22:04:41
 1       | redis02  | 5432 | up     | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2020-02-23 22:04:41

查看vip
[postgres@redis01 ~]$ ifconfig -a
eth1      Link encap:Ethernet  HWaddr 08:00:27:E9:92:7B  
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fee9:927b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1921 (1.8 KiB)  TX bytes:1968 (1.9 KiB)

eth2      Link encap:Ethernet  HWaddr 08:00:27:CF:77:EB  
          inet addr:192.168.20.201  Bcast:192.168.20.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fecf:77eb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:32488 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31045 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3737971 (3.5 MiB)  TX bytes:3848610 (3.6 MiB)

eth2:0    Link encap:Ethernet  HWaddr 08:00:27:CF:77:EB  
          inet addr:192.168.20.205  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

5.	演示postgresql down机

[postgres@redis01 ~]$ pg_ctl stop -m fast
waiting for server to shut down..... done
server stopped

redis01 postgressql日志
2020-02-23 22:45:21.402 CST [2577] LOG:  received fast shutdown request
2020-02-23 22:45:21.682 CST [2577] LOG:  aborting any active transactions
2020-02-23 22:45:21.682 CST [2577] LOG:  background worker "logical replication launcher" (PID 2586) exited with exit code 1
2020-02-23 22:45:21.683 CST [2580] LOG:  shutting down
2020-02-23 22:45:23.174 CST [2577] LOG:  database system is shut down

redis02 postgressql日志
2020-02-23 22:45:23.174 CST [12119] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
		Is the server running on host "redis01" (192.168.20.201) and accepting
		TCP/IP connections on port 5432?
2020-02-23 22:45:28.180 CST [12126] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
		Is the server running on host "redis01" (192.168.20.201) and accepting
		TCP/IP connections on port 5432?
2020-02-23 22:45:31.395 CST [3584] LOG:  received promote request
2020-02-23 22:45:31.395 CST [3584] LOG:  redo done at 0/26000028
2020-02-23 22:45:31.405 CST [3584] LOG:  selected new timeline ID: 12
2020-02-23 22:45:32.389 CST [3584] LOG:  archive recovery complete
2020-02-23 22:45:32.402 CST [3582] LOG:  database system is ready to accept connections

查看主备
[postgres@redis02 ~]$ pg_controldata|grep cluster
Database cluster state:               in production
[postgres@redis02 ~]$ psql -c 'SELECT client_addr,application_name,sync_state FROM pg_stat_replication;'
 client_addr | application_name | sync_state 
-------------+------------------+------------
(0 rows)

Redis02主机上的postgressql数据库直接从standby转成primary

查看pgpool集群
[postgres@redis01 pgpool]$ psql -h 192.168.20.205 -p 9999 -c 'show pool_nodes'
 node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | redis01  | 5432 | down   | 0.500000  | standby | 0          | false             | 0                 |                   |                        | 2020-02-23 22:45:33
 1       | redis02  | 5432 | up     | 0.500000  | primary | 0          | true              | 0                 |                   |                        | 2020-02-23 22:45:33
(2 rows)
从pgpool 来看也是把Redis02主机上的postgressql数据库直接从standby转成primary

启动down机的主节点；启动起来以后转成standby
[postgres@redis01 pgpool]$ pg_ctl start
waiting for server to start....2020-02-23 22:58:00.891 CST [9600] LOG:  listening on IPv4 address "192.168.20.201", port 5432
2020-02-23 22:58:00.897 CST [9600] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-02-23 22:58:00.912 CST [9600] LOG:  redirecting log output to logging collector process
2020-02-23 22:58:00.912 CST [9600] HINT:  Future log output will appear in directory "log".
 done
server started

[postgres@redis02 ~]$ psql -c 'SELECT client_addr,application_name,sync_state FROM pg_stat_replication;'
  client_addr   | application_name | sync_state 
----------------+------------------+------------
 192.168.20.201 | walreceiver      | sync
(1 row)

down机的节点加入pgpool集群
[postgres@redis01 pgpool]$ pcp_attach_node -d -U postgres -h 192.168.20.205 -p 9898 -n 0
Password: 
DEBUG: recv: tos="m", len=8
DEBUG: recv: tos="r", len=21
DEBUG: send: tos="C", len=6
DEBUG: recv: tos="c", len=20
pcp_attach_node -- Command Successful
DEBUG: send: tos="X", len=4
[postgres@redis01 pgpool]$ psql -h 192.168.20.205 -p 9999 -c 'show pool_nodes'
 node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | redis01  | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 |                   |                        | 2020-02-23 23:00:11
 1       | redis02  | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 |                   |                        | 2020-02-23 22:45:33
(2 rows)

到此灾备演练完毕。
本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。