postgresql-pgpool-流复制主从同步（一）---PGpool-II流复制

原创

流年_火星程序员 2020-06-12 14:28:16 博主文章分类：postgresql ©著作权

文章标签 pgool 主从 文章分类 运维

©著作权归作者所有：来自51CTO博客作者流年_火星程序员的原创作品，请联系作者获取转载授权，否则将追究法律责任

基于PGPool的双机集群如上图所示：pg主节点和备节点实现流复制热备，pgpool1，pgpool2作为中间件，将主备pg节点加入集群，实现读写分离，负载均衡和HA故障自动切换。两pgpool节点可以委托一个虚拟ip节点作为应用程序访问的地址，两节点之间通过watchdog进行监控，当pgpool1宕机时，pgpool2会自动接管虚拟ip继续对外提供不间断服务。一主机规划

主机名	IP	角色	端口
master	192.168.0.108	PGMaster	5432
192.168.0.108	pgpool1	9999
slave	192.168.0.109	PGSlave	5432
192.168.0.109	pgpool2	9999
vip	192.168.0.150	虚拟ip	9999

建立好主机规划之后，在master，slave上两台机器设置下host [root@localhost ~]# vi .bashrc #编辑内容如下： 192.168.0.108 master 192.168.0.109 slave 192.168.0.150 vip 二配置ssh秘钥在master,slave机器上都生成ssh如下： [root@localhost ~]# su - postgres [postgres@localhost ~]$ ssh-keygen -t rsa [postgres@localhost ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys [postgres@localhost ~]$ chmod 600 ~/.ssh/authorized_keys 分别将master的公钥复制到slave，slave的公钥复制到master。 #master端 [postgres@localhost ~]$ scp ~/.ssh/authorized_keys postgres@slave:~/.ssh/ #slave端 [postgres@localhost ~]$ scp ~/.ssh/authorized_keys postgres@master:~/.ssh/ 验证下ssh配置是否成功 #master端 [postgres@slave ~]$ ssh postgres@slave Last login: Tue Dec 20 21:22:50 2016 from master #slave端 [postgres@slave ~]$ ssh postgres@master Last login: Tue Dec 20 21:22:50 2016 from slave 证明ssh信任关系配置成功。三安装pgpool 中文配置地址可参考http://pgpool.projects.pgfoundry.org/pgpool-II/doc/pgpool-zh_cn.html

下载pgpool

[root@master opt]# wget http://www.pgpool.net/mediawiki/images/pgpool-II-3.6.0.tar.gz

解压

[root@master opt]# tar -zxvf pgpool-II-3.6.0.tar.gz

文件权限设置为postgres（其实并非一定装在postgres账户，只不过之前ssh设置都在postgres下，为了方便）

[root@master opt]# chown -R postgres.postgres /opt/pgpool-II-3.6.0 [root@master ~]# su - postgres [postgres@master opt]$ cd pgpool-II-3.6.0 [postgres@master pgpool-II-3.6.0]$ ./configure --prefix=/opt/pgpool -with-pgsql=path -with-pgsql=/home/postgres

error: libpq is not installed or libpq is old

缺少postgresql-devel

[postgres@master pgpool-II-3.6.0]$ make [postgres@master pgpool-II-3.6.0]$ make install 安装pgpool相关函数，并非强制，可选安装，为了系统稳定，建议安装安装pg_reclass,pg_recovery [postgres@master pgpool-II-3.6.0]$ cd src/sql [postgres@master sql]$ make [postgres@master sql]$ make install [postgres@master sql]$ psql -f insert_lock.sql 安装全部结束。四配置pgpool 4.1 配置pgpool环境变量 pgpool装在了postgres账户下，在该账户中添加环境变量，master,slave节点都执行。 [postgres@master ~]$ cd /home/postgres [postgres@master ~]$ vim .bashrc #编辑内容如下 PGPOOLHOME=/opt/pgpool export PGPOOLHOME PATH=$PATH:$HOME/.local/bin:$HOME/bin:$PGHOME/bin:$PGPOOLHOME/bin export PATH 4.2 配置pool_hba.conf pool_hba.conf是对登录用户进行验证的，要和pg的pg_hba.conf保持一致，要么都是trust，要么都是md5验证方式，这里采用了md5验证方式如下设置： [postgres@master ~]$ cd /opt/pgpool/etc [postgres@etc~]$ cp pool_hba.conf.sample pool_hba.conf [postgres@etc~]$ vim pool_hba.conf #编辑内容如下

"local" is for Unix domain socket connections only

local all all md5

IPv4 local connections:

host all all 0.0.0.0/0 md5 host all all 0/0 md5 4.3 配置pcp.conf pcp.conf配置用于pgpool自己登陆管理使用的，一些操作pgpool的工具会要求提供密码等，配置如下： [postgres@master ~]$ cd /opt/pgpool/etc [postgres@etc~]$ cp pcp.conf.sample pcp.conf

使用pg_md5生成配置的用户名密码

[postgres@etc~]$ pg_md5 nariadmin 6b07583ba8af8e03043a1163147faf6a #pcp.conf是pgpool管理器自己的用户名和密码，用于管理集群。 [postgres@etc~]$ vim pcp.conf #编辑内容如下 postgres:6b07583ba8af8e03043a1163147faf6a #保存退出！ #在pgpool中添加pg数据库的用户名和密码 [postgres@etc~]$ pg_md5 -p -m -u postgres pool_passwd #数据库登录用户是postgres,这里输入登录密码，不能出错 #输入密码后，在pgpool/etc目录下会生成一个pool_passwd文件 4.4 配置系统命令权限配置 ifconfig, arping 执行权限，执行failover_stream.sh需要用到，可以让其他普通用户执行。 [root@master ~]# chmod u+s /sbin/ifconfig [root@master ~]# chmod u+s /usr/sbin Cd /sbin && chmod +s ifup 977 2019/10/27 16:14:59: chmod +s ip 978 2019/10/27 16:15:03: chmod +s arping 4.5 配置pgpool.conf 查看本机网卡，配置后面的delegate_IP需要 [postgres@etc~]$ ifconfig

网卡名称.png 配置master上的pgpool.conf: [postgres@master ~]$ cd /opt/pgpool/etc [postgres@etc~]$ cp pgpool.conf.sample pgpool.conf [postgres@etc~]$ vim pgpool.conf 编辑内容如下：

CONNECTIONS

listen_addresses = '' port = 9999 pcp_listen_addresses = '' pcp_port = 9898

- Backend Connection Settings -

backend_hostname0 = 'master' backend_port0 = 5432 backend_weight0 = 1 backend_data_directory0 = '/home/postgres/data' backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = 'slave' backend_port1 = 5432 backend_weight1 = 1 backend_data_directory1 = '/home/postgres/data' backend_flag1 = 'ALLOW_TO_FAILOVER'

- Authentication -

enable_pool_hba = on pool_passwd = 'pool_passwd'

FILE LOCATIONS

pid_file_name = '/opt/pgpool/pgpool.pid'

replication_mode = off load_balance_mode = on master_slave_mode = on master_slave_sub_mode = 'stream'

sr_check_period = 5 sr_check_user = 'repuser' sr_check_password = 'repuser' sr_check_database = 'postgres'

#------------------------------------------------------------------------------

HEALTH CHECK 健康检查

#------------------------------------------------------------------------------

health_check_period = 10 # Health check period # Disabled (0) by default health_check_timeout = 20 # Health check timeout # 0 means no timeout health_check_user = 'postgres' # Health check user health_check_password = 'nariadmin' #数据库密码 # Password for health check user health_check_database = 'postgres' #必须设置，否则primary数据库down了，pgpool不知道，不能及时切换。从库流复制还在连接数据，报连接失败。 #只有下次使用pgpool登录时，发现连接不上，然后报错，这时候，才知道挂了，pgpool进行切换。

#主备切换的命令行配置 #------------------------------------------------------------------------------

FAILOVER AND FAILBACK

#------------------------------------------------------------------------------

failover_command = '/opt/pgpool/failover_stream.sh %H '

#------------------------------------------------------------------------------

WATCHDOG

#------------------------------------------------------------------------------

- Enabling -

use_watchdog = on

- Watchdog communication Settings -

wd_hostname = 'master' # Host name or IP address of this watchdog # (change requires restart) wd_port = 9000 # port number for watchdog service # (change requires restart)

- Virtual IP control Setting -

delegate_IP = 'vip' # delegate IP address # If this is empty, virtual IP never bring up. # (change requires restart) if_cmd_path = '/sbin' # path to the directory where if_up/down_cmd exists # (change requires restart) if_up_cmd = 'ifconfig eth1:0 inet $IP$ netmask 255.255.255.0' # startup delegate IP command # (change requires restart) # eth1根据现场机器改掉 if_down_cmd = 'ifconfig eth1:0 down' # shutdown delegate IP command # (change requires restart) # eth1根据现场机器改掉

-- heartbeat mode --

wd_heartbeat_port = 9694 # Port number for receiving heartbeat signal # (change requires restart) wd_heartbeat_keepalive = 2 # Interval time of sending heartbeat signal (sec) # (change requires restart) wd_heartbeat_deadtime = 30 # Deadtime interval for heartbeat signal (sec) # (change requires restart) heartbeat_destination0 = 'slave' # Host name or IP address of destination 0 # for sending heartbeat signal. # (change requires restart) heartbeat_destination_port0 = 9694 # Port number of destination 0 for sending # heartbeat signal. Usually this is the # same as wd_heartbeat_port. # (change requires restart) heartbeat_device0 = 'eth1' # Name of NIC device (such like 'eth0') # used for sending/receiving heartbeat # signal to/from destination 0. # This works only when this is not empty # and pgpool has root privilege. # (change requires restart) # eth1根据现场机器改掉

- Other pgpool Connection Settings -

other_pgpool_hostname0 = 'slave' #对端 # Host name or IP address to connect to for other pgpool 0 # (change requires restart) other_pgpool_port0 = 9999 # Port number for othet pgpool 0 # (change requires restart) other_wd_port0 = 9000 # Port number for othet watchdog 0 # (change requires restart)

配置slave上的pgpool.conf:

CONNECTIONS

listen_addresses = '' port = 9999 pcp_listen_addresses = '' pcp_port = 9898

- Backend Connection Settings -

backend_hostname0 = 'master' backend_port0 = 5432 backend_weight0 = 1 backend_data_directory0 = '/home/postgres/data' backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = 'slave' backend_port1 = 5432 backend_weight1 = 1 backend_data_directory1 = '/home/postgres/data' backend_flag1 = 'ALLOW_TO_FAILOVER'

- Authentication -

enable_pool_hba = on pool_passwd = 'pool_passwd'

FILE LOCATIONS

pid_file_name = '/opt/pgpool/pgpool.pid'

replication_mode = off load_balance_mode = on master_slave_mode = on master_slave_sub_mode = 'stream'

sr_check_period = 5 sr_check_user = 'repuser' sr_check_password = 'repuser' sr_check_database = 'postgres'

#------------------------------------------------------------------------------

HEALTH CHECK 健康检查

#------------------------------------------------------------------------------

#主备切换的命令行配置 #------------------------------------------------------------------------------

FAILOVER AND FAILBACK

#------------------------------------------------------------------------------

failover_command = '/opt/pgpool/failover_stream.sh %H '

#------------------------------------------------------------------------------

WATCHDOG

#------------------------------------------------------------------------------

- Enabling -

use_watchdog = on

- Watchdog communication Settings -

wd_hostname = 'slave' #本端 # Host name or IP address of this watchdog # (change requires restart) wd_port = 9000 # port number for watchdog service # (change requires restart)

- Virtual IP control Setting -

-- heartbeat mode --

wd_heartbeat_port = 9694 # Port number for receiving heartbeat signal # (change requires restart) wd_heartbeat_keepalive = 2 # Interval time of sending heartbeat signal (sec) # (change requires restart) wd_heartbeat_deadtime = 30 # Deadtime interval for heartbeat signal (sec) # (change requires restart) heartbeat_destination0 = 'master' #对端 # Host name or IP address of destination 0 # for sending heartbeat signal. # (change requires restart) heartbeat_destination_port0 = 9694 # Port number of destination 0 for sending # heartbeat signal. Usually this is the # same as wd_heartbeat_port. # (change requires restart) heartbeat_device0 = 'eth1' # Name of NIC device (such like 'eth0') # used for sending/receiving heartbeat # signal to/from destination 0. # This works only when this is not empty # and pgpool has root privilege. # (change requires restart) # eth1根据现场机器改掉

- Other pgpool Connection Settings -

other_pgpool_hostname0 = 'master' #对端 # Host name or IP address to connect to for other pgpool 0 # (change requires restart) other_pgpool_port0 = 9999 # Port number for othet pgpool 0 # (change requires restart) other_wd_port0 = 9000 # Port number for othet watchdog 0 # (change requires restart)

配置文件里，故障处理配置的是failover_command = '/opt/pgpool/failover_stream.sh %H '，因此，需要在/opt/pgpool目录中写个failover_stream.sh脚本： [postgres@master ~]$ cd /opt/pgpool [postgres@pgpool~]$ touch failover_stream.sh [postgres@pgpool~]$ vim failover_stream.sh 注意这里使用了promote 而不是触发文件，触发文件来回切换有问题，编辑内容如下： #! /bin/sh

Failover command for streaming replication.

Arguments: $1: new master hostname.

new_master=$1 trigger_command="$PGHOME/bin/pg_ctl promote -D $PGDATA"

Prompte standby database.

/usr/bin/ssh -T $new_master $trigger_command

exit 0; 如果是其他用户创建的，需要赋予postgres可执行权限，例如 [root@opt ~]$ chown -R postgres.postgres /opt/pgpool [root@opt ~]]$ chmod 777 /opt/pgpool/failover_stream.sh 五 PGPool集群管理启动之前在master,slave节点创建两个日志文件： [root@master ~]# mkdir /var/log/pgpool [root@master ~]# chown -R postgres.postgres /var/log/pgpool [root@master ~]# mkdir /var/run/pgpool [root@master ~]# chown -R postgres.postgres /var/run/pgpool 5.1 启动集群分别启动primary,standby的pg库 #master上操作 [postgres@master ~]$ pg_ctl start -D $PGDATA #slave上操作 [postgres@slave ~]$ pg_ctl start -D $PGDATA 分别启动pgpool命令： #master上操作

-D会重新加载pg nodes的状态如down或up

[postgres@master ~]$ pgpool -n -d -D > /var/log/pgpool/pgpool.log 2>&1 & [1] 3557

#slave上操作 [postgres@slave ~]$ pgpool -n -d -D > /var/log/pgpool/pgpool.log 2>&1 & [1] 3557 注意快速终止pgpool命令： [postgres@ ~]$ pgpool -m fast stop 启动pgpool后，查看集群节点状态: [postgres@master ~]$ psql -h vip -p 9999 psql (9.6.1) #提示输入密码： Type "help" for help.

#在slave上节点也是psql -h vip -p 9999，双pgpool使用虚拟ip，做到高可用。发现当前主备节点都是正常的up状态。 5.2 Pgpool的HA 5.2.1 模拟master端pgpool宕机在master节点上停止pgpool服务 [postgres@master ~]$ pgpool -m fast stop #稍等片刻后，访问集群 [postgres@master ~]$ psql -h vip -p 9999 psql (9.6.1) #提示输入密码： Type "help" for help.

postgres=# show pool_nodes; node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay ---------+----------+------+--------+-----------+---------+------------+-------------------+------------------- 0 | master | 5432 | up | 0.500000 | primary | 0 | false | 0 1 | slave | 5432 | up | 0.500000 | standby | 0 | true | 0 (2 rows) #访问成功，在master节点上的pgpool宕机后，由slave节点的pgpool接管vip和集群服务，并未中断应用访问。 #在master上重新启动pgpool后，定制slave上的pgpool服务，结果一样。 5.2.2模拟master端pg primary宕机 [postgres@master ~]$ pg_ctl stop #master端打印 2017-07-24 18:52:37.751 PDT [28154] STATEMENT: SELECT pg_current_xlog_location() 2017-07-24 18:52:37.760 PDT [2553] LOG: received fast shutdown request 2017-07-24 18:52:37.760 PDT [2553] LOG: aborting any active transactions 2017-07-24 18:52:37.762 PDT [28156] FATAL: canceling authentication due to timeout 2017-07-24 18:52:37.763 PDT [2555] LOG: shutting down 2017-07-24 18:52:37.768 PDT [28158] FATAL: the database system is shutting down 2017-07-24 18:52:37.775 PDT [28159] FATAL: the database system is shutting down 2017-07-24 18:52:39.653 PDT [2553] LOG: database system is shut down

#slave端打印 2017-07-24 18:52:41.455 PDT [2614] LOG: invalid record length at 0/2A000098: wanted 24, got 0 2017-07-24 18:52:47.333 PDT [2614] LOG: received promote request 2017-07-24 18:52:47.333 PDT [2614] LOG: redo done at 0/2A000028 2017-07-24 18:52:47.333 PDT [2614] LOG: last completed transaction was at log time 2017-07-24 18:17:00.946759-07 2017-07-24 18:52:47.336 PDT [2614] LOG: selected new timeline ID: 10 2017-07-24 18:52:47.841 PDT [2614] LOG: archive recovery complete 2017-07-24 18:52:47.851 PDT [2613] LOG: database system is ready to accept connections

#日志清楚看到主机down机了，slave切换了。 #稍等片刻后，访问集群 [postgres@master ~]$ psql -h vip -p 9999 Password: psql (10beta1) Type "help" for help.

5.3 数据线同步在主备切换时，修复节点并重启后，由于primary数据发生变化，或修复的节点数据发生变化再按照流复制模式加入集群，很可能报时间线不同步错误： #slave机器重启后，由于master或slave数据不同步产生了 [postgres@slave data]$ mv recovery.done recovery.conf [postgres@slave data]$ pg_ctl start waiting for server to start....2017-07-24 19:31:44.563 PDT [2663] LOG: listening on IPv4 address "0.0.0.0", port 5432 2017-07-24 19:31:44.563 PDT [2663] LOG: listening on IPv6 address "::", port 5432 2017-07-24 19:31:44.565 PDT [2663] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2017-07-24 19:31:44.584 PDT [2664] LOG: database system was shut down at 2017-07-24 19:31:30 PDT 2017-07-24 19:31:44.618 PDT [2664] LOG: entering standby mode 2017-07-24 19:31:44.772 PDT [2664] LOG: consistent recovery state reached at 0/2D000098 2017-07-24 19:31:44.772 PDT [2663] LOG: database system is ready to accept read only connections 2017-07-24 19:31:44.772 PDT [2664] LOG: invalid record length at 0/2D000098: wanted 24, got 0 2017-07-24 19:31:44.798 PDT [2668] LOG: fetching timeline history file for timeline 11 from primary server 2017-07-24 19:31:44.826 PDT [2668] FATAL: could not start WAL streaming: ERROR: requested starting point 0/2D000000 on timeline 10 is not in this server's history DETAIL: This server's history forked from timeline 10 at 0/2B0001B0. 2017-07-24 19:31:44.826 PDT [2664] LOG: new timeline 11 forked off current database system timeline 10 before current recovery point 0/2D000098 done

产生这种情况，需要根据pg_rewind工具同步数据时间线，具体分5步走。 5.3.1停掉需要做同步的节点pg服务 [postgres@slave ] pg_ctl stop

5.3.2 同步master节点上时间线 [postgres@slave data]$ pg_rewind --target-pgdata=/home/postgres/data --source-server='host=master port=5432 user=postgres dbname=postgres password=nariadmin' servers diverged at WAL location 0/2B0001B0 on timeline 10 rewinding from last common checkpoint at 0/2B000108 on timeline 10 Done! 5.3.3 修改pg_hba.conf与 recovery.done文件 #pg_hba.conf与 recovery.done都是同步master上来的，要改成slave自己的 [postgres@slave ] cd $PGDATA [postgres@slave data]$ mv recovery.done recovery.conf [postgres@slave data]$ vi pg_hba.conf #slave改成master（相当于slave的流复制对端） host replication repuser master md5 [postgres@slave data]$ vi recovery.conf #slave改成master（相当于slave的流复制对端） primary_conninfo = 'host=master port=5432 user=repuser password=repuser'
5.3.4 重启pg服务 [postgres@slave data]$ pg_ctl start waiting for server to start....2017-07-24 19:47:06.821 PDT [2722] LOG: listening on IPv4 address "0.0.0.0", port 5432 2017-07-24 19:47:06.821 PDT [2722] LOG: listening on IPv6 address "::", port 5432 2017-07-24 19:47:06.907 PDT [2722] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2017-07-24 19:47:06.930 PDT [2723] LOG: database system was interrupted while in recovery at log time 2017-07-24 19:25:42 PDT 2017-07-24 19:47:06.930 PDT [2723] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2017-07-24 19:47:06.961 PDT [2723] LOG: entering standby mode 2017-07-24 19:47:06.966 PDT [2723] LOG: redo starts at 0/2B0000D0 2017-07-24 19:47:06.971 PDT [2723] LOG: consistent recovery state reached at 0/2B01CA30 2017-07-24 19:47:06.972 PDT [2722] LOG: database system is ready to accept read only connections 2017-07-24 19:47:06.972 PDT [2723] LOG: invalid record length at 0/2B01CA30: wanted 24, got 0 2017-07-24 19:47:06.982 PDT [2727] LOG: started streaming WAL from primary at 0/2B000000 on timeline 11 done server started 5.3.5 重新加入集群 #注意slave的node_id是1，所以-n 1 [postgres@slave data]$ pcp_attach_node -d -U postgres -h vip -p 9898 -n 1 Password: #提示输入密码，输入pcp管理密码。 DEBUG: recv: tos="m", len=8 DEBUG: recv: tos="r", len=21 DEBUG: send: tos="C", len=6 DEBUG: recv: tos="c", len=20 pcp_attach_node -- Command Successful DEBUG: send: tos="X", len=4 5.3.6 查看集群节点状态 [postgres@slave data]$ psql -h vip -p 9999 Password: psql (10beta1) Type "help" for help.

[all servers]# su - postgres [all servers]$ mkdir /var/lib/pgsql/archivedir

Then we edit the configuration file $PGDATA/postgresql.conf on server1 (primary) as follows.

listen_addresses = '*' archive_mode = on archive_command = 'cp "%p" "/var/lib/pgsql/archivedir/%f"' max_wal_senders = 10 max_replication_slots = 10 wal_level = replica