背景

生产上选用repmgr给PostgreSQL数据库做高可用集群,在给生产上一套库做高可用改造时发现standby clone时报错,无法复制备库,报错内容如下:
repmgr配置备库报错File exists处理一例

原因

先说原因,是因为对PG和pg_basebackup比较了解的同学可能自己就可以想出解决方案,不需要再继续往下看了。原因是由于创建的独立表空间指定的目录放在$PGDATA目录下,repmgr的standby clone调用的是pg_basebackup,而且没有指定输出格式,默认为plain,会复制主库目录时把PGDATA目录下所有文件、目录和独立表空间目录,所以会报错File exists

解决方案

  1. 迁移主库独立表空间到PGDATA以外的目录(会阻塞写)
  2. 指定新目录做standby clone,clone完后把把文件移到实际PGDATA目录

由于方案一涉及到对主库做操作,不建议在生产上操作,除非不介意对应用的影响。

测试方案二

节点1操作

添加表空间、创建database、写表

postgres=# create user pguser login  password 'pguser';
CREATE ROLE
postgres=# create tablespace tbs_mydb owner pguser location '/home/postgres/data/pg_tbs/tbs_mydb';
WARNING:  tablespace location should not be inside the data directory
CREATE TABLESPACE
postgres=# create  database mydb with owner=pguser template=template0 encoding='UTF8' tablespace =tbs_mydb;
CREATE DATABASE
postgres=# grant all on database mydb to pguser with grant option;
GRANT
postgres=# grant all on tablespace tbs_mydb to pguser;
GRANT
postgres=# \c mydb pguser
You are now connected to database "mydb" as user "pguser".
mydb=> create table t1 (id int);
CREATE TABLE
mydb=> insert into t1 values(1);
INSERT 0 1
mydb=> select * from t1;
 id 
----
  1
(1 row)

节点2操作

第一次尝试 standby clone,出现与生产上一致的报错,报错信息与生产一致

INFO: checking and correcting permissions on existing directory "/home/postgres/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /home/postgres/data -h 192.168.56.111 -p 6000 -U repmgr -X stream 
pg_basebackup: could not create directory "/home/postgres/data/pg_tbs": File exists
pg_basebackup: removing contents of data directory "/home/postgres/data"
pg_basebackup: changes to tablespace directories will not be undone
ERROR: unable to take a base backup of the source server
HINT: data directory ("/home/postgres/data") may need to be cleaned up manually

修改repmgr.conf中的data_directory='/home/postgres/repmgr'

再次尝试 standby clone,成功

[postgres@repmgr2 ~]$ repmgr -h 192.168.56.111 -U repmgr -d repmgr -f ~/repmgr.conf standby clone -p6000 
NOTICE: destination directory "/home/postgres/repmgr" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.56.111 user=repmgr port=6000 dbname=repmgr
DETAIL: current installation size is 45 MB
DEBUG: 1 node records returned by source node
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.111 port=6000 fallback_application_name=repmgr options=-csearch_path="
DEBUG: upstream_node_id determined as 111
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/postgres/repmgr"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /home/postgres/repmgr -h 192.168.56.111 -p 6000 -U repmgr -X stream 
DEBUG: create_recovery_file(): creating "/home/postgres/repmgr/recovery.conf"...
DEBUG: recovery.conf line: standby_mode = 'on'

DEBUG: recovery.conf line: primary_conninfo = 'host=192.168.56.111 user=repmgr port=6000 application_name=repmgr2 connect_timeout=2'

DEBUG: recovery.conf line: recovery_target_timeline = 'latest'

NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /home/postgres/repmgr start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

修改repmgr.conf为原来的配置,并把repmgr目录下的所有文件mv到data目录下

data_directory='/home/postgres/data'

[postgres@repmgr2 repmgr]$ mv * ~/data/
mv: cannot move ‘pg_tbs’ to ‘/home/postgres/data/pg_tbs’: File exists

修改配置文件中的cluster_name参数并启动数据库

[postgres@repmgr2 data]$ pg_ctl -D /home/postgres/data/ start
waiting for server to start....2021-02-28 10:09:15.905 CST [3498] LOG:  listening on IPv4 address "0.0.0.0", port 6000
2021-02-28 10:09:15.912 CST [3498] LOG:  listening on Unix socket "/tmp/.s.PGSQL.6000"
2021-02-28 10:09:15.949 CST [3498] LOG:  redirecting log output to logging collector process
2021-02-28 10:09:15.949 CST [3498] HINT:  Future log output will appear in directory "log".
. done
server started

注册备库成功

[postgres@repmgr2 data]$ repmgr -f ../repmgr.conf standby register
INFO: connecting to local node "repmgr2" (ID: 113)
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.113 port=6000 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.111 port=6000 fallback_application_name=repmgr options=-csearch_path="
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 111)
INFO: standby registration complete
NOTICE: standby node "repmgr2" (ID: 113) successfully registered

检查集群状态

[postgres@repmgr2 data]$ repmgr -f ../repmgr.conf cluster show
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.113 port=6000 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.111 port=6000 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.113 port=6000 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.111 port=6000 fallback_application_name=repmgr options=-csearch_path="
 ID  | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
-----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 111 | repmgr1 | primary | * running |          | default  | 100      | 5        | host=192.168.56.111 port=6000 user=repmgr dbname=repmgr connect_timeout=2
 113 | repmgr2 | standby |   running | repmgr1  | default  | 100      | 5        | host=192.168.56.113 port=6000 user=repmgr dbname=repmgr connect_timeout=2

测试数据同步

主库测试添加数据

mydb=> insert into t1 values(2);
INSERT 0 1
mydb=> select * from t1;
 id 
----
  1
  2
(2 rows)

从库查询

[postgres@repmgr2 data]$ psql
psql (10.11)
Type "help" for help.

postgres=# \c mydb pguser
You are now connected to database "mydb" as user "pguser".
mydb=> select * from t1;
 id 
----
  1
  2
(2 rows)

写在最后

其实在创建独立表空间时PG已经做了提示表空间不应用在DATA目录,所以出现上面的报错就是掉进了前人的坑。

WARNING:  tablespace location should not be inside the data directory

如果想尝试方案一的可以提供一下思路

#新创建一个表空间
postgres=# create tablespace zhijian  owner pguser location '/data/pgdata/11/pg_tbs/tbs_zhijian';
CREATE TABLESPACE
#更改数据库的表空间
mydb=> \c postgres postgres
postgres=# alter database mydb set  tablespace zhijian;
ALTER DATABASE