现有集群环境下,怀疑备库宕机,可通过如下方法检查:
备库执行命令ps -ef |grep post,如无如下红色字样,则表示备库宕机。
[root@node2 ~]# ps -ef |grep post root 11796 11682 0 16:45 pts/0 00:00:00 grep --color=auto post root 19592 1 0 Jul28 ? 00:00:14 /opt/HighGo4.3.4.8-see/bin/postgres root 19593 19592 0 Jul28 ? 00:00:00 postgres: logger process root 19595 19592 0 Jul28 ? 00:00:00 postgres: checkpointer process root 19596 19592 0 Jul28 ? 00:00:11 postgres: writer process root 19597 19592 0 Jul28 ? 00:00:11 postgres: wal writer process root 19598 19592 0 Jul28 ? 00:00:40 postgres: autovacuum launcher process root 19599 19592 0 Jul28 ? 00:00:03 postgres: archiver process root 19600 19592 0 Jul28 ? 00:00:36 postgres: stats collector process root 19601 19592 0 Jul28 ? 00:05:58 postgres: bgworker: highgo cron root 19602 19592 0 Jul28 ? 00:00:00 postgres: bgworker: logical replication launcher |
问题原因
备库宕机的可能原因有很多,例如系统资源争用,内存不足,物理硬件故障,人为误操作,软件本身bug等等。需结合数据库日志和操作系统日志进行进一步判断。
数据库日志一般存在于数据库data目录下的hgdb_log或log中。
可通过psql命令连接数据库后执行如下操作进行确认:
highgo=# show log_directory ; log_directory --------------- hgdb_log (1 row) |
解决方案
首先执行如下命令
[root@node2 data]# pkill repmgrd |
尝试直接启动数据库备库。
[root@node1 ~]# pg_ctl start waiting for server to start....2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv4 address "0.0.0.0", port 5866 2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv6 address "::", port 5866 2020-08-03 16:57:53.987 CST [12154] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866" 2020-08-03 16:57:54.017 CST [12154] LOG: redirecting log output to logging collector process 2020-08-03 16:57:54.017 CST [12154] HINT: Future log output will appear in directory "hgdb_log". done server started |
此处看到有提示server started,即代表启动成功。
接下来查询集群状态,在任意一节点执行如下系统命令:
[root@node1 ~]# repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Replication lag | Last replayed LSN ----+-------+---------+-----------+----------+----------+----------+-----------------+------------------- 1 | node1 | primary | * running | | default | 100 | n/a | none 2 | node2 | standby | running | node1 | default | 100 | 0 bytes | 0/EE001A78 |
标红处为0字节即为正常,备库恢复完毕,直接执行本文底部第5节启动守护进程即可。如果多次查询该值超过50M且无变化或增大,则说明主库上的wal日志被移除,备库断档,需要通过如下方式进行恢复。
1先停掉选中的备库,并移动原数据目录。
[root@node2 ~]$ pg_ctl stop [root@node2 HighGo4.3.4.8-see]$ mv data data1 |
2使用pg_basebackup或repmgr standby clone进行重做,node1是主库主机名,也可以写主库ip
[root@node2 HighGo4.3.4.8-see]$ pg_basebackup -h node1 -U sysdba -D $PGDATA -Fp -Xs -v -R 或 [root@node2 HighGo4.3.4.8-see]$ repmgr standby clone -h node1 -U sysdba -d highgo
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed pg_basebackup: write-ahead log start point: 0/A5000028 on timeline 27 pg_basebackup: starting background WAL receiver pg_basebackup: write-ahead log end point: 0/A50000F8 pg_basebackup: waiting for background process to finish streaming ... pg_basebackup: base backup completed |
3启动数据库
[root@node2 data]$ pg_ctl start
waiting for server to start....2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv4 address "0.0.0.0", port 5866 2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv6 address "::", port 5866 2020-04-29 14:16:50.405 HKT [8856] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866" 2020-04-29 14:16:50.491 HKT [8856] LOG: redirecting log output to logging collector process 2020-04-29 14:16:50.491 HKT [8856] HINT: Future log output will appear in directory "hgdb_log". done server started |
4检查集群状态,恢复正常
[root@node2 ~]# repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Replication lag | Last replayed LSN ----+-------+---------+-----------+----------+----------+----------+-----------------+------------------- 1 | node1 | primary | * running | | default | 100 | n/a | none 2 | node2 | standby | running | node1 | default | 100 | 0 bytes | 0/EE001A77 |
5启动守护进程
[root@node2 data]# repmgrd -d |