环境
系统平台:Linux x86-64 Red Hat Enterprise Linux 7,中标麒麟(CPU龙芯)7
版本:4.3.4.8,4.3.4
 
症状

现有集群环境下,怀疑备库宕机,可通过如下方法检查:

备库执行命令ps -ef |grep post,如无如下红色字样,则表示备库宕机。

 

[root@node2 ~]# ps -ef |grep post

root     11796 11682  0 16:45 pts/0    00:00:00 grep --color=auto post

root     19592     1  0 Jul28 ?        00:00:14 /opt/HighGo4.3.4.8-see/bin/postgres

root     19593 19592  0 Jul28 ?        00:00:00 postgres: logger process   

root     19595 19592  0 Jul28 ?        00:00:00 postgres: checkpointer process   

root     19596 19592  0 Jul28 ?        00:00:11 postgres: writer process   

root     19597 19592  0 Jul28 ?        00:00:11 postgres: wal writer process   

root     19598 19592  0 Jul28 ?        00:00:40 postgres: autovacuum launcher process  

root     19599 19592  0 Jul28 ?        00:00:03 postgres: archiver process   

root     19600 19592  0 Jul28 ?        00:00:36 postgres: stats collector process  

root     19601 19592  0 Jul28 ?        00:05:58 postgres: bgworker: highgo cron   

root     19602 19592  0 Jul28 ?        00:00:00 postgres: bgworker: logical replication launcher  

问题原因

备库宕机的可能原因有很多,例如系统资源争用,内存不足,物理硬件故障,人为误操作,软件本身bug等等。需结合数据库日志和操作系统日志进行进一步判断。

数据库日志一般存在于数据库data目录下的hgdb_log或log中。

可通过psql命令连接数据库后执行如下操作进行确认:

 

highgo=# show log_directory ;                      

 log_directory 

---------------

 hgdb_log

(1 row)

解决方案

首先执行如下命令

[root@node2 data]# pkill repmgrd

尝试直接启动数据库备库。

[root@node1 ~]# pg_ctl start 

waiting for server to start....2020-08-03 16:57:53.978 CST [12154] LOG:  listening on IPv4 address "0.0.0.0", port 5866

2020-08-03 16:57:53.978 CST [12154] LOG:  listening on IPv6 address "::", port 5866

2020-08-03 16:57:53.987 CST [12154] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5866"

2020-08-03 16:57:54.017 CST [12154] LOG:  redirecting log output to logging collector process

2020-08-03 16:57:54.017 CST [12154] HINT:  Future log output will appear in directory "hgdb_log".

 done

server started

此处看到有提示server started,即代表启动成功。

接下来查询集群状态,在任意一节点执行如下系统命令:

 

[root@node1 ~]# repmgr cluster show 

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Replication lag | Last replayed LSN

----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------

 1  | node1 | primary | * running |          | default  | 100      | n/a        | none  

 2  | node2 | standby |  running | node1      | default  | 100      | 0 bytes     | 0/EE001A78  

标红处为0字节即为正常,备库恢复完毕,直接执行本文底部第5节启动守护进程即可。如果多次查询该值超过50M且无变化或增大,则说明主库上的wal日志被移除,备库断档,需要通过如下方式进行恢复。

1先停掉选中的备库,并移动原数据目录。

[root@node2 ~]$ pg_ctl stop

[root@node2 HighGo4.3.4.8-see]$ mv data data1

 

2使用pg_basebackup或repmgr standby clone进行重做,node1是主库主机名,也可以写主库ip

[root@node2 HighGo4.3.4.8-see]$ pg_basebackup  -h node1 -U sysdba -D $PGDATA -Fp -Xs -v -R

[root@node2 HighGo4.3.4.8-see]$ repmgr standby clone -h node1 -U sysdba -d highgo

 

Password:

 

pg_basebackup: initiating base backup, waiting for checkpoint to complete

pg_basebackup: checkpoint completed

pg_basebackup: write-ahead log start point: 0/A5000028 on timeline 27

pg_basebackup: starting background WAL receiver

pg_basebackup: write-ahead log end point: 0/A50000F8

pg_basebackup: waiting for background process to finish streaming ...

pg_basebackup: base backup completed

 

3启动数据库

[root@node2 data]$ pg_ctl start

 

waiting for server to start....2020-04-29 14:16:50.403 HKT [8856] LOG:  listening on IPv4 address "0.0.0.0", port 5866

2020-04-29 14:16:50.403 HKT [8856] LOG:  listening on IPv6 address "::", port 5866

2020-04-29 14:16:50.405 HKT [8856] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5866"

2020-04-29 14:16:50.491 HKT [8856] LOG:  redirecting log output to logging collector process

2020-04-29 14:16:50.491 HKT [8856] HINT:  Future log output will appear in directory "hgdb_log".

 done

server started

 

4检查集群状态,恢复正常

 

[root@node2 ~]# repmgr cluster show 

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Replication lag | Last replayed LSN

----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------

 1  | node1 | primary | * running |          | default  | 100      | n/a        | none  

 2  | node2 | standby |  running | node1      | default  | 100      | 0 bytes     | 0/EE001A77

5启动守护进程

 

[root@node2 data]# repmgrd -d
 更多详细信息请登录【瀚高技术支持平台】查看https://support.highgo.com/#/index/docContent/4e62e0a902d9bc01