集群备库宕机后，检查与恢复的步骤

原创

瀚高实验室 2021-07-23 16:12:49 ©著作权

文章标签 database 文章分类 Java 后端开发

©著作权归作者所有：来自51CTO博客作者瀚高实验室的原创作品，请联系作者获取转载授权，否则将追究法律责任

环境

系统平台：Linux x86-64 Red Hat Enterprise Linux 7,中标麒麟（CPU龙芯）7

版本：4.3.4.8,4.3.4

症状

现有集群环境下，怀疑备库宕机，可通过如下方法检查：

备库执行命令ps -ef |grep post，如无如下红色字样，则表示备库宕机。

[root@node2 ~]# ps -ef |grep post

root 11796 11682 0 16:45 pts/0 00:00:00 grep --color=auto post

root 19592 1 0 Jul28 ? 00:00:14 /opt/HighGo4.3.4.8-see/bin/postgres

root 19593 19592 0 Jul28 ? 00:00:00 postgres: logger process

root 19595 19592 0 Jul28 ? 00:00:00 postgres: checkpointer process

root 19596 19592 0 Jul28 ? 00:00:11 postgres: writer process

root 19597 19592 0 Jul28 ? 00:00:11 postgres: wal writer process

root 19598 19592 0 Jul28 ? 00:00:40 postgres: autovacuum launcher process

root 19599 19592 0 Jul28 ? 00:00:03 postgres: archiver process

root 19600 19592 0 Jul28 ? 00:00:36 postgres: stats collector process

root 19601 19592 0 Jul28 ? 00:05:58 postgres: bgworker: highgo cron

root 19602 19592 0 Jul28 ? 00:00:00 postgres: bgworker: logical replication launcher

问题原因

备库宕机的可能原因有很多，例如系统资源争用，内存不足，物理硬件故障，人为误操作，软件本身bug等等。需结合数据库日志和操作系统日志进行进一步判断。

数据库日志一般存在于数据库data目录下的hgdb_log或log中。

可通过psql命令连接数据库后执行如下操作进行确认：

highgo=# show log_directory ;

log_directory

---------------

hgdb_log

(1 row)

解决方案

首先执行如下命令

[root@node2 data]# pkill repmgrd

尝试直接启动数据库备库。

[root@node1 ~]# pg_ctl start

waiting for server to start....2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv4 address "0.0.0.0", port 5866

2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv6 address "::", port 5866

2020-08-03 16:57:53.987 CST [12154] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866"

2020-08-03 16:57:54.017 CST [12154] LOG: redirecting log output to logging collector process

2020-08-03 16:57:54.017 CST [12154] HINT: Future log output will appear in directory "hgdb_log".

done

server started

此处看到有提示server started，即代表启动成功。

接下来查询集群状态，在任意一节点执行如下系统命令：

[root@node1 ~]# repmgr cluster show

----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------

标红处为0字节即为正常，备库恢复完毕，直接执行本文底部第5节启动守护进程即可。如果多次查询该值超过50M且无变化或增大，则说明主库上的wal日志被移除，备库断档，需要通过如下方式进行恢复。

1先停掉选中的备库，并移动原数据目录。

[root@node2 ~]$ pg_ctl stop

[root@node2 HighGo4.3.4.8-see]$ mv data data1

2使用pg_basebackup或repmgr standby clone进行重做,node1是主库主机名，也可以写主库ip

[root@node2 HighGo4.3.4.8-see]$ pg_basebackup -h node1 -U sysdba -D $PGDATA -Fp -Xs -v -R

或

[root@node2 HighGo4.3.4.8-see]$ repmgr standby clone -h node1 -U sysdba -d highgo

Password:

pg_basebackup: initiating base backup, waiting for checkpoint to complete

pg_basebackup: checkpoint completed

pg_basebackup: write-ahead log start point: 0/A5000028 on timeline 27

pg_basebackup: starting background WAL receiver

pg_basebackup: write-ahead log end point: 0/A50000F8

pg_basebackup: waiting for background process to finish streaming ...

pg_basebackup: base backup completed

3启动数据库

[root@node2 data]$ pg_ctl start

waiting for server to start....2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv4 address "0.0.0.0", port 5866

2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv6 address "::", port 5866

2020-04-29 14:16:50.405 HKT [8856] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866"

2020-04-29 14:16:50.491 HKT [8856] LOG: redirecting log output to logging collector process

2020-04-29 14:16:50.491 HKT [8856] HINT: Future log output will appear in directory "hgdb_log".

done

server started

4检查集群状态,恢复正常

[root@node2 ~]# repmgr cluster show

----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------

5启动守护进程

[root@node2 data]# repmgrd -d

更多详细信息请登录【瀚高技术支持平台】查看https://support.highgo.com/#/index/docContent/4e62e0a902d9bc01

上一篇：如何查看数据库是否开启了加密模式

下一篇：瀚高数据库并行导入导出

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯