I. Environment

1.OS

CentOS Linux release 7.2.1511 (Core) X64

2.PostgreSQL

PostgreSQL 9.6.1

3.pg_rman

pg_rman-1.3.3-pg96.tar.gz v1.3.3

Note: download the source package that matches your PostgreSQL version.

https://github.com/ossc-db/pg_rman/releases/download/v1.3.3/pg_rman-1.3.3-pg96.tar.gz

pg_rman-1.3.3.tar.gz (this source package failed to compile)

System packages

zlib-devel
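
Before building, it can help to confirm that the toolchain is in place. The sketch below is an illustrative helper (the name `missing_tools` is made up for this article and is not part of pg_rman) that prints whichever of the given commands cannot be found on PATH.

```shell
#!/usr/bin/env bash
# Print each command name from the arguments that cannot be found on PATH.
# Hypothetical helper for illustration; not part of pg_rman.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (run manually): list any build tools that are absent.
# missing_tools pg_config gcc make
```

An empty result means every listed tool was found. On CentOS, `rpm -q zlib-devel` reports whether the zlib header package is installed.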


II. Installing pg_rman

1.Install pg_rman

Log in as root

export PATH=/opt/pgsql/9.6.1/bin:$PATH

export LD_LIBRARY_PATH=/opt/pgsql/9.6.1/lib

export MANPATH=/opt/pgsql/9.6.1/share/man:$MANPATH


# tar zxvf pg_rman-9_6_STABLE.tar.gz

# cd pg_rman-9_6_STABLE/

# make 

......

......

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 backup.o catalog.o data.o delete.o dir.o init.o parray.o pg_rman.o restore.o show.o util.o validate.o xlog.o pgsql_src/pg_ctl.o pgut/pgut.o pgut/pgut-port.o -L/opt/pgsql/9.6.1/lib -lpgcommon -lpgport -L/opt/pgsql/9.6.1/lib -lpq -L/opt/pgsql/9.6.1/lib -Wl,--as-needed -Wl,-rpath,'/opt/pgsql/9.6.1/lib',--enable-new-dtags  -lpgcommon -lpgport -lz -lreadline -lrt -lcrypt -ldl -lm -o pg_rman

# make install

/usr/bin/mkdir -p '/opt/pgsql/9.6.1/bin'

/usr/bin/install -c  pg_rman '/opt/pgsql/9.6.1/bin'

2.Verify the installation

# su - postgres

$ pg_rman --version

pg_rman 1.3.3


3.Configure database parameters

wal_level = replica

archive_mode = on

archive_command = 'test ! -f /pg_arclog/%f && cp %p /pg_arclog/%f'
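
The archive_command above refuses to overwrite an existing archive file: `test ! -f` fails when the target already exists, so `cp` never runs and the command returns non-zero. In it, `%p` is the path of the WAL segment to archive and `%f` is its file name. A minimal sketch of the same logic as a shell function (the name `archive_wal` is illustrative, and the archive directory is passed as an argument instead of being hard-coded):

```shell
#!/usr/bin/env bash
# archive_wal SRC_PATH FILE_NAME ARCHIVE_DIR
# Mirrors: test ! -f /pg_arclog/%f && cp %p /pg_arclog/%f
# Returns non-zero (and copies nothing) if the file is already archived.
archive_wal() {
    local src="$1" name="$2" dir="$3"
    test ! -f "$dir/$name" && cp "$src" "$dir/$name"
}
```

Calling it twice with the same file name copies the segment the first time and fails the second time, leaving the first copy untouched.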

--- root user

mkdir /backup_pg_rman /pg_arclog 

chown -R postgres:postgres /backup_pg_rman

chown -R postgres:postgres /pg_arclog

--- postgres user

$ pg_rman init -B $backup_dir


III. Backup and recovery tests

1.Back up data (full<0> + incremental<1>)

# full

export PGDATA=/pgdata96

export BACKUP_PATH=/backup_pg_rman


$ echo $PGDATA

/pgdata96

$ echo $BACKUP_PATH

/backup_pg_rman

--- init backup dir: pg_rman init -B $backup_dir -D $PGDATA (specify these by hand when the environment variables are not set; note that the paths must not end with a trailing '/')

$ pg_rman init

INFO: ARCLOG_PATH is set to '/pg_arclog'

INFO: SRVLOG_PATH is set to '/pgdata96/pg_log'



$ cat $BACKUP_PATH/pg_rman.ini

ARCLOG_PATH='/pg_arclog'

SRVLOG_PATH='/pgdata96/pg_log'
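
The init note above warns against paths that end in '/'. If the paths come from variables or user input, a small helper can normalize them first. This is only a sketch; `strip_trailing_slash` is a made-up name, not something pg_rman provides.

```shell
#!/usr/bin/env bash
# Remove any trailing '/' characters from a path, leaving "/" itself intact.
# Hypothetical helper; pg_rman does not ship anything like this.
strip_trailing_slash() {
    local path="$1"
    while [ "${#path}" -gt 1 ] && [ "${path%/}" != "$path" ]; do
        path="${path%/}"
    done
    printf '%s\n' "$path"
}

# Example (run manually):
# pg_rman init -B "$(strip_trailing_slash "$backup_dir")" -D "$(strip_trailing_slash "$PGDATA")"
```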


--- full backup

$ pg_rman backup --backup-mode=full --with-serverlog --progress

INFO: copying database files

Processed 1172 of 1172 files, skipped 0

INFO: copying archived WAL files

Processed 3 of 3 files, skipped 0

INFO: copying server log files

Processed 4 of 4 files, skipped 0

INFO: backup complete

INFO: Please execute 'pg_rman validate' to verify the files are correctly copied.


--- validate backup, status: done

$ pg_rman validate

INFO: validate: "2017-03-06 16:43:39" backup, archive log files and server log files by CRC

INFO: backup "2017-03-06 16:43:39" is valid


--- show backup, status: ok 

$ pg_rman show

==========================================================

 StartTime           Mode  Duration    Size   TLI  Status 

==========================================================

2017-03-06 16:43:39  FULL        0m    58MB     1  OK
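
The backup-then-validate sequence shown above can be wrapped in a single function. A sketch, assuming PGDATA and BACKUP_PATH are exported as above; the function name `backup_and_validate` and the `PG_RMAN_BIN` override variable are inventions for this example, and only flags that already appear in this article are used.

```shell
#!/usr/bin/env bash
# backup_and_validate MODE   (MODE is "full" or "incremental")
# Runs a backup, then validates it; stops at the first failing step.
# PG_RMAN_BIN is a hypothetical override so the function can be dry-run.
backup_and_validate() {
    local mode="$1"
    local rman="${PG_RMAN_BIN:-pg_rman}"
    case "$mode" in
        full|incremental) ;;
        *) echo "unknown backup mode: $mode" >&2; return 2 ;;
    esac
    "$rman" backup --backup-mode="$mode" --with-serverlog --progress &&
        "$rman" validate
}
```

Setting `PG_RMAN_BIN=echo` prints the commands instead of running them, which is a cheap way to check the wrapper before pointing it at a real backup catalog.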

--- incremental

$ pg_rman backup --backup-mode=incremental --with-serverlog --progress

INFO: copying database files

Processed 1172 of 1172 files, skipped 1115

INFO: copying archived WAL files

Processed 48 of 48 files, skipped 3

INFO: copying server log files

Processed 4 of 4 files, skipped 3

INFO: backup complete

INFO: Please execute 'pg_rman validate' to verify the files are correctly copied.

--- validate backup

$ pg_rman validate

INFO: validate: "2017-03-06 17:04:45" backup, archive log files and server log files by CRC

INFO: backup "2017-03-06 17:04:45" is valid

--- show, status: ok

$ pg_rman show detail

============================================================================================================

 StartTime           Mode  Duration    Data  ArcLog  SrvLog   Total  Compressed  CurTLI  ParentTLI  Status  

============================================================================================================

2017-03-06 17:04:45  INCR        0m   401MB   738MB    27kB  1136MB       false       1          0  OK

2017-03-06 16:43:39  FULL        0m    30MB    33MB   206kB    58MB       false       1          0  OK


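2.Simulate disaster recovery


1).Delete all files under the PGDATA directory

Stop the database, then delete the files. (The -m immediate mode used here shuts down abruptly without a final checkpoint; -m fast gives a clean stop.)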

$ pg_ctl stop -m immediate -D /pgdata96/

$ cd /pgdata96

$ rm -rf *


2).Restore the backup

--- postgres user

$ export PGDATA=/pgdata96

$ export BACKUP_PATH=/backup_pg_rman

$ pg_rman restore

WARNING: pg_controldata file "/pgdata96/global/pg_control" does not exist

WARNING: pg_controldata file "/pgdata96/global/pg_control" does not exist

INFO: the recovery target timeline ID is not given

INFO: use timeline ID of latest full backup as recovery target: 1

INFO: calculating timeline branches to be used to recovery target point

INFO: searching latest full backup which can be used as restore start point

INFO: found the full backup can be used as base in recovery: "2017-03-06 16:43:39"

INFO: copying online WAL files and server log files

INFO: clearing restore destination

INFO: validate: "2017-03-06 16:43:39" backup, archive log files and server log files by SIZE

INFO: backup "2017-03-06 16:43:39" is valid

INFO: restoring database files from the full mode backup "2017-03-06 16:43:39"

INFO: searching incremental backup to be restored

INFO: validate: "2017-03-06 17:04:45" backup, archive log files and server log files by SIZE

INFO: backup "2017-03-06 17:04:45" is valid

INFO: restoring database files from the incremental mode backup "2017-03-06 17:04:45"

INFO: searching backup which contained archived WAL files to be restored

INFO: backup "2017-03-06 17:04:45" is valid

INFO: restoring WAL files from backup "2017-03-06 17:04:45"

INFO: restoring online WAL files and server log files

INFO: generating recovery.conf

INFO: restore complete

HINT: Recovery will start automatically when the PostgreSQL server is started.
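
The restore above is run with the server stopped. A guard like the following can make that explicit in a script. This is a sketch under assumptions: `restore_if_stopped`, `PG_CTL_BIN`, and `PG_RMAN_BIN` are names invented here, and the check relies on `pg_ctl status` exiting with 0 only when the server is running.

```shell
#!/usr/bin/env bash
# restore_if_stopped [pg_rman restore options...]
# Refuses to restore while `pg_ctl status` reports a running server.
# PG_CTL_BIN and PG_RMAN_BIN are hypothetical overrides for dry runs.
# Expects PGDATA to be set in the environment.
restore_if_stopped() {
    local ctl="${PG_CTL_BIN:-pg_ctl}" rman="${PG_RMAN_BIN:-pg_rman}"
    if "$ctl" status -D "$PGDATA" >/dev/null 2>&1; then
        echo "server is still running; stop it before restoring" >&2
        return 1
    fi
    "$rman" restore "$@"
}
```

Any arguments are passed through to `pg_rman restore`, so the same function covers a plain restore and one with `--recovery-target-time`.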


3).Start the database and verify the data

# /etc/init.d/postgresql start

Starting PostgreSQL: ok

Switch to the postgres user, then verify the data.



Point-in-time recovery

Create test data

testdb=# create table tbl(id int primary key, first varchar(20),second varchar(20));

CREATE TABLE

testdb=# INSERT INTO tbl VALUES(generate_series(1,1000000), 'first'||(random()*(10^3))::integer, 'second'||(random()*(10^3))::integer);

INSERT 0 1000000



Take a full backup

--- postgres user

$ pg_rman backup --backup-mode=full --with-serverlog --progress

INFO: copying database files

Processed 1172 of 1172 files, skipped 0

INFO: copying archived WAL files

Processed 27 of 27 files, skipped 0

INFO: copying server log files

Processed 1 of 1 files, skipped 0

INFO: backup complete

INFO: Please execute 'pg_rman validate' to verify the files are correctly copied.

$ pg_rman show

==========================================================

 StartTime           Mode  Duration    Size   TLI  Status 

==========================================================

2017-03-07 16:57:33  FULL        0m   433MB     4  DONE

$ pg_rman validate

INFO: validate: "2017-03-07 16:57:33" backup, archive log files and server log files by CRC

INFO: backup "2017-03-07 16:57:33" is valid

$ pg_rman show

==========================================================

 StartTime           Mode  Duration    Size   TLI  Status 

==========================================================

2017-03-07 16:57:33  FULL        0m   433MB     4  OK
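
The two listings above show the Status column moving from DONE to OK once `pg_rman validate` has run. As a sketch, a script could scan `pg_rman show` output and flag any backup that has not reached OK. The function name `unvalidated_backups` is illustrative, and the parsing assumes the column layout printed above (data rows start with a date, status is the last column).

```shell
#!/usr/bin/env bash
# Read `pg_rman show` output on stdin and print the StartTime of every
# backup whose Status column (the last field) is not OK.
unvalidated_backups() {
    awk '/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / && $NF != "OK" { print $1, $2 }'
}

# Example (run manually):
# pg_rman show | unvalidated_backups
```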



Drop the table

testdb=# drop table tbl;

DROP TABLE

testdb=# \q


Stop the database

--- root user

# /etc/init.d/postgresql stop


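Restore the database to a specified point in time (the target must be later than the end of the full backup and earlier than the DROP TABLE)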

$ pg_rman restore --recovery-target-time '2017-03-07 16:58:33'

INFO: the recovery target timeline ID is not given

INFO: use timeline ID of current database cluster as recovery target: 4

INFO: calculating timeline branches to be used to recovery target point

INFO: searching latest full backup which can be used as restore start point

INFO: found the full backup can be used as base in recovery: "2017-03-07 16:57:33"

INFO: copying online WAL files and server log files

INFO: clearing restore destination

INFO: validate: "2017-03-07 16:57:33" backup, archive log files and server log files by SIZE

INFO: backup "2017-03-07 16:57:33" is valid

INFO: restoring database files from the full mode backup "2017-03-07 16:57:33"

INFO: searching incremental backup to be restored

INFO: searching backup which contained archived WAL files to be restored

INFO: backup "2017-03-07 16:57:33" is valid

INFO: restoring WAL files from backup "2017-03-07 16:57:33"

INFO: restoring online WAL files and server log files

INFO: generating recovery.conf

INFO: restore complete

HINT: Recovery will start automatically when the PostgreSQL server is started.
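
A recovery target only makes sense if it is later than the StartTime that `pg_rman show` reports for the base backup. Because both timestamps share the fixed `YYYY-MM-DD HH:MM:SS` layout, plain string comparison orders them correctly. A small sketch of that sanity check (the function name `target_after_backup` is made up for this example, and it is a necessary condition only, not proof that the target is reachable):

```shell
#!/usr/bin/env bash
# target_after_backup TARGET_TIME BACKUP_START_TIME
# Succeeds when TARGET_TIME sorts strictly after BACKUP_START_TIME.
# Both must use the same 'YYYY-MM-DD HH:MM:SS' format and time zone.
target_after_backup() {
    [ "$1" \> "$2" ]
}

# Example (run manually):
# target_after_backup '2017-03-07 16:58:33' '2017-03-07 16:57:33' && echo "target is plausible"
```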


Start the database

--- root user

# /etc/init.d/postgresql start


Verify the data

--- postgres user

$ psql testdb

psql (9.6.1)

Type "help" for help.


testdb=# \dt

        List of relations

 Schema | Name | Type  |  Owner   

--------+------+-------+----------

 public | tbl  | table | postgres

(1 row)


testdb=# select count(*) from tbl;

  count  

---------

 1000000

(1 row)


testdb=# \q




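Recovery after an abnormal shutdown

Description: if the database was stopped before a checkpoint completed successfully, data may be lost during recovery. The following shows how to troubleshoot the error.

Symptom: the database fails to start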

$ more postgresql-Mon.log 

2017-03-06 17:20:47 CST [3240]: [1-1] user=,db= LOG:  database system was interrupted; last known up at 2017-03-06 17:04:51 CST

2017-03-06 17:20:47 CST [3240]: [2-1] user=,db= LOG:  starting archive recovery

2017-03-06 17:20:47 CST [3240]: [3-1] user=,db= LOG:  invalid primary checkpoint record

2017-03-06 17:20:47 CST [3240]: [4-1] user=,db= LOG:  invalid secondary checkpoint record

2017-03-06 17:20:47 CST [3240]: [5-1] user=,db= PANIC:  could not locate a valid checkpoint record

2017-03-06 17:20:47 CST [3238]: [3-1] user=,db= LOG:  startup process (PID 3240) was terminated by signal 6: Aborted

2017-03-06 17:20:47 CST [3238]: [4-1] user=,db= LOG:  aborting startup due to startup process failure

2017-03-06 17:20:47 CST [3238]: [5-1] user=,db= LOG:  database system is shut down

2017-03-06 17:21:23 CST [3269]: [1-1] user=,db= LOG:  database system was interrupted; last known up at 2017-03-06 17:04:51 CST

2017-03-06 17:21:23 CST [3269]: [2-1] user=,db= LOG:  starting archive recovery

2017-03-06 17:21:23 CST [3269]: [3-1] user=,db= LOG:  invalid primary checkpoint record

2017-03-06 17:21:23 CST [3269]: [4-1] user=,db= LOG:  invalid secondary checkpoint record

2017-03-06 17:21:23 CST [3269]: [5-1] user=,db= PANIC:  could not locate a valid checkpoint record

2017-03-06 17:21:23 CST [3267]: [3-1] user=,db= LOG:  startup process (PID 3269) was terminated by signal 6: Aborted

2017-03-06 17:21:23 CST [3267]: [4-1] user=,db= LOG:  aborting startup due to startup process failure

2017-03-06 17:21:23 CST [3267]: [5-1] user=,db= LOG:  database system is shut down
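
The telltale line in the log above is `PANIC:  could not locate a valid checkpoint record`. A sketch of a check that looks for it in a server log file (the function name `has_checkpoint_panic` is hypothetical; pass whichever log file your server writes):

```shell
#!/usr/bin/env bash
# has_checkpoint_panic LOGFILE
# Succeeds if the log contains the PANIC emitted when startup cannot
# find a valid checkpoint record.
has_checkpoint_panic() {
    grep -q 'PANIC:.*could not locate a valid checkpoint record' "$1"
}

# Example (run manually):
# has_checkpoint_panic postgresql-Mon.log && echo "checkpoint record missing"
```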


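Resolution steps:

Reset the transaction log (a last resort).

Only the data as of the backup is kept; changes recorded only in the discarded WAL are lost.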

$ pg_resetxlog -f /pgdata96

Transaction log reset

Then start the database and verify a sample of the data.