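When archiving is enabled, WAL files under pg_wal in the database data directory are cleaned up automatically once they have been archived. The flow is as follows: WAL records are written to segment files on disk; when a segment fills up or pg_switch_wal() is called, a 000000xxxx.ready file is created; archive_command is then invoked, and once it succeeds the .ready file is renamed to .done. After each checkpoint, the database computes the oldest WAL file that still has to be kept, and every WAL file older than that one is cleaned up.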
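
To make the file names in this flow concrete, here is a minimal sketch of how an LSN maps to a WAL segment number and to the 24-hex-digit segment file name. It assumes the default 16 MB segment size; `lsn_to_segno` and `wal_file_name` are illustrative names, not PostgreSQL functions, although the arithmetic mirrors what the `XLByteToSeg` macro used later in this post does.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE (16u * 1024 * 1024)    /* assumption: default 16 MB segments */

/* Segment number that contains the given LSN (same idea as XLByteToSeg). */
static uint64_t lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/* Format the segment file name: timeline, high part, segment within that part. */
static void wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    assert(buflen >= 25);       /* 24 hex digits plus the terminating NUL */
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

For the LSN 0/51000000 on timeline 1 (the start position that appears in the walsender backtrace further down), this gives segment number 81 and the file name 000000010000000000000051; the archiver's marker file would be that name with .ready appended.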

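Which WAL files are kept is normally determined by the wal_keep_segments parameter (renamed wal_keep_size in PostgreSQL 13) and by the positions of the replication slots.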

The cleanup code in CreateCheckPoint is as follows:

/*
	 * Delete old log files, those no longer needed for last checkpoint to
	 * prevent the disk holding the xlog from growing full.
	 */
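	/* convert the current redo position into a WAL segment number */
	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
	/* compute the oldest WAL segment that must be kept right now */
	KeepLogSeg(recptr, &_logSegNo);
	_logSegNo--;
	/* remove or recycle every WAL file in pg_wal at or before segment _logSegNo */
	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

KeepLogSeg, which computes the segment to keep, is defined as follows:
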
/*
 * Retreat *logSegNo to the last segment that we need to retain because of
 * either wal_keep_segments or replication slots.
 *
 * This is calculated by subtracting wal_keep_segments from the given xlog
 * location, recptr and by making sure that that result is below the
 * requirement of replication slots.
 */
static void
KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
	XLogSegNo	segno;
	XLogRecPtr	keep;

	XLByteToSeg(recptr, segno, wal_segment_size);
	keep = XLogGetReplicationSlotMinimumLSN();

	/* compute limit for wal_keep_segments first */
	if (wal_keep_segments > 0)
	{
		/* avoid underflow, don't go below 1 */
		if (segno <= wal_keep_segments)
			segno = 1;
		else
			segno = segno - wal_keep_segments;
	}

	/* then check whether slots limit removal further */
	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
	{
		XLogSegNo	slotSegNo;

		XLByteToSeg(keep, slotSegNo, wal_segment_size);

		if (slotSegNo <= 0)
			segno = 1;
		else if (slotSegNo < segno)
			segno = slotSegNo;
	}

	/* don't delete WAL segments newer than the calculated segment */
	if (segno < *logSegNo)
		*logSegNo = segno;
}
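
The effect of a stalled slot is easiest to see with numbers. The following is a simplified, standalone model of the logic above, not the PostgreSQL code itself: it assumes 16 MB segments, folds the caller's redo segment in as a parameter, and leaves out the max_replication_slots check. The function and parameter names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define WAL_SEGMENT_SIZE (16u * 1024 * 1024)    /* assumption: default 16 MB segments */
#define INVALID_LSN      UINT64_C(0)            /* stands in for InvalidXLogRecPtr */

/*
 * Returns the oldest segment number that must be kept.
 *   redo_segno : segment of the last checkpoint's redo pointer
 *   insert_lsn : current WAL position (recptr in the original)
 *   keep_segs  : value of wal_keep_segments
 *   slot_min   : minimum restart LSN over all slots, INVALID_LSN if none
 */
static uint64_t oldest_segment_to_keep(uint64_t redo_segno, uint64_t insert_lsn,
                                       uint64_t keep_segs, uint64_t slot_min)
{
    uint64_t segno = insert_lsn / WAL_SEGMENT_SIZE;

    /* limit imposed by wal_keep_segments, without going below segment 1 */
    if (keep_segs > 0)
        segno = (segno <= keep_segs) ? 1 : segno - keep_segs;

    /* a slot can only move the limit further back, never forward */
    if (slot_min != INVALID_LSN)
    {
        uint64_t slot_segno = slot_min / WAL_SEGMENT_SIZE;

        if (slot_segno == 0)
            segno = 1;
        else if (slot_segno < segno)
            segno = slot_segno;
    }

    /* never keep less than the checkpoint itself requires */
    return (segno < redo_segno) ? segno : redo_segno;
}
```

With the redo pointer in segment 200, the insert position in segment 205 and wal_keep_segments = 10, the model returns 195 when there is no slot. Add a slot whose restart position is stuck at 0/51000000 (segment 81) and it returns 81, no matter how far the insert position advances afterwards.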

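Note that obsolete WAL files are not necessarily deleted outright. As long as the pool of pre-allocated future segments (the "future log") is not full, an old file is renamed and reused as a future segment; only when the pool is full is the file actually deleted.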
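
The recycle-or-delete rule can be pictured with a toy model. This is only a sketch of the decision described above, under the assumption that the pool of future segments is bounded by a single upper segment number; `choose_wal_action`, `process_obsolete` and their parameters are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WAL_RECYCLE, WAL_DELETE } WalAction;

/*
 * next_future_segno : segment number a recycled file would be renamed to
 * recycle_limit     : highest future segment number worth pre-creating
 */
static WalAction choose_wal_action(uint64_t next_future_segno, uint64_t recycle_limit)
{
    return (next_future_segno <= recycle_limit) ? WAL_RECYCLE : WAL_DELETE;
}

/* Process n_old obsolete segments; returns how many had to be deleted. */
static unsigned process_obsolete(unsigned n_old, uint64_t next_future_segno,
                                 uint64_t recycle_limit)
{
    unsigned deleted = 0;

    assert(next_future_segno > 0);
    for (unsigned i = 0; i < n_old; i++)
    {
        if (choose_wal_action(next_future_segno, recycle_limit) == WAL_RECYCLE)
            next_future_segno++;    /* that future slot is now occupied */
        else
            deleted++;              /* pool is full, the file is removed */
    }
    return deleted;
}
```

With five obsolete segments, the next free future segment at 101 and a limit of 103, three files are recycled (as 101, 102 and 103) and the remaining two are deleted.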

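The cause of this incident was that physical replication had been set up with primary_slot_name, so a matching physical replication slot existed on the primary. After the server failure was repaired, the standby was rebuilt from scratch and no longer used that slot. The slot therefore stopped advancing, and WAL files on the primary could not be cleaned up.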

Replication slots come in two kinds: logical and physical.

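Their main purpose is to ensure that the primary does not remove WAL segments until every standby (including logical subscribers) has received them, and that the primary does not remove rows that could cause a recovery conflict, even while a standby is disconnected. The second guarantee is what creates the risk of table bloat.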

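The usual reason for setting primary_slot_name is to prevent WAL from being removed from the primary's pg_wal while the standby is offline for a long time, which would leave the standby unable to catch up with the primary.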

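When the primary has WAL archiving, however, there is little point in setting it, because a standby that has fallen behind can typically fetch the missing segments from the archive. Be especially careful when starting the standby: if it was previously configured with a replication slot, every later start must carry the same parameter value. Otherwise the slot on the primary stays stuck where it was and blocks the cleanup of both dead tuples and WAL.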
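
To quantify how much WAL a stuck slot is pinning, compare the slot's restart_lsn (from the pg_replication_slots view) with the current WAL position (from pg_current_wal_lsn()). The sketch below only does the arithmetic on the textual LSN form; `parse_lsn` and `retained_wal_bytes` are illustrative helper names, and the LSN values in the usage note are invented.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse the textual "hi/lo" hex form of an LSN; returns 0 on success, -1 on bad input. */
static int parse_lsn(const char *text, uint64_t *lsn)
{
    uint32_t hi, lo;
    char     trailing;

    assert(text != NULL && lsn != NULL);
    /* exactly two hex fields must match; a third conversion means trailing junk */
    if (sscanf(text, "%" SCNx32 "/%" SCNx32 "%c", &hi, &lo, &trailing) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/* Bytes of WAL the primary has to keep because of a slot's restart position. */
static uint64_t retained_wal_bytes(uint64_t current_lsn, uint64_t restart_lsn)
{
    return (current_lsn > restart_lsn) ? current_lsn - restart_lsn : 0;
}
```

For example, a slot stuck at 0/51000000 while the primary has reached 3/51000000 pins 12884901888 bytes (12 GiB) of WAL. If the slot is genuinely no longer needed, dropping it with pg_drop_replication_slot() lets the next checkpoint clean up again.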

The walsender process stack is as follows:

#0  0x00007fa4590255e3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000008c0b16 in WaitEventSetWaitBlock (set=0x2c7dae8, cur_timeout=29999, occurred_events=0x7ffd6db90980, nevents=1) at latch.c:1080
#2  0x00000000008c09ec in WaitEventSetWait (set=0x2c7dae8, timeout=29999, occurred_events=0x7ffd6db90980, nevents=1, wait_event_info=83886092) at latch.c:1032
#3  0x00000000008c0149 in WaitLatchOrSocket (latch=0x7fa451a4eea4, wakeEvents=43, sock=10, timeout=29999, wait_event_info=83886092) at latch.c:407
#4  0x000000000087efcc in WalSndLoop (send_data=0x87f7d6 <XLogSendPhysical>) at walsender.c:2270
#5  0x000000000087cada in StartReplication (cmd=0x2c7ca30) at walsender.c:697
#6  0x000000000087e15a in exec_replication_command (cmd_string=0x2bf8ed8 "START_REPLICATION 0/51000000 TIMELINE 1") at walsender.c:1544
#7  0x00000000008f3e3c in PostgresMain (argc=1, argv=0x2c26598, dbname=0x2c26460 "", username=0x2bf5bc8 "debug") at postgres.c:4232
#8  0x0000000000847a08 in BackendRun (port=0x2c1c430) at postmaster.c:4437
#9  0x00000000008471d7 in BackendStartup (port=0x2c1c430) at postmaster.c:4128
#10 0x000000000084354e in ServerLoop () at postmaster.c:1704
#11 0x0000000000842de6 in PostmasterMain (argc=1, argv=0x2bf3b80) at postmaster.c:1377
#12 0x0000000000760aa1 in main (argc=1, argv=0x2bf3b80) at main.c:228


--The core processing happens inside the StartReplication function, shown below.
--By default (primary_slot_name not specified) the command received is "START_REPLICATION 0/51000000 TIMELINE 1"



/*
 * Handle START_REPLICATION command.
 *
 * At the moment, this never returns, but an ereport(ERROR) will take us back
 * to the main loop.
 */
static void
StartReplication(StartReplicationCmd *cmd)
{
	StringInfoData buf;
	XLogRecPtr	FlushPtr;
	/* check whether the current timeline ID is unset (0); timelines normally start at 1 */
	if (ThisTimeLineID == 0)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("IDENTIFY_SYSTEM has not been run before START_REPLICATION")));

	/*
	 * We assume here that we're logging enough information in the WAL for
	 * log-shipping, since this is checked in PostmasterMain().
	 *
	 * NOTE: wal_level can only change at shutdown, so in most cases it is
	 * difficult for there to be WAL data that we can still see that was
	 * written at wal_level='minimal'.
	 */
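	/*
	 * If primary_slot_name is set in the standby's recovery.conf (or in
	 * postgresql.conf from PG12 on), the START_REPLICATION command sent by
	 * the standby carries that slot name, so slotname is set here.
	 * Otherwise slotname is NULL.
	 */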
	if (cmd->slotname)
	{
		ReplicationSlotAcquire(cmd->slotname, true);
		if (SlotIsLogical(MyReplicationSlot))
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 (errmsg("cannot use a logical replication slot for physical replication"))));
	}
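
The difference between the two cases is visible in the command text itself. The real server parses replication commands with a proper grammar; the following toy parser is only a sketch of the documented form START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ], and it ignores quoting. `ToyStartReplicationCmd`, `toy_parse_start_replication` and the slot name standby1 are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    char     slotname[64];      /* empty string plays the role of slotname == NULL */
    char     startpoint[32];    /* start LSN as text, e.g. "0/51000000" */
    unsigned timeline;          /* 0 if no TIMELINE clause was given */
} ToyStartReplicationCmd;

/* Returns 0 on success, -1 if the text is not a physical START_REPLICATION command. */
static int toy_parse_start_replication(const char *text, ToyStartReplicationCmd *cmd)
{
    int consumed = 0;

    assert(text != NULL && cmd != NULL);
    memset(cmd, 0, sizeof *cmd);

    if (strncmp(text, "START_REPLICATION ", 18) != 0)
        return -1;
    text += 18;

    /* optional "SLOT name" clause: present only if primary_slot_name was set */
    if (sscanf(text, "SLOT %63s %n", cmd->slotname, &consumed) == 1 && consumed > 0)
        text += consumed;

    /* optional PHYSICAL keyword */
    if (strncmp(text, "PHYSICAL ", 9) == 0)
        text += 9;

    /* mandatory start position, optional timeline */
    if (sscanf(text, "%31s TIMELINE %u", cmd->startpoint, &cmd->timeline) < 1)
        return -1;
    return 0;
}
```

Parsing "START_REPLICATION 0/51000000 TIMELINE 1" leaves slotname empty, which corresponds to skipping the ReplicationSlotAcquire branch above, so the slot on the primary is never touched. Parsing "START_REPLICATION SLOT standby1 0/51000000 TIMELINE 1" fills in slotname, which corresponds to the branch where the slot is acquired.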