The StartupXLOG function flow

Step 1 extracts information from the checkpoint XLOG record and uses it to update shared-memory variables such as nextFullXid, nextOid, and oidCount.

LastRec = RecPtr = checkPointLoc;
/* initialize shared memory variables from the checkpoint record */
ShmemVariableCache->nextFullXid = checkPoint.nextFullXid;
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;
MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
AdvanceOldestClogXid(checkPoint.oldestXid);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB, true);
SetCommitTsLimit(checkPoint.oldestCommitTsXid, checkPoint.newestCommitTsXid);
XLogCtl->ckptFullXid = checkPoint.nextFullXid;

void MultiXactSetNextMXact(MultiXactId nextMulti, MultiXactOffset nextMultiOffset) sets the next MultiXactId and offset to be assigned. It is used when the correct next ID/offset can be determined exactly from a checkpoint record. Although this is only called during bootstrap and XLog replay, the lock is taken in case any hot-standby backends are examining these values. During a binary upgrade, it also makes sure the offsets SLRU is large enough to contain the next value that would be created. This needs to happen early during the first startup in binary-upgrade mode: before StartupMultiXact(), in fact, because this routine is called even earlier, from StartupXLOG(). It cannot happen any earlier, though, because it is this first call that establishes the MultiXactState->nextMXact value that MaybeExtendOffsetSlru needs.

LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->nextMXact = nextMulti;
MultiXactState->nextOffset = nextMultiOffset;
LWLockRelease(MultiXactGenLock);
/* During a binary upgrade, make sure that the offsets SLRU is large enough to contain the next value that would be created.
* We need to do this pretty early during the first startup in binary
* upgrade mode: before StartupMultiXact() in fact, because this routine
* is called even before that by StartupXLOG(). And we can't do it
* earlier than at this point, because during that first call of this
* routine we determine the MultiXactState->nextMXact value that
* MaybeExtendOffsetSlru needs. */
if (IsBinaryUpgrade) MaybeExtendOffsetSlru();

AdvanceOldestClogXid advances the cluster-wide value for the oldest valid clog entry. CLogTruncationLock must be acquired to advance oldestClogXid. It is not necessary to hold the lock during the actual clog truncation, only while advancing the limit, because code that looks up an arbitrary xid is required to hold CLogTruncationLock from the time it tests oldestClogXid until it completes the clog lookup.

void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid) {
LWLockAcquire(CLogTruncationLock, LW_EXCLUSIVE);
if (TransactionIdPrecedes(ShmemVariableCache->oldestClogXid, oldest_datfrozenxid))
ShmemVariableCache->oldestClogXid = oldest_datfrozenxid;
LWLockRelease(CLogTruncationLock);
}

Step 2 initializes the replication slots, starts up the logical (ReorderBuffer) state, starts up MultiXact, starts up CommitTs, and starts up ReplicationOrigin.

/* Initialize replication slots, before there's a chance to remove required resources. */
StartupReplicationSlots();
/* Startup logical state, needs to be setup now so we have proper data during crash recovery. */
StartupReorderBuffer();
/* Startup MultiXact. We need to do this early to be able to replay truncations. */
StartupMultiXact();
/* Ditto for commit timestamps. Activate the facility if the setting is
 * enabled in the control file, as there should be no tracking of commit
 * timestamps done when the setting was disabled. This facility can be
 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record. */
if (ControlFile->track_commit_timestamp) StartupCommitTs();
/* Recover knowledge about replay progress of known replication partners. */
StartupReplicationOrigin();

StartupReplicationSlots loads all replication slots from disk into memory at server startup. This needs to run before crash recovery begins.

void StartupReplicationSlots(void) {
DIR *replication_dir;
struct dirent *replication_de;
replication_dir = AllocateDir("pg_replslot"); /* restore all slots by iterating over all on-disk entries */
while ((replication_de = ReadDir(replication_dir, "pg_replslot")) != NULL) {
struct stat statbuf;
char path[MAXPGPATH + 12];
if (strcmp(replication_de->d_name, ".") == 0 || strcmp(replication_de->d_name, "..") == 0)
continue;
snprintf(path, sizeof(path), "pg_replslot/%s", replication_de->d_name);
if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode)) /* we're only creating directories here, skip if it's not ours */
continue;

/* we crashed while a slot was being setup or deleted, clean up */
if (pg_str_endswith(replication_de->d_name, ".tmp")) {
if (!rmtree(path, true)) {
ereport(WARNING,(errmsg("could not remove directory \"%s\"", path)));
continue;
}
fsync_fname("pg_replslot", true);
continue;
}
RestoreSlotFromDisk(replication_de->d_name); /* looks like a slot in a normal state, restore */
}
FreeDir(replication_dir);
if (max_replication_slots <= 0) return; /* currently no slots exist, we're done. */
/* Now that we have recovered all the data, compute replication xmin */
ReplicationSlotsComputeRequiredXmin(false);
ReplicationSlotsComputeRequiredLSN();
}

StartupReorderBuffer deletes all data that was spilled to disk before the restart/crash. It will be recreated when the respective slots are reused.

void StartupReorderBuffer(void) {
DIR *logical_dir;
struct dirent *logical_de;
logical_dir = AllocateDir("pg_replslot");
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL) {
if (strcmp(logical_de->d_name, ".") == 0 || strcmp(logical_de->d_name, "..") == 0) continue;
/* if it cannot be a slot, skip the directory */
if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2)) continue;
/* ok, has to be a surviving logical slot, iterate and delete everything starting with xid-* */
ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
}
FreeDir(logical_dir);
}

StartupMultiXact must be called once during postmaster or standalone-backend startup. StartupXLOG has already established nextMXact/nextOffset by calling MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact, as well as the oldestMulti information from pg_control and/or MultiXactAdvanceOldest, but the WAL has not been replayed yet.

void StartupMultiXact(void) {
MultiXactId multi = MultiXactState->nextMXact;
MultiXactOffset offset = MultiXactState->nextOffset;
int pageno;
/* Initialize offset's idea of the latest page number. */
pageno = MultiXactIdToOffsetPage(multi);
MultiXactOffsetCtl->shared->latest_page_number = pageno;
/* Initialize member's idea of the latest page number. */
pageno = MXOffsetToMemberPage(offset);
MultiXactMemberCtl->shared->latest_page_number = pageno;
}

StartupCommitTs must be called once during postmaster or standalone-backend startup, after StartupXLOG has initialized ShmemVariableCache->nextFullXid.

void StartupCommitTs(void) { ActivateCommitTs(); }

StartupReplicationOrigin recovers replication replay status from the checkpoint data saved earlier by CheckPointReplicationOrigin. This only needs to be called at startup, not while reading every checkpoint during later recovery (e.g. in HS or in PITR from a base backup); all state after that point can be recovered by looking at commit records.

void StartupReplicationOrigin(void) {
const char *path = "pg_logical/replorigin_checkpoint";
int fd, readBytes;
uint32 magic = REPLICATION_STATE_MAGIC;
int last_state = 0;
pg_crc32c file_crc, crc;

if (max_replication_slots == 0) return;
INIT_CRC32C(crc);
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
/* might have had max_replication_slots == 0 last run, or we just brought up a standby. */
if (fd < 0 && errno == ENOENT) return;
else if (fd < 0)
ereport(PANIC, (errcode_for_file_access(), errmsg("could not open file \"%s\": %m", path)));

readBytes = read(fd, &magic, sizeof(magic)); /* verify magic, that is written even if nothing was active */
if (readBytes != sizeof(magic)) {
if (readBytes < 0) ereport(PANIC, (errcode_for_file_access(), errmsg("could not read file \"%s\": %m",path)));
else ereport(PANIC,(errcode(ERRCODE_DATA_CORRUPTED),errmsg("could not read file \"%s\": read %d of %zu", path, readBytes, sizeof(magic))));
}
COMP_CRC32C(crc, &magic, sizeof(magic));

if (magic != REPLICATION_STATE_MAGIC)
ereport(PANIC,(errmsg("replication checkpoint has wrong magic %u instead of %u", magic, REPLICATION_STATE_MAGIC)));
/* we can skip locking here, no other access is possible */
/* recover individual states, until there are no more to be found */
while (true) {
ReplicationStateOnDisk disk_state;
readBytes = read(fd, &disk_state, sizeof(disk_state));
/* no further data */
if (readBytes == sizeof(crc)) {
file_crc = *(pg_crc32c *) &disk_state; /* not pretty, but simple ... */
break;
}
if (readBytes < 0) {
ereport(PANIC,(errcode_for_file_access(), errmsg("could not read file \"%s\": %m",path)));
}
if (readBytes != sizeof(disk_state)) {
ereport(PANIC,(errcode_for_file_access(),errmsg("could not read file \"%s\": read %d of %zu",path, readBytes, sizeof(disk_state))));
}
COMP_CRC32C(crc, &disk_state, sizeof(disk_state));

if (last_state == max_replication_slots) ereport(PANIC,(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),errmsg("could not find free replication state, increase max_replication_slots")));

/* copy data to shared memory */
replication_states[last_state].roident = disk_state.roident;
replication_states[last_state].remote_lsn = disk_state.remote_lsn;
last_state++;
elog(LOG, "recovered replication state of node %u to %X/%X",disk_state.roident,(uint32) (disk_state.remote_lsn >> 32),(uint32) disk_state.remote_lsn);
}

FIN_CRC32C(crc); /* now check checksum */
if (file_crc != crc) ereport(PANIC,(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), errmsg("replication slot checkpoint has wrong checksum %u, expected %u",crc, file_crc)));
if (CloseTransientFile(fd)) ereport(PANIC,(errcode_for_file_access(),errmsg("could not close file \"%s\": %m",path)));
}

Step 3 initializes the unlogged LSN and temporarily adopts the TLI indicated by the checkpoint, since WAL entries must be replayed under the same TimeLineID they were created under.

/* Initialize unlogged LSN. On a clean shutdown, it's restored from the
 * control file. On recovery, all unlogged relations are blown away, so
 * the unlogged LSN counter can be reset too. */
if (ControlFile->state == DB_SHUTDOWNED)
XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
else
XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
/* We must replay WAL entries using the same TimeLineID they were created
 * under, so temporarily adopt the TLI indicated by the checkpoint (see
 * also xlog_redo()). */
ThisTimeLineID = checkPoint.ThisTimeLineID;

Step 4 mainly calls two functions: restoreTimeLineHistoryFiles and restoreTwoPhaseData.

restoreTimeLineHistoryFiles copies any missing timeline history files between 'now' and the recovery target timeline from the archive to pg_wal. While we don't need those files ourselves (the history file of the recovery target timeline covers all previous timelines in the history too), a cascading standby server might be interested in them. Or, if you archive the WAL from this server to a different archive than the master, it's good for all the history files to get archived there after failover, so that one of the old timelines can be used as a PITR target. Timeline history files are small, so it's better to copy them unnecessarily than to skip them and regret it later.

restoreTwoPhaseData scans pg_twophase before recovery runs and fills in its status, so that entries generated by redo can be worked on. Doing the scan before taking any recovery action has the merit of discarding any 2PC files that are newer than the first record to replay, avoiding conflicts at replay. It also avoids any subsequent scans when recovering the on-disk two-phase data.

/* Copy any missing timeline history files between 'now' and the recovery
* target timeline from archive to pg_wal. While we don't need those files
* ourselves - the history file of the recovery target timeline covers all
* the previous timelines in the history too - a cascading standby server
* might be interested in them. Or, if you archive the WAL from this
* server to a different archive than the master, it'd be good for all the
* history files to get archived there after failover, so that you can use
* one of the old timelines as a PITR target. Timeline history files are
* small, so it's better to copy them unnecessarily than not copy them and
* regret later. */
restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);

/* Before running in recovery, scan pg_twophase and fill in its status to
* be able to work on entries generated by redo. Doing a scan before
* taking any recovery action has the merit to discard any 2PC files that
* are newer than the first record to replay, saving from any conflicts at
* replay. This avoids as well any subsequent scans when doing recovery
* of the on-disk two-phase data. */
restoreTwoPhaseData();

lastFullPageWrites = checkPoint.fullPageWrites;
RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
doPageWrites = lastFullPageWrites;
if (RecPtr < checkPoint.redo) ereport(PANIC, (errmsg("invalid redo in checkpoint record")));
/* Check whether we need to force recovery from WAL. If it appears to have
 * been a clean shutdown and we did not have a recovery signal file, then
 * assume no recovery needed. */
if (checkPoint.redo < RecPtr) {
if (wasShutdown) ereport(PANIC, (errmsg("invalid redo record in shutdown checkpoint")));
InRecovery = true;
} else if (ControlFile->state != DB_SHUTDOWNED) InRecovery = true;
else if (ArchiveRecoveryRequested) InRecovery = true; /* force recovery due to presence of recovery signal file */