WAL相当于oracle中的redo log,mysql中的redolog,9.6及之前名为xlog,10+当前在pg_wal文件夹中,wal段默认大小为16M,在initdb时可以指定大小,后续原则上不可以修改。可以通过​​pg_waldump​​查看二进制日志的内容。​​wal的结构解析​​虽然不完全正确,比如LSN的物理文件ID解析就不正确)。wal的物理结构(定义在src/include/access/xlogrecord.h)如下:

postgresql中clog(commit log)内部实现、及与wal、commit的关系及细节_#define

typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */

/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including page
* image) */

/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;

  其中BlockHeader和block data一一对应,其中场景之一是第一次写入块或checkpoint之后第一次修改块时会生成block data。所以每个日志条目的最小长度为24字节+n字节(记录null标记,1个bit 1位,所以等于n字段数/8)+main data。

具体为:解析wal记录的时候,首先读取XLogRecord,并根据xl_tot_len读取整条记录,然后根据其中的xl_rmid判断当前是哪个RMGR,调用具体rmgr read回调,RMGR首先根据xl_info(info是第二个关键口子,也是RMGR后的第一个关键依据)判断接下去是XLogRecordBlockHeader(0个或多个)、XLogRecordDataHeader(有且只有一个)还是其它(就是一事一议)。

   WAL归档的执行过程可见​​https://wiki.moritetu.xyz/?PostgreSQL/%E8%A7%A3%E6%9E%90/WAL%E3%82%A2%E3%83%BC%E3%82%AB%E3%82%A4%E3%83%96​​。

​PostgreSQL transaction-commit-log manager​​​,主要在clog.c中实现)里面记录了事务的执行状态,每次事务提交和回滚的时候,都需要更新该状态(调用CommitTransactionCommand(void)),PostgreSQL服务器访问该文件确定事务的状态,保存在pg_xact目录中,每个文件大小为256KB,每个事务2位(bit),故1个文件可以包含131072个事务。对于第一次修改的数据行来说,因为事务状态存储在clog中,所以修改后第一次判断行的可见性需要通过访问clog来确定,而访问clog是一个非常耗费性能的过程,故关于​​clog访问优化​​,有一个很长的discussion。

  事务在clog中的状态包括:

/*
* Possible transaction statuses --- note that all-zeroes is the initial
* state.
*
* A "subcommitted" transaction is a committed subtransaction whose parent
* hasn't committed or aborted yet.
*/
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS 0x00
#define TRANSACTION_STATUS_COMMITTED 0x01
#define TRANSACTION_STATUS_ABORTED 0x02
#define TRANSACTION_STATUS_SUB_COMMITTED 0x03

  在clog.c中。

  因为pg的MVCC在文件中实现undo,即使事务回滚了,新创建的行也不会被删除,但是因为clog中记录了事务的执行状态,所以​​其他事务在xmin和xmax判断时候可以过滤掉或不过滤掉这些记录​​(主要是xmax=0的情况,因为此时可能提交了、也可能稍微提交)。

pg_xact(9.6及之前名为pg_clog,虽然代码中还是clog.c)

[postgres@hs-10-20-30-194 pg_xact]$ ll
total 13208
-rw------- 1 postgres postgres 262144 May 24 17:26 0000
-rw------- 1 postgres postgres 262144 May 24 17:26 0001
-rw------- 1 postgres postgres 262144 May 24 17:27 0002
-rw------- 1 postgres postgres 262144 May 24 17:27 0003
-rw------- 1 postgres postgres 262144 May 24 17:27 0004
-rw------- 1 postgres postgres 262144 May 24 17:28 0005
-rw------- 1 postgres postgres 262144 May 24 17:28 0006
-rw------- 1 postgres postgres 262144 May 24 17:28 0007

  clog和wal的交互:这得先理解事务的完整过程。

  

postgresql中clog(commit log)内部实现、及与wal、commit的关系及细节_bc_02

  在AM层,调用xlog相关接口将WAL条目写入WAL文件,PortalDrop清理执行完成后,主入口exec_simple_query()->finish_xact_command()会依次调用CommitTransactionCommand()->CommitTransaction()->RecordTransactionCommit()->XactLogCommitRecord()调用XLogInsert()将commit wal条目写入WAL文件,然后RecordTransactionCommit()调用XLogFlush刷新commit WAL日志,然后调用TransactionIdCommitTree()更新clog。TransactionIdCommitTree->TransactionIdSetTreeStatus->TransactionIdSetPageStatus->TransactionIdSetPageStatusInternal,然后根据pageno找到slotno(使用​​slru​​简单最近最少访问算法管理),调用TransactionIdSetStatusBit(其根据xid找到偏移量,然后进行位运算更新事务状态)

* Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
* CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
* and CLOG segment numbering at
* 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
* explicit notice of that fact in this module, except when comparing segment
* and page numbers in TruncateCLOG (see CLOGPagePrecedes).
*/

/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4 每字节包含的事务数
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) 每BLOCK包含的事务数,32768
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1) 0x11

#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE) 根据事务ID找到页,事务ID 整除 32768
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) 页内事务相对顺序号偏移量 事务ID 取余 32768
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) 页内字节偏移量
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE) 字节内事务相对顺序号偏移量 事务ID 取余 4

postgresql中clog(commit log)内部实现、及与wal、commit的关系及细节_#define_03

 

 /* We store the latest async LSN for each group of transactions *

#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)

#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)

重启恢复的时候哪里会用到?2、判断元祖可见性的时候哪里会调用到?

  所有的元祖在被fetch时,都会检查xmin、xmax是否已经提交,如果infomask_2上没有标记的话,就回去clog缓存区查询,如下:

TransactionIdGetStatus clog.c:654
TransactionLogFetch transam.c:79
TransactionIdDidCommit transam.c:129
HeapTupleSatisfiesMVCC heapam_visibility.c:1058
HeapTupleSatisfiesVisibility heapam_visibility.c:1695
heapgetpage heapam.c:476
heapgettup_pagemode heapam.c:917
heap_getnextslot heapam.c:1390
table_scan_getnextslot tableam.h:906
SeqNext nodeSeqscan.c:80
ExecScanFetch execScan.c:133
ExecScan execScan.c:182
ExecSeqScan nodeSeqscan.c:112
ExecProcNodeFirst execProcnode.c:454
ExecProcNode executor.h:248
ExecutePlan execMain.c:1632
standard_ExecutorRun execMain.c:350
CitusExecutorRun multi_executor.c:214
pgss_ExecutorRun pg_stat_statements.c:1043
pgsk_ExecutorRun pg_stat_kcache.c:1034
pgqs_ExecutorRun pg_qualstats.c:661
explain_ExecutorRun auto_explain.c:334
ExecutorRun execMain.c:292
PortalRunSelect pquery.c:912
PortalRun pquery.c:756
exec_simple_query postgres.c:1325
PostgresMain postgres.c:4415
BackendRun postmaster.c:4527
BackendStartup postmaster.c:4211
ServerLoop postmaster.c:1740
PostmasterMain postmaster.c:1413
main main.c:231
__libc_start_main 0x00007f3353efd555
_start 0x0000000000483aa9

 

/*
* information stored in t_infomask:
*/
#define HEAP_HASNULL 0x0001 /* has null attribute(s) */
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
#define HEAP_HASOID_OLD 0x0008 /* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
#define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker */

/* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)

#define HEAP_LOCK_MASK (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */
#define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF 0x4000 /* moved to another place by pre-9.0
* VACUUM FULL(cluster同义词); kept for binary
* upgrade support */
#define HEAP_MOVED_IN 0x8000 /* moved from another place by pre-9.0
* VACUUM FULL; kept for binary
* upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */

  每当一个新的clog页面(和pg中其他页面一样,也是BLCKSZ宏定义,默认8KB)被初始化为0的时候,clog.c就会生成一条wal记录。xact.c中针对提交和回滚操作的记录(recording)也会写clog。对于同步提交:在clog记录commit前,XLOG会确保被刷新,所以WAL可以自动被保证。对于异步提交:必须跟踪最新的LSN影响的每个CLOG页,这样才能刷新响应的xlog。clog的细节描述具体可以参见:​​https://www.interdb.jp/pg/pgsql05.html​​​。clog的清理参见:​​https://www.interdb.jp/pg/pgsql06.html#_6.4.​​,由vacuum freeze负责清理。

postgresql中clog(commit log)内部实现、及与wal、commit的关系及细节_3c_04


其他目录说明:

  • pg_logical
  • pg_commit_ts
  • pg_multixact
  • pg_subtrans
  • pg_snapshots
  • pg_replslot
  • pg_dynshmem

​9.6目录说明​

​10.0+目录说明​​(到14为止未在发生调整) 

​https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND​

vacuum的内部调用链路

[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0 0x00007f7eb7bb0a86 in ?? () from /lib64/libpthread.so.0
#1 0x00007f7eb7bb0b78 in ?? () from /lib64/libpthread.so.0
#2 0x0000000000710202 in PGSemaphoreLock (sema=0x7f4e36e3e8b8) at pg_sema.c:327
#3 0x0000000000785cf5 in LWLockAcquire (lock=0x7f7deb25bb00, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1331
#4 0x0000000000764d2a in StartBufferIO (buf=buf@entry=0x7f4e5a3a7b00, forInput=forInput@entry=true) at bufmgr.c:4188
#5 0x0000000000766cdd in BufferAlloc (foundPtr=0x7ffe45a794fb, strategy=0x2501320, blockNum=<optimized out>, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x258bc30) at bufmgr.c:1286
#6 ReadBuffer_common (smgr=0x258bc30, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=4198708, mode=RBM_NORMAL, strategy=0x2501320, hit=0x7ffe45a795af) at bufmgr.c:761
#7 0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7f4e367f7548, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=4198708, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#8 0x00000000004dbcc2 in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1092
#9 heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1 0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2 0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3 0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4 heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6 vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7 0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9 do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1 0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2 0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3 0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4 heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6 vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7 0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9 do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$


[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0 0x00007f1394acb987 in select () from /lib64/libc.so.6
#1 0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2 0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3 0x00000000004e8635 in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=81599) at nbtree.c:1109
#4 0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#5 0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#6 0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#7 0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#8 0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#9 heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f1394acb987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231
[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0 0x00007f1395222ff4 in pread64 () from /lib64/libpthread.so.0
#1 0x000000000076cb41 in FileRead (file=<optimized out>, buffer=buffer@entry=0x7efd0674fb00 "Q\002", amount=amount@entry=8192, offset=offset@entry=814096384, wait_event_info=wait_event_info@entry=167772173) at fd.c:2102
#2 0x00000000007902a0 in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=99377, buffer=0x7efd0674fb00 "Q\002") at md.c:663
#3 0x0000000000766a05 in ReadBuffer_common (smgr=0x10cc198, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=99377, mode=RBM_NORMAL, strategy=0x10b2550, hit=0x7fff02c74e9f) at bufmgr.c:916
#4 0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7efca810b6d8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=99377, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#5 0x00000000004e864c in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=99377) at nbtree.c:1117
#6 0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#7 0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#8 0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#9 0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#10 0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#11 heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#12 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#13 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#14 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#15 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#16 do_autovacuum () at autovacuum.c:2467
#17 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#18 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#19 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#20 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#21 <signal handler called>
#22 0x00007f1394acb987 in select () from /lib64/libc.so.6
#23 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#24 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#25 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231


[lightdb@hs-10-20-30-217 data]$ pstack 4021998
#0 0x00007f78c40ac987 in select () from /lib64/libc.so.6
#1 0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2 0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3 0x00000000004dce20 in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x2a18588, vacrelstats=<optimized out>, params=0x29d03cc, onerel=<optimized out>) at vacuumlazy.c:984
#4 heap_vacuum_rel (onerel=<optimized out>, params=0x29d03cc, bstrategy=<optimized out>) at vacuumlazy.c:518
#5 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x29d03cc, rel=0x7f61d764f158) at ../../../src/include/access/tableam.h:1463
#6 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x29d03cc) at vacuum.c:1893
#7 0x000000000062e1cb in vacuum (relations=0x2a20378, params=params@entry=0x29d03cc, bstrategy=<optimized out>, bstrategy@entry=0x29d0158, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x29d0158, tab=0x29d03c8) at autovacuum.c:3137
#9 do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f78c40ac987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x28f4110) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x28f4110) at main.c:231

 ​​https://cloud.tencent.com/developer/article/2027383​

wal的内部实现解析

  WALbuffer分配在共享内存中,最小64KB,通过XLogCtlInsert控制具体backend的申请,作为XLogCtlData的成员对外提供,无单独访问接口。
45000 53000 会话1
61000 69000 会话2
69000 77000 会话1
通过调试的指针地址从上可见,确实是共享。
  XLogInsert->XLogInsertRecord->CopyXLogRecordToWAL,只负责拷贝到wal buffer中的buffer(根据lsn位置/wal块大小计算而来,类似整除,会提前计算wal lsn_start和lsn_end)(包括过大到下一页),并不写入wal文件。
LOG: INSERT @ 0/2091D708: - Heap/INSERT: off 3 flags 0x08
LOG: INSERT @ 0/2091D778: - Transaction/COMMIT: 2022-10-16 23:44:59.926298+08
  上述仅表示生成到XLogInsertRecord,无写入的意思。通过系统调用跟踪也可以看出,如下:
[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0.insert into t values('wocao"..., 8192, 0, NULL, NULL) = 47
lseek(6, 0, SEEK_END) = 8192
open("base/14199/16388_fsm", O_RDWR) = 7
lseek(7, 0, SEEK_END) = 24576
open("base/14199/16388", O_RDWR) = 8
lseek(8, 0, SEEK_END) = 245760
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=180, tv_usec=0}}, NULL) = 0
sendto(11, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,
------------insert没有产生pwrite64系统调用

[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13
open("lt_wal/000000010000000000000001", O_RDWR) = 9
pwrite64(9, "\6\321\5\0\1\0\0\0\0\0\222 \0\0\0\0%\32\0\0\0\0\0\0yyyyyyyy"..., 8192, 9568256) = 8192
fdatasync(9) = 0
sendto(10, "\2\0\0\0\250\3\0\0w7\0\0\10\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 936, 0, NULL, 0) = 936
sendto(10, "\2\0\0\0\350\1\0\0w7\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 488, 0, NULL, 0) = 488
sendto(10, "\16\0\0\0H\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 72, 0, NULL, 0) = 72
sendto(11, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,

  如果wal buffer 64KB,满了的话,那会怎么样呢?理论上需要8个单行8000字节的请求就可以占满,按照实现,是进行evict淘汰不再需要的wal buffer的,但是实际未写到wal文件的话,怎么evict出去呢?后面如果又写会怎么样?这个还需要研究!!!

  当前wal文件句柄保存在xlog.c文件的openLogFile全局变量中
  因为wal也是分8k为单位读写,通过XLogWritePages写入wal文件,每次8K*n字节,那如何保证每次写小量而不重复拷贝呢?比如几十字节?实际是在commit flush(RecordTransactionCommit函数中)的时候,每次写仍然是8KB为单位,具体由startoffset += XLogWritePages(from, npages, startoffset)每次写的from和startoffset参数可以看出。因为文件系统写以块为单位,所以写8KB(如果文件系统8KB块的话,如果512/4KB,则多了8倍或2倍)和80字节本质上的io是一样的。