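Introduction
Greenplum is an open-source distributed database aimed at OLAP workloads. It is also widely used for OLTP workloads in fields such as banking, finance, and logistics. Keeping distributed transactions consistent is one of the central problems in distributed systems. Common strategies include two-phase commit (2PC), three-phase commit (3PC), TCC, and Paxos-based distributed commit protocols.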
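1 Key data structures
GlobalTransactionData: the global transaction struct. It describes a transaction that is in the prepared state or about to become prepared. It holds the transaction ID, the start and end LSN of the prepare record, and the gid, which is the shared identifier used between the QD and the QEs.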
/*
* This struct describes one global transaction that is in prepared state
* or attempting to become prepared.
*
* The lifecycle of a global transaction is:
*
* 1. After checking that the requested GID is not in use, set up an entry in
* the TwoPhaseState->prepXacts array with the correct GID and valid = false,
* and mark it as locked by my backend.
*
* 2. After successfully completing prepare, set valid = true and enter the
* referenced PGPROC into the global ProcArray.
*
* 3. To begin COMMIT PREPARED or ROLLBACK PREPARED, check that the entry is
* valid and not locked, then mark the entry as locked by storing my current
* backend ID into locking_backend. This prevents concurrent attempts to
* commit or rollback the same prepared xact.
*
* 4. On completion of COMMIT PREPARED or ROLLBACK PREPARED, remove the entry
* from the ProcArray and the TwoPhaseState->prepXacts array and return it to
* the freelist.
*
* Note that if the preparing transaction fails between steps 1 and 2, the
* entry must be removed so that the GID and the GlobalTransaction struct
* can be reused. See AtAbort_Twophase().
*
* typedef struct GlobalTransactionData *GlobalTransaction appears in
* twophase.h
*/
typedef struct GlobalTransactionData
{
    GlobalTransaction next;         /* list link for free list */
    int         pgprocno;           /* ID of associated dummy PGPROC */
    BackendId   dummyBackendId;     /* similar to backend id for backends */
    TimestampTz prepared_at;        /* time of preparation */

    /*
     * Note that we need to keep track of two LSNs for each GXACT. We keep
     * track of the start LSN because this is the address we must use to read
     * state data back from WAL when committing a prepared GXACT. We keep
     * track of the end LSN because that is the LSN we need to wait for prior
     * to commit.
     */
    XLogRecPtr  prepare_start_lsn;  /* XLOG offset of prepare record start */
    XLogRecPtr  prepare_end_lsn;    /* XLOG offset of prepare record end */
    TransactionId xid;              /* The GXACT id */
    Oid         owner;              /* ID of user that executed the xact */
    BackendId   locking_backend;    /* backend currently working on the xact */
    bool        valid;              /* true if PGPROC entry is in proc array */
    bool        ondisk;             /* true if prepare state file is on disk */
    bool        inredo;             /* true if entry was added via xlog_redo */
    char        gid[GIDSIZE];       /* The GID assigned to the prepared xact */
} GlobalTransactionData;
TMGXACT: global (distributed) transaction information
typedef struct TMGXACT
{
    /*
     * Like PGPROC->xid to local transaction, gxid is set if distributed
     * transaction needs two-phase, and it's reset when distributed
     * transaction ends, with ProcArrayLock held.
     */
    DistributedTransactionId gxid;  /* used for two-phase commit */

    /*
     * This is similar to xmin of PROC, stores lowest dxid on first snapshot
     * by process with this as MyTmGxact.
     */
    DistributedTransactionId xminDistributedSnapshot;
    bool        includeInCkpt;
    int         sessionId;          /* session identifier */
} TMGXACT;
TwoPhaseStateData: 2PC state information
/*
* Two Phase Commit shared state. Access to this struct is protected
* by TwoPhaseStateLock.
*/
typedef struct TwoPhaseStateData
{
    /* Head of linked list of free GlobalTransactionData structs */
    GlobalTransaction freeGXacts;

    /* Number of valid prepXacts entries. */
    int         numPrepXacts;

    /* There are max_prepared_xacts items in this array */
    GlobalTransaction prepXacts[FLEXIBLE_ARRAY_MEMBER];
} TwoPhaseStateData;

static TwoPhaseStateData *TwoPhaseState;
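The four lifecycle steps in the GlobalTransactionData comment, together with the free list and the prepXacts array, can be illustrated with a small single-threaded sketch. Everything prefixed with Sketch or sketch_ is hypothetical and is not the Greenplum/PostgreSQL implementation: the real code runs under TwoPhaseStateLock, tracks a dummy PGPROC, and sizes the array by max_prepared_xacts. Releasing the lock at the end of step 2 is an assumption made here so that step 3 can find the entry unlocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_GIDSIZE   64     /* assumed size, not the real GIDSIZE */
#define SKETCH_MAX_XACTS 4      /* stands in for max_prepared_xacts */
#define NO_BACKEND       (-1)   /* stands in for an invalid backend id */

typedef struct SketchGxact
{
    struct SketchGxact *next;   /* free-list link */
    int  locking_backend;       /* NO_BACKEND when nobody works on it */
    bool valid;                 /* true once prepare has completed */
    char gid[SKETCH_GIDSIZE];
} SketchGxact;

typedef struct SketchState
{
    SketchGxact *freeGXacts;                    /* head of the free list */
    int          numPrepXacts;                  /* used slots in prepXacts */
    SketchGxact *prepXacts[SKETCH_MAX_XACTS];
    SketchGxact  pool[SKETCH_MAX_XACTS];        /* backing storage */
} SketchState;

/* Put every pool entry on the free list. */
static void sketch_init(SketchState *s)
{
    memset(s, 0, sizeof(*s));
    for (int i = 0; i < SKETCH_MAX_XACTS; i++)
    {
        s->pool[i].next = s->freeGXacts;
        s->freeGXacts = &s->pool[i];
    }
}

static SketchGxact *sketch_find(SketchState *s, const char *gid)
{
    for (int i = 0; i < s->numPrepXacts; i++)
        if (strcmp(s->prepXacts[i]->gid, gid) == 0)
            return s->prepXacts[i];
    return NULL;
}

/* Step 1: the GID must be unused; take a free entry, valid = false, locked. */
static SketchGxact *sketch_reserve(SketchState *s, const char *gid, int backend)
{
    if (strlen(gid) >= SKETCH_GIDSIZE || sketch_find(s, gid) != NULL ||
        s->freeGXacts == NULL)
        return NULL;

    SketchGxact *g = s->freeGXacts;
    s->freeGXacts = g->next;
    g->next = NULL;
    g->valid = false;
    g->locking_backend = backend;
    strcpy(g->gid, gid);
    s->prepXacts[s->numPrepXacts++] = g;
    return g;
}

/* Step 2: prepare succeeded. Unlocking here is an assumption of the sketch. */
static void sketch_mark_prepared(SketchGxact *g)
{
    g->valid = true;
    g->locking_backend = NO_BACKEND;
}

/* Step 3: lock a valid, unlocked entry for COMMIT/ROLLBACK PREPARED. */
static SketchGxact *sketch_lock(SketchState *s, const char *gid, int backend)
{
    SketchGxact *g = sketch_find(s, gid);

    if (g == NULL || !g->valid || g->locking_backend != NO_BACKEND)
        return NULL;
    g->locking_backend = backend;
    return g;
}

/* Step 4: drop the entry from prepXacts and return it to the free list. */
static void sketch_remove(SketchState *s, SketchGxact *g)
{
    for (int i = 0; i < s->numPrepXacts; i++)
    {
        if (s->prepXacts[i] == g)
        {
            s->prepXacts[i] = s->prepXacts[--s->numPrepXacts];
            g->next = s->freeGXacts;
            s->freeGXacts = g;
            return;
        }
    }
}
```

In this sketch a second sketch_reserve with the same gid fails until sketch_remove has run, and sketch_lock refuses an entry that is not yet valid or is already locked, which mirrors the protection against concurrent COMMIT PREPARED / ROLLBACK PREPARED described in the comment.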
2 Source code walkthrough
2.1 Prepare phase
Phase one: prepare
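QD: calls doPrepareTransaction to initiate the prepare
1) First, derive the gid from the global transaction ID. [The gid can be thought of as a shared identifier between the QD and the QEs for one distributed transaction. The QD runs many distributed transactions, so the gid lets the QD and the QEs address the right one.]
2) Then build the prepare message, serialize it, and dispatch it over the libpq protocol to every QE involved in the transaction.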
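QE: calls performDtxProtocolCommand to perform the prepare
1) Parse and deserialize the prepare request sent by the QD, then enter the corresponding handler, PrepareTransaction;
2) Collect the 2PC information locally, including the TwoPhaseFileHeader, transaction locks, predicate locks, and MultiXact information [used later by Commit Prepared or Rollback];
3) Once that is done, write the prepare record to the local log and persist it, then release the resources used during this step [not including the resources held by the QE transaction itself].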
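If the QD receives a successful prepare result from every QE, it writes a DISTRIBUTED_COMMIT record locally and flushes it to disk. If it does not, it retries several times and finally rolls the transaction back.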
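The QD-side decision at the end of phase one can be sketched as follows. This is a minimal illustration under stated assumptions, not Greenplum code: phase_one, PrepareFn, CoordinatorLog, and the FlakySegments test double are hypothetical names, segments are contacted one after another rather than dispatched in parallel, and the retry limit is a plain parameter.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { DECISION_COMMIT, DECISION_ABORT } Decision;

/* Stand-in for the QD's local WAL: only records whether the
 * DISTRIBUTED_COMMIT record was written and flushed. */
typedef struct CoordinatorLog
{
    bool distributed_commit_flushed;
} CoordinatorLog;

/* Hypothetical callback: true if segment `seg` reports a successful prepare. */
typedef bool (*PrepareFn)(int seg, const char *gid, void *ctx);

/*
 * Ask every segment to prepare, retrying each one up to max_retries times.
 * Only when all segments succeed is the commit decision made durable.
 */
static Decision phase_one(const char *gid, int nsegs, int max_retries,
                          PrepareFn prepare, void *ctx, CoordinatorLog *log)
{
    log->distributed_commit_flushed = false;

    for (int seg = 0; seg < nsegs; seg++)
    {
        bool ok = false;

        for (int attempt = 0; attempt <= max_retries && !ok; attempt++)
            ok = prepare(seg, gid, ctx);

        if (!ok)
            return DECISION_ABORT;      /* caller rolls the transaction back */
    }

    log->distributed_commit_flushed = true;
    return DECISION_COMMIT;
}

/* Test double (hypothetical): one segment fails a fixed number of times. */
typedef struct FlakySegments
{
    int fail_seg;        /* segment that misbehaves */
    int failures_left;   /* how many more prepares it will fail */
    int calls;           /* total prepare requests seen */
} FlakySegments;

static bool flaky_prepare(int seg, const char *gid, void *ctx)
{
    FlakySegments *f = ctx;

    (void) gid;
    f->calls++;
    if (seg == f->fail_seg && f->failures_left > 0)
    {
        f->failures_left--;
        return false;
    }
    return true;
}
```

The ordering is the point of the sketch: the commit record is made durable only after the last successful prepare, so a crash before that moment can still resolve to an abort.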
2.2 Commit phase
Phase two: commit
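QD: after phase one completes, calls notifyCommittedDtxTransaction to send a commit prepared request to the QEs
1) First, derive the gid from the global transaction ID, exactly as in the prepare phase;
2) Then build the Commit Prepared message, serialize it, and dispatch it over the libpq protocol to every QE involved in the transaction.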
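QE: calls performDtxProtocolCommand to perform Commit Prepared
[DTX_PROTOCOL_COMMAND_COMMIT_PREPARED]
1) Parse and deserialize the Commit Prepared request sent by the QD, then enter the corresponding handler, performDtxProtocolCommitPrepared;
2) Start a new local transaction, update the transaction block and transaction state, and acquire the resources needed for the operation;
3) Call FinishPreparedTransaction, which actually performs the commit prepared:
a: Look up the global transaction by gid, read the TwoPhaseFile, and parse the 2PC state data recorded in phase one [header, subtransactions, and the databases and tables to be dropped or committed];
b: Using the parsed information, write the Commit Prepared record and persist it [once this record is persisted on every QE, the distributed transaction is complete: even if the QD or a QE crashes afterwards, replaying the log restores a consistent state]; then update the local CLOG;
c: Update the global variable ShmemVariableCache->latestCompletedXid, and remove this distributed transaction's PGPROC entry from the shared ProcArray together with its TwoPhaseState entry;
d: Finally, remove the two-phase state file from disk.
4) Release the memory, locks, and other resources held by this transaction.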
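Seen from a single QE, the two commands form a small state machine. The sketch below is illustrative only: SegXact, seg_handle, and the command enum are hypothetical, and booleans stand in for the WAL records, the CLOG update, and the saved 2PC state data. Treating a repeated Commit Prepared on an already committed transaction as success is an assumption of this sketch, chosen so that a QD retry is harmless.

```c
#include <stdbool.h>

/* Hypothetical command codes; the real protocol command for phase two is
 * DTX_PROTOCOL_COMMAND_COMMIT_PREPARED. */
typedef enum { CMD_PREPARE, CMD_COMMIT_PREPARED } SegCommand;

typedef enum { XACT_ACTIVE, XACT_PREPARED, XACT_COMMITTED } SegXactState;

typedef struct SegXact
{
    SegXactState state;
    bool prepare_record_flushed;  /* prepare record written and persisted */
    bool state_data_saved;        /* 2PC state (header, locks, ...) kept */
    bool commit_record_flushed;   /* Commit Prepared record persisted */
    bool clog_committed;          /* local commit status recorded */
} SegXact;

/* Returns true if the command succeeded (or had already been applied). */
static bool seg_handle(SegXact *x, SegCommand cmd)
{
    switch (cmd)
    {
        case CMD_PREPARE:
            if (x->state != XACT_ACTIVE)
                return false;
            /* Collect the 2PC state, then persist the prepare record. */
            x->state_data_saved = true;
            x->prepare_record_flushed = true;
            x->state = XACT_PREPARED;
            return true;

        case CMD_COMMIT_PREPARED:
            if (x->state == XACT_COMMITTED)
                return true;            /* assumed idempotent on retry */
            if (x->state != XACT_PREPARED)
                return false;
            /* b: persist the commit record first, then mark the CLOG. */
            x->commit_record_flushed = true;
            x->clog_committed = true;
            /* c/d: drop the in-memory entry and the saved state data. */
            x->state_data_saved = false;
            x->state = XACT_COMMITTED;
            return true;
    }
    return false;
}
```

The sketch rejects Commit Prepared for a transaction that was never prepared, and it discards the saved 2PC state only after the commit record is durable, matching the order of steps b to d.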
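When the QD receives the QE results, there are two cases:
1) If every QE completes commit prepared successfully, the QD calls doInsertForgetCommitted to write an XLOG_XACT_DISTRIBUTED_FORGET record locally, indicating that the distributed transaction has been fully committed. It is then visible to all subsequent transactions.
2) If results arrive from only some of the QEs, the QD retries the same commit prepared steps. If it still has not received every commit result after a certain number of attempts, it rolls the transaction back.
Finally, the QD releases the resources held by this distributed transaction, such as memory and locks, and updates the system information.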
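The QD-side retry loop of phase two can be sketched in the same style. All names here (phase_two, CommitPreparedFn, LossyLinks) are hypothetical, and the sketch makes two simplifying assumptions: only segments that have not yet acknowledged are contacted again, and when the retries are exhausted it simply reports PHASE2_INCOMPLETE and leaves the follow-up to the caller.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_SEGS 8   /* assumed upper bound for this sketch */

typedef enum { PHASE2_FORGOTTEN, PHASE2_INCOMPLETE } Phase2Result;

/* Hypothetical callback: true if segment `seg` acknowledged Commit Prepared. */
typedef bool (*CommitPreparedFn)(int seg, const char *gid, void *ctx);

/*
 * Send Commit Prepared until every segment has acknowledged or the retry
 * budget is used up. The forget record (XLOG_XACT_DISTRIBUTED_FORGET in the
 * text) is modeled by a flag and is written only after the last ack.
 */
static Phase2Result phase_two(const char *gid, int nsegs, int max_retries,
                              CommitPreparedFn commit_prepared, void *ctx,
                              bool *forget_record_written)
{
    bool acked[SKETCH_MAX_SEGS] = { false };

    *forget_record_written = false;
    if (nsegs > SKETCH_MAX_SEGS)
        return PHASE2_INCOMPLETE;

    for (int attempt = 0; attempt <= max_retries; attempt++)
    {
        int pending = 0;

        for (int seg = 0; seg < nsegs; seg++)
        {
            if (acked[seg])
                continue;               /* already committed there */
            acked[seg] = commit_prepared(seg, gid, ctx);
            if (!acked[seg])
                pending++;
        }

        if (pending == 0)
        {
            *forget_record_written = true;
            return PHASE2_FORGOTTEN;
        }
    }
    return PHASE2_INCOMPLETE;
}

/* Test double (hypothetical): each segment drops a fixed number of requests. */
typedef struct LossyLinks
{
    int drops_left[SKETCH_MAX_SEGS];
    int sends;
} LossyLinks;

static bool lossy_commit_prepared(int seg, const char *gid, void *ctx)
{
    LossyLinks *l = ctx;

    (void) gid;
    l->sends++;
    if (l->drops_left[seg] > 0)
    {
        l->drops_left[seg]--;
        return false;
    }
    return true;
}
```

Because a QE may receive the same Commit Prepared more than once under this scheme, the QE-side handling has to tolerate repeats.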