explain select * from test where collecttime > '2023-02-22 00:00:00' and collecttime < '2023-02-28 00:00:00';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather Motion 12:1 (slice1; segments: 12) (cost=0.00..431.18 rows=39 width=1626)
-> Sequence (cost=0.00..431.02 rows=4 width=1626)
-> Partition Selector for test (dynamic scan id: 1) (cost=10.00..100.00 rows=9 width=4)
Partition selected: 6 (out of 33)
-> Dynamic Seq Scan on test (dynamic scan id: 1) (cost=0.00..431.02 rows=4 width=1626)
Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))
Optimizer: Pivotal Optimizer (GPORCA)
explain select * from test where collecttime > '2023-02-22 00:00:00' and collecttime < '2023-02-28 00:00:00';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather Motion 12:1 (slice1; segments: 12) (cost=0.00..431.18 rows=6 width=3748)
-> Append (cost=0.00..0.00 rows=1 width=3748)
-> Seq Scan on test_1_prt_p20230222_1 (cost=0.00..0.00 rows=1 width=3748)
Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))
...
-> Seq Scan on test_1_prt_p20230227_1 (cost=0.00..0.00 rows=1 width=3748)
Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))
Optimizer: Postgres query optimizer
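Comparing the two plans above shows that under the Postgres query optimizer each child partition gets its own Seq Scan node, and an Append node chains them together. Under the ORCA optimizer, a single Dynamic Seq Scan covers the scans of all child partitions that remain after partition pruning. This post focuses on the execution flow of Dynamic Seq Scan; the Partition Selector node will be covered later.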
Plan *CTranslatorDXLToPlStmt::TranslateDXLDynTblScan(const CDXLNode *dyn_tbl_scan_dxlnode, CDXLTranslateContext *output_context, CDXLTranslationContextArray *ctxt_translation_prev_siblings)
This function translates a DXL dynamic table scan node into a DynamicSeqScan node, which is then handed to the executor.
Unlike the feature-rich execScan.c, src/backend/executor/execDynamicScan.c provides only a handful of helper functions:
- isDynamicScan: returns true if the scan node is dynamic (i.e., determining relations to scan at runtime).
- DynamicScan_GetDynamicScanId: returns the index into the EState->dynamicTableScanInfo arrays for this dynamic scan node.
- DynamicScan_GetDynamicScanIdPrintable: returns a "printable" scan id for a node, for EXPLAIN.
- DynamicScan_GetTableOid: returns the Oid of the table/partition to scan.
- DynamicScan_SetTableOid: selects a partition to scan in a dynamic scan.
- DynamicScan_RemapExpression: re-maps the expression using the provided attMap.
These functions are all very simple, so they are not walked through here.
DynamicSeqScan data structures
DynamicSeqScan embeds a SeqScan. Its main member is partIndex, which indexes the pidIndexes and curRelOids arrays in EState->dynamicTableScanInfo. pidIndexes[partIndex - 1] is a hash table (HTAB) whose key is an Oid (the oid of a child partition) and whose entry is a PartOidEntry:
typedef struct PartOidEntry { Oid partOid; List *selectorList; /* list of partition selectors that produced the above part oid */ } PartOidEntry;
curRelOids[partIndex - 1] holds the oid of the child partition that is currently being scanned.
typedef struct DynamicSeqScan {
SeqScan seqscan; /* Fields shared with a normal SeqScan. Must be first! */
/* Index to arrays in EState->dynamicTableScanInfo, that contain information about the partitions that need to be scanned. */
int32 partIndex; /* mainly used to index the pidIndexes and curRelOids arrays in EState->dynamicTableScanInfo */
int32 partIndexPrintable;
} DynamicSeqScan;
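To make the 1-based partIndex lookup concrete, here is a minimal sketch. It is not Greenplum code: ToyDynamicTableScanInfo and the toy_* functions are hypothetical stand-ins, and the real pidIndexes entries are HTABs rather than the plain Oid arrays assumed here.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid type */

#define TOY_MAX_SCANS 4
#define TOY_MAX_PARTS 8

/* Hypothetical, simplified stand-in for EState->dynamicTableScanInfo. */
typedef struct ToyDynamicTableScanInfo {
    int numScans;                                 /* dynamic scan ids in use */
    Oid pidIndexes[TOY_MAX_SCANS][TOY_MAX_PARTS]; /* selected partition oids per scan */
    int numParts[TOY_MAX_SCANS];                  /* valid entries in each pidIndexes row */
    Oid curRelOids[TOY_MAX_SCANS];                /* partition currently being scanned */
} ToyDynamicTableScanInfo;

/* partIndex is 1-based, so every lookup subtracts one. */
static Oid toy_get_table_oid(const ToyDynamicTableScanInfo *info, int partIndex)
{
    assert(partIndex >= 1 && partIndex <= info->numScans);
    return info->curRelOids[partIndex - 1];
}

/* Record which partition the scan identified by partIndex is working on. */
static void toy_set_table_oid(ToyDynamicTableScanInfo *info, int partIndex, Oid relid)
{
    assert(partIndex >= 1 && partIndex <= info->numScans);
    info->curRelOids[partIndex - 1] = relid;
}
```

Each dynamic scan node in a plan owns one slot, so two Dynamic Seq Scan nodes with partIndex 1 and 2 never overwrite each other's current partition.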
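The scan_state field of DynamicSeqScanState takes one of four values:
- SCAN_INIT: we are initializing the scan state.
- SCAN_SCAN: all initializations for reading tuples are done and we are either reading tuples or ready to read tuples.
- SCAN_DONE: we are done with all relations/partitions, but the scan state is still valid for a ReScan (i.e., we haven't destroyed our scan state yet).
- SCAN_END: we are completely done. We cannot ReScan without redoing the whole initialization phase again.
firstPartition marks that the first child partition is being scanned. pidIndex refers to the set of all unique partition pids (partition oids); it is set to EState->dynamicTableScanInfo->pidIndexes[partIndex - 1] the first time ExecDynamicSeqScan is entered. pidStatus is used to walk sequentially through all oids in the pidIndex hash table, and the scan uses it to obtain the oid of the next child partition to scan.
Because child partitions are scanned by reusing the SeqScan functions, DynamicSeqScanState contains a seqScanState member. The ss_table hash table caches the seqScanState used for each child partition, which reduces seqScanState allocation and initialization work during a rescan. cached_relids stores the oids of the child partitions that have been scanned or are being scanned. lastRelOid stores the oid of the first child partition scanned after the partition table structure changed; it lets the scan reuse the table-structure metadata instead of fetching and re-initializing the qual, targetlist, and related structures again.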
typedef struct DynamicSeqScanState{
ScanState ss; // ss.ps.plan <- DynamicSeqScan ss.ps.state <- estate ss.ps.qual ss.ps.targetlist
int scan_state; /* the stage of scanning */
int eflags;
Index scanrelid; /* scanrelid is the RTE index for this scan node. It will be used to select varno whose varattno will be remapped, if necessary */
bool firstPartition; /* The first partition requires initialization of expression states, such as qual and targetlist, regardless of whether we need to re-map varattno */
HTAB *pidIndex; /* Pid index that maintains all unique partition pids for this dynamic table scan to scan. */
HASH_SEQ_STATUS pidStatus; /* The status of sequentially scan the pid index. */
/* Should we call hash_seq_term()? This is required to handle error condition, where we are required to explicitly call hash_seq_term(). Also, if we don't have any partition, this flag should prevent ExecEndDynamicSeqScan from calling hash_seq_term() on a NULL hash table. */
bool shouldCallHashSeqTerm;
SeqScanState *seqScanState;
HTAB *ss_table; /* key: Oid; entry: ScanOidEntry (typedef struct ScanOidEntry { Oid rel_id; void *ss; } ScanOidEntry) */
List *cached_relids;
Oid lastRelOid; /* lastRelOid is the last relation that corresponds to the varattno mapping of qual and target list. Each time we open a new partition, we will compare the last relation with current relation by using varattnos_map() and then convert the varattno to the new varattno */
MemoryContext partitionMemoryContext; /* This memory context will be reset per-partition to free up previous partition's memory */
} DynamicSeqScanState;
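The four scan_state stages can be read as a small lifecycle. The sketch below is an illustration only: the ToyScanStage enum, its numeric values, and both toy_* functions are hypothetical. The stage descriptions say that SCAN_DONE still permits a ReScan and SCAN_END does not; treating SCAN_INIT and SCAN_SCAN as rescannable is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copies of the four stage names; the numeric values are assumptions. */
typedef enum ToyScanStage {
    TOY_SCAN_INIT,  /* still initializing the scan state */
    TOY_SCAN_SCAN,  /* reading tuples, or ready to read them */
    TOY_SCAN_DONE,  /* all partitions consumed, scan state still valid */
    TOY_SCAN_END    /* torn down; needs a full re-initialization */
} ToyScanStage;

/* In this sketch only SCAN_END rules out a ReScan. */
static bool toy_can_rescan(ToyScanStage stage)
{
    return stage != TOY_SCAN_END;
}

/* Hypothetical rescan: a finished-but-valid scan goes back to scanning,
 * any other rescannable stage is left unchanged. */
static ToyScanStage toy_rescan(ToyScanStage stage)
{
    assert(toy_can_rescan(stage));
    return stage == TOY_SCAN_DONE ? TOY_SCAN_SCAN : stage;
}
```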
DynamicSeqScan functions
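Besides performing the same initialization as a SeqScan, ExecInitDynamicSeqScan sets firstPartition to true (the first partition is about to be scanned), caches in lastRelOid the oid of the table whose structure was last known to be unchanged, allocates the partitionMemoryContext used for each partition's qual and targetlist, and creates both the ss_table HTAB that caches the per-partition seqScanState and the cached_relids list of partition oids.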
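ExecDynamicSeqScan runs in two parts. First, on the first call, pidIndex is set to EState->dynamicTableScanInfo->pidIndexes[partIndex - 1]. Second, a for loop fetches tuples. Because DynamicSeqScan is built on the SeqScan interface, it has to switch partitions, and each partition has its own seqScanState. The loop proceeds as follows:
- If seqScanState is NULL, call initNextTableToScan to switch to a new partition.
- Call ExecSeqScan(node->seqScanState); if it yields a tuple, return it, otherwise continue to the next step.
- Call CleanupOnePartition(node) to clean up the seqScanState.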
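The three loop steps above can be sketched with plain arrays standing in for partitions. This is a simplified model, not the executor code: ToyPartition, ToyDynamicScan, and the toy_* functions are hypothetical, tuples are reduced to integers, and the array cursor nextPart plays the role that pidStatus plays over the pidIndex hash table.

```c
#include <assert.h>
#include <stddef.h>

/* Toy partition: an array of integer "tuples". */
typedef struct ToyPartition {
    const int *tuples;
    int ntuples;
} ToyPartition;

/* Toy stand-in for DynamicSeqScanState. */
typedef struct ToyDynamicScan {
    const ToyPartition *parts;   /* partitions that survived pruning */
    int nparts;
    int nextPart;                /* plays the role of pidStatus */
    const ToyPartition *current; /* plays the role of seqScanState; NULL = none open */
    int pos;                     /* read position inside the current partition */
} ToyDynamicScan;

/* Counterpart of initNextTableToScan: open the next partition, if any. */
static int toy_init_next_table(ToyDynamicScan *scan)
{
    if (scan->nextPart >= scan->nparts)
        return 0;                /* no partitions left */
    scan->current = &scan->parts[scan->nextPart++];
    scan->pos = 0;
    return 1;
}

/* Counterpart of ExecDynamicSeqScan: returns 1 and stores a tuple in *out,
 * or returns 0 once every partition is exhausted. */
static int toy_exec_dynamic_scan(ToyDynamicScan *scan, int *out)
{
    for (;;) {
        /* Step 1: no open partition, so switch to the next one. */
        if (scan->current == NULL && !toy_init_next_table(scan))
            return 0;
        /* Step 2: fetch a tuple from the open partition and return it. */
        if (scan->pos < scan->current->ntuples) {
            *out = scan->current->tuples[scan->pos++];
            return 1;
        }
        /* Step 3: partition exhausted, clean up and try the next one. */
        scan->current = NULL;
    }
}
```

The caller sees one continuous tuple stream; empty partitions are skipped inside the loop without returning to the parent node.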
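initNextTableToScan proceeds as follows:
- Use pidStatus to get the oid of the next partition to scan.
- Look in ss_table for a cached seqScanState belonging to that partition oid.
- Get the ss_currentRelation of the seqScanState. If the structure of the current relation differs from that of the previously scanned partition (lastRelOid), the varattno values in qual and targetlist must be updated and lastRelOid is set to the current partition's oid. If such an update is needed, or if this is the first partition, qual and targetlist are re-initialized in the partitionMemoryContext memory context. The current partition's oid is then stored in EState->dynamicTableScanInfo->curRelOids[partIndex - 1], which marks the partition being scanned.
- If the seqScanState was found in the ss_table cache, ExecReScan must be executed on it. Otherwise ExecInitSeqScanForPartition is called to create a new one, the new seqScanState is cached in ss_table, and the partition's oid is appended to cached_relids.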
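The cache-hit versus cache-miss decision in the last step can be sketched as follows. This is an illustrative model under stated assumptions: ToySeqScanState, ToyScanCache, and the toy_* functions are hypothetical, a linear array replaces the ss_table HTAB and the cached_relids list, and two counters stand in for the work done by ExecInitSeqScanForPartition and ExecReScan. The varattno remapping step is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid type */

#define TOY_MAX_CACHED 16

/* Toy per-partition scan state; the counters exist only for illustration. */
typedef struct ToySeqScanState {
    Oid relid;
    int initCount;    /* how often the state was built from scratch */
    int rescanCount;  /* how often a cached state was reset instead */
} ToySeqScanState;

/* Toy stand-in for ss_table plus cached_relids (linear array, not an HTAB). */
typedef struct ToyScanCache {
    ToySeqScanState entries[TOY_MAX_CACHED];
    int nentries;
} ToyScanCache;

/* Find the cached scan state for relid, or NULL if there is none. */
static ToySeqScanState *toy_lookup(ToyScanCache *cache, Oid relid)
{
    for (int i = 0; i < cache->nentries; i++)
        if (cache->entries[i].relid == relid)
            return &cache->entries[i];
    return NULL;
}

/* Return the scan state for relid: rescan on a hit, init and cache on a miss.
 * Returns NULL when the toy cache is full. */
static ToySeqScanState *toy_open_partition(ToyScanCache *cache, Oid relid)
{
    ToySeqScanState *state = toy_lookup(cache, relid);
    if (state != NULL) {
        state->rescanCount++;    /* cache hit: reset the existing state */
        return state;
    }
    if (cache->nentries >= TOY_MAX_CACHED)
        return NULL;
    state = &cache->entries[cache->nentries++];  /* cache miss: build and remember */
    state->relid = relid;
    state->initCount = 1;
    state->rescanCount = 0;
    return state;
}
```

When the same partition is opened a second time, for example during a rescan of the whole dynamic scan, only rescanCount grows, which mirrors how the cache avoids repeated allocation and initialization.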