explain select * from test where collecttime > '2023-02-22 00:00:00' and collecttime < '2023-02-28 00:00:00';
                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather Motion 12:1 (slice1; segments: 12) (cost=0.00..431.18 rows=39 width=1626)
  -> Sequence (cost=0.00..431.02 rows=4 width=1626)
       -> Partition Selector for test (dynamic scan id: 1) (cost=10.00..100.00 rows=9 width=4)
            Partition selected: 6 (out of 33)
       -> Dynamic Seq Scan on test (dynamic scan id: 1) (cost=0.00..431.02 rows=4 width=1626)
            Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))   
Optimizer: Pivotal Optimizer (GPORCA)  

explain select * from test where collecttime > '2023-02-22 00:00:00' and collecttime < '2023-02-28 00:00:00'; 
                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather Motion 12:1 (slice1; segments: 12) (cost=0.00..431.18 rows=6 width=3748)
  -> Append (cost=0.00..0.00 rows=1 width=3748)
       -> Seq Scan on test_1_prt_p20230222_1 (cost=0.00..0.00 rows=1 width=3748)
            Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))
        ...
       -> Seq Scan on test_1_prt_p20230227_1 (cost=0.00..0.00 rows=1 width=3748)
            Filter: ((collecttime > '2023-02-22 00:00:00'::timestamp without time zone) AND (collecttime < '2023-02-28 00:00:00'::timestamp without time zone))  
Optimizer:  Postgres query optimizer

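Comparing the two plans above: with the Postgres query optimizer, each child partition gets its own Seq Scan node and the scans are chained together by an Append node, whereas with ORCA a single Dynamic Seq Scan node covers the scans of all child partitions that remain after partition pruning. This article focuses on the execution flow of Dynamic Seq Scan; the Partition Selector node will be covered later.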

Plan *CTranslatorDXLToPlStmt::TranslateDXLDynTblScan(const CDXLNode *dyn_tbl_scan_dxlnode, CDXLTranslateContext *output_context, CDXLTranslationContextArray *ctxt_translation_prev_siblings) translates a DXL dynamic table scan node into a DynamicSeqScan plan node, which is then handed to the executor.

Unlike the feature-rich execScan.c, src/backend/executor/execDynamicScan.c only provides a handful of helper functions:

  • isDynamicScan: returns true if the scan node is dynamic (i.e., determining relations to scan at runtime)
  • DynamicScan_GetDynamicScanId: returns the index into EState->dynamicTableScanInfo arrays for this dynamic scan node
  • DynamicScan_GetDynamicScanIdPrintable: returns the "printable" scan id for a node, for EXPLAIN
  • DynamicScan_GetTableOid: returns the Oid of the table/partition to scan
  • DynamicScan_SetTableOid: selects a partition to scan in a dynamic scan
  • DynamicScan_RemapExpression: re-maps the expression using the provided attMap

These functions are very simple, so they are not walked through here.

DynamicSeqScan data structures

DynamicSeqScan embeds a SeqScan. Its main member is partIndex, which indexes the pidIndexes and curRelOids arrays in EState->dynamicTableScanInfo. pidIndexes[partIndex-1] is a hash table (HTAB) whose key is an Oid (the child partition's oid) and whose entry is a PartOidEntry ( typedef struct PartOidEntry{ Oid partOid; List *selectorList; /* list of partition selectors that produced the above part oid */ } PartOidEntry; ), while curRelOids[partIndex - 1] holds the oid of the child partition currently being scanned.

typedef struct DynamicSeqScan {	
	SeqScan		seqscan; /* Fields shared with a normal SeqScan. Must be first! */
	/* Index to arrays in EState->dynamicTableScanInfo, that contain information about the partitions that need to be scanned. */
	int32 		partIndex; // mainly used to index pidIndexes and curRelOids in EState->dynamicTableScanInfo
	
	int32 		partIndexPrintable;
} DynamicSeqScan;
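
To make the 1-based partIndex convention concrete, here is a minimal standalone sketch. ToyDynamicTableScanInfo and the toy_* helpers are hypothetical names invented for this illustration; they are not the Greenplum definitions, and the real helpers work against the executor's EState rather than a bare struct. Only the "slot partIndex - 1" rule is taken from the description above.

```c
#include <assert.h>

typedef unsigned int Oid;              /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)

/* Hypothetical, heavily simplified stand-in for EState->dynamicTableScanInfo. */
typedef struct ToyDynamicTableScanInfo {
    int  numScans;    /* number of dynamic scans in the plan */
    Oid *curRelOids;  /* curRelOids[i]: partition being scanned by scan id i + 1 */
} ToyDynamicTableScanInfo;

/* partIndex is 1-based, so this scan's slot is partIndex - 1. */
static Oid toy_get_table_oid(const ToyDynamicTableScanInfo *info, int partIndex)
{
    assert(partIndex >= 1 && partIndex <= info->numScans);
    return info->curRelOids[partIndex - 1];
}

/* Record which partition the scan identified by partIndex is reading now. */
static void toy_set_table_oid(ToyDynamicTableScanInfo *info, int partIndex, Oid partOid)
{
    assert(partIndex >= 1 && partIndex <= info->numScans);
    info->curRelOids[partIndex - 1] = partOid;
}
```

The asserts guard the off-by-one that the 1-based index invites: passing 0 would otherwise read one slot before the array.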

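The scan_state field of DynamicSeqScanState takes one of four values:

  • SCAN_INIT: we are initializing the scan state
  • SCAN_SCAN: all initializations for reading tuples are done and we are either reading tuples, or ready to read tuples
  • SCAN_DONE: we are done with all relations/partitions, but the scan state is still valid for a ReScan (i.e., we haven't destroyed our scan state yet)
  • SCAN_END: we are completely done. We cannot ReScan, without redoing the whole initialization phase again.

firstPartition marks that the scan is on its first child partition. pidIndex tracks all unique partition pids (partition oids); it is set to EState->dynamicTableScanInfo->pidIndexes[partIndex-1] on the first call to ExecDynamicSeqScan. pidStatus is used to walk sequentially through all oids in the pidIndex hash table, and the scan uses it to obtain the oid of the next child partition to scan.

Because child partitions are scanned by reusing the SeqScan functions, DynamicSeqScanState contains a seqScanState member, and the ss_table hash table caches the seqScanState used for each child partition, which reduces seqScanState allocation and initialization work on rescan. cached_relids holds the oids of the child partitions that have been scanned or are being scanned. lastRelOid holds the oid of the first child partition encountered after the partition table structure changed, so that the structure metadata can be reused without fetching and initializing the qual, targetlist, and related structures again.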

typedef struct DynamicSeqScanState{
	ScanState	ss; // ss.ps.plan <- DynamicSeqScan  ss.ps.state <- estate ss.ps.qual ss.ps.targetlist
	int			scan_state; /* the stage of scanning */
	int			eflags;	
	Index		scanrelid;	/* scanrelid is the RTE index for this scan node. It will be used to select varno whose varattno will be remapped, if necessary */
	bool		firstPartition; /* The first partition requires initialization of expression states, such as qual and targetlist, regardless of whether we need to re-map varattno */

	
	HTAB	   *pidIndex; /* Pid index that maintains all unique partition pids for this dynamic table scan to scan. */	
	HASH_SEQ_STATUS pidStatus; /* The status of sequentially scan the pid index. */
	/* Should we call hash_seq_term()? This is required to handle error condition, where we are required to explicitly call hash_seq_term(). Also, if we don't have any partition, this flag should prevent ExecEndDynamicSeqScan from calling hash_seq_term() on a NULL hash table. */
	bool		shouldCallHashSeqTerm;
		
	SeqScanState *seqScanState;
    HTAB *ss_table; // key: Oid, entry: ScanOidEntry ( typedef struct ScanOidEntry{ Oid rel_id; void *ss; } ScanOidEntry; )
	List *cached_relids; 
	Oid			lastRelOid; /* lastRelOid is the last relation that corresponds to the varattno mapping of qual and target list. Each time we open a new partition, we will compare the last relation with current relation by using varattnos_map() and then convert the varattno to the new varattno */
	
	MemoryContext partitionMemoryContext; /* This memory context will be reset per-partition to free up previous partition's memory */
} DynamicSeqScanState;

DynamicSeqScan functions

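Besides performing the same initialization as a regular seqscan, ExecInitDynamicSeqScan sets firstPartition to true (the first partition is about to be scanned), sets up the cached oid of the last table whose structure was unchanged (lastRelOid), allocates the partitionMemoryContext used for each partition's qual and targetlist, and creates both the HTAB ss_table that caches per-partition seqScanState and the cached_relids list of partition oids.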

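ExecDynamicSeqScan runs in two parts. First, on the first call, pidIndex is set to EState->dynamicTableScanInfo->pidIndexes[partIndex-1]. Second, a for loop fetches tuples. Because DynamicSeqScan goes through the SeqScan interface, it has to switch from partition to partition, and each partition has its own seqScanState. The loop proceeds as follows: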

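  • If seqScanState is NULL, call initNextTableToScan to switch to a new partition
  • Call ExecSeqScan(node->seqScanState); if it yields a tuple, return it, otherwise continue to the next step
  • Call CleanupOnePartition(node) to clean up seqScanState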
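
The three steps above can be sketched as a small standalone loop. This is only an illustration under simplifying assumptions: partitions are plain integer arrays, and ToyPartition, ToyDynamicScan, toy_init_next_table, and toy_exec_dynamic_scan are hypothetical names that merely mirror the roles of seqScanState, initNextTableToScan, and ExecDynamicSeqScan. None of it is the actual executor code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy partition: a fixed array of integer "tuples" (illustrative only). */
typedef struct ToyPartition {
    const int *tuples;
    size_t     ntuples;
} ToyPartition;

/* Toy counterpart of DynamicSeqScanState, reduced to what the loop needs. */
typedef struct ToyDynamicScan {
    const ToyPartition *parts;    /* partitions that survived pruning */
    size_t              nparts;
    size_t              nextPart; /* index of the next partition to open */
    const ToyPartition *current;  /* plays the role of seqScanState; NULL = none open */
    size_t              pos;      /* read position inside current */
} ToyDynamicScan;

/* Open the next partition; returns 0 when no partitions are left. */
static int toy_init_next_table(ToyDynamicScan *scan)
{
    if (scan->nextPart >= scan->nparts)
        return 0;
    scan->current = &scan->parts[scan->nextPart++];
    scan->pos = 0;
    return 1;
}

/* Returns 1 and stores the next tuple in *out, or 0 once all partitions are exhausted. */
static int toy_exec_dynamic_scan(ToyDynamicScan *scan, int *out)
{
    assert(scan != NULL && out != NULL);
    for (;;)
    {
        /* Step 1: no partition is open, so switch to the next one. */
        if (scan->current == NULL && !toy_init_next_table(scan))
            return 0;

        /* Step 2: hand back a tuple if the current partition still has one. */
        if (scan->pos < scan->current->ntuples)
        {
            *out = scan->current->tuples[scan->pos++];
            return 1;
        }

        /* Step 3: partition exhausted; drop it and loop to pick the next. */
        scan->current = NULL;
    }
}
```

Note that an empty partition costs one extra trip around the loop and never surfaces to the caller, which is why the loop, rather than a single if, is needed.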

initNextTableToScan works as follows:

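  • Use pidStatus to get the oid of the next partition to scan
  • Look up ss_table to see whether a seqScanState is already cached for that partition oid
  • Get ss_currentRelation from the seqscanstate. If the current relation's table structure differs from that of the previously scanned partition lastRelOid, the varattno values in qual and targetlist must be updated, and lastRelOid is set to the current partition's oid. If an update is needed, or this is the first partition, qual and targetlist are re-initialized in the partitionMemoryContext memory context. The current partition's oid is then stored in EState->dynamicTableScanInfo->curRelOids[partIndex-1], marking it as the partition being scanned.
  • If the seqscanstate was found in the ss_table cache, ExecReScan must be executed on it; otherwise ExecInitSeqScanForPartition is called to create a new one, the new seqScanState is cached in ss_table, and the partition's oid is appended to cached_relids.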
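
The reuse-or-create decision in the last step can be sketched as follows. This is a toy model only: a small fixed array stands in for the ss_table HTAB and the cached_relids list, a counter stands in for calling ExecReScan, and ToyScanCache, ToyScanEntry, and toy_open_partition are hypothetical names. The varattno remapping and memory-context handling are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;          /* stand-in for PostgreSQL's Oid */

#define TOY_MAX_PARTS 8            /* arbitrary capacity chosen for the sketch */

/* Toy cache entry: the partition oid plus a count of how often it was reused. */
typedef struct ToyScanEntry {
    Oid rel_id;
    int rescans;
} ToyScanEntry;

/* Toy stand-in for ss_table and cached_relids combined. */
typedef struct ToyScanCache {
    ToyScanEntry entries[TOY_MAX_PARTS];
    size_t       count;
} ToyScanCache;

typedef enum ToyOpenResult { TOY_CREATED, TOY_REUSED, TOY_CACHE_FULL } ToyOpenResult;

/* Reuse the cached scan state for relid if present, otherwise create and cache one. */
static ToyOpenResult toy_open_partition(ToyScanCache *cache, Oid relid)
{
    assert(cache != NULL);
    for (size_t i = 0; i < cache->count; i++)
    {
        if (cache->entries[i].rel_id == relid)
        {
            cache->entries[i].rescans++;   /* the real code would rescan here */
            return TOY_REUSED;
        }
    }
    if (cache->count == TOY_MAX_PARTS)
        return TOY_CACHE_FULL;

    /* Cache miss: remember the new state and the partition oid. */
    cache->entries[cache->count].rel_id = relid;
    cache->entries[cache->count].rescans = 0;
    cache->count++;
    return TOY_CREATED;
}
```

A linear search keeps the sketch short; the hash table described above serves the same purpose with constant-time lookup when many partitions are involved.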