目标

这篇文章解答两个问题

1、一次查询,多个物化视图都可以满足查询条件,会选择哪个物化视图

2、查询优化过程中,怎么把一次Query拆分成多个子查询

一次查询过程,会调用DataSourceOptimizer.optimize()方法进行优化。TopNQuery/TimeseriesQuery/GroupByQuery类型的查询,会查找符合条件物化视图进行查询优化。

自然会有开头的两个疑问:查询优化的过程是怎么样的? 如果多个物化视图满足条件,选择哪个物化视图呢


优化流程

Query查询优化的流程:

【Druid】查询过程,多个物化视图符合,选哪个?_查询优化


1、输入查询Query(baseDataSource, Fields, Intervals)

2、通过baseDataSource从元数据中查找该baseDataSource的所有物化视图DerivativeDataSource,放入TreeSet中。形成TreeSet<DerivativeDataSource>

Tips: ​DerivativeDataSource实现了compareTo()方法,放入TreeSet中的DerivativeDataSource是有序的。按avgSizeBasedGranularity升序。

3、按查询中Fields,从所有物化视图TreeSet<DerivativeDataSource>中,筛选符合条件的物化视图Set<DerivativeDataSource> derivativesWithRequiredFields

4、拆分查询中的Intervals,每个Interval都从Set<DerivativeDataSource> derivativesWithRequiredFields匹配对应的物化视图。

这个过程需要遍历所有Intervals,如果有满足Interval的物化视图存在,则将其转化成sub-query保存到List中。未能匹配到物化视图的Interval,则直接从BaseDataSource查询

Tips:因为第2步骤,保存DerivativeDataSource有序,所以从derivativesWithRequiredFields匹配Interval的过程,avgSizeBasedGranularity小的优先用于该Interval查询,并转换成sub-query。

5、将所有sub-query放入List中返回

所以,解答开篇疑问一:一次查询,多个物化视图都可以满足查询条件,会选择哪个物化视图?

在查询过程中,如果多个物化视图符合查询条件(一个查询命中多个物化视图的情况),基于Granularity的平均大小。会命中avgSizeBasedGranularity最小的物化视图。这段逻辑中利用了TreeSet集合有序存储。


数据举例Demo

假设原始DataSource:base,其有两个物化视图,分别是derivative,derivative2。这些datasource覆盖Intervals, dims, size 分别如下json表示.

{
"dataSource":"base",
"intervals":[
"2011-04-01/2011-04-02",
"2011-04-02/2011-04-03",
"2011-04-03/2011-04-04",
"2011-04-04/2011-04-05",
"2011-04-05/2011-04-06"
],
"dims": ["dim1", "dim2", "dim3", "dim4"],
"size": "1024 * 1024"
}

{
"dataSource":"derivative",
"intervals":[
"2011-04-01/2011-04-02",
"2011-04-02/2011-04-03",
"2011-04-03/2011-04-04"
],
"dims": ["dim1", "dim2", "dim3"],
"size": "1024"
}

{
"dataSource":"derivative2",
"intervals":[
"2011-04-03/2011-04-04"
],
"dims": ["dim1", "dim2"],
"size": "1023"
}


现在有个查询


SELECT 
dim1, SUM(cost)
FROM base_datasource
WHERE __time >=2011-04-01 AND __time <=2011-04-06
GROUP BY dim1
LIMIT 4;

1、这次查询

时间范围Intervals: [2011-04-01, 2011-04-06]

查询字段Fields [dim1, cost]

2、从元数据查找basedatasource的所有物化视图,有[derivative2, derivative],并放入TreeSet<DerivativeDataSource>中。

注意物化视图在TreeSet中的顺序是按照size从小到大升序。derivative2:1023, derivative:1024

3、通过查询字段Fields [dim1, cost],发现[derivative2, derivative]都满足条件,将其放入Set<DerivativeDataSource> derivativesWithRequiredFields中

4、按每个Interval从物化视图集合derivativesWithRequiredFields进行匹配

5、最终形成List<Query>

0 = {TopNQuery@6873} "TopNQuery{dataSource='derivative2', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-03T00:00:00.000Z/2011-04-04T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"
1 = {TopNQuery@6874} "TopNQuery{dataSource='derivative', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-01T00:00:00.000Z/2011-04-03T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"
2 = {TopNQuery@6875} "TopNQuery{dataSource='base', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-04T00:00:00.000Z/2011-04-06T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"

intervals=[2011-04-03T00:00:00.000Z/2011-04-04T00:00:00.000Z]从dataSource='derivative2'查询

intervals=[2011-04-01T00:00:00.000Z/2011-04-03T00:00:00.000Z]从dataSource='derivative'查询

intervals=[2011-04-04T00:00:00.000Z/2011-04-06T00:00:00.000Z]从dataSource='base'查询,base是原始DataSource,并不是物化视图