目标
这篇文章解答两个问题
1、一次查询,多个物化视图都可以满足查询条件,会选择哪个物化视图
2、查询优化过程中,怎么把一次Query拆分成多个子查询
一次查询过程,会调用DataSourceOptimizer.optimize()方法进行优化。TopNQuery/TimeseriesQuery/GroupByQuery类型的查询,会查找符合条件物化视图进行查询优化。
自然会有开头的两个疑问:查询优化的过程是怎么样的? 如果多个物化视图满足条件,选择哪个物化视图呢
优化流程
Query查询优化的流程:
1、输入查询Query(baseDataSource, Fields, Intervals)
2、通过baseDataSource从元数据中查找该baseDataSource的所有物化视图DerivativeDataSource,放入TreeSet中。形成TreeSet<DerivativeDataSource>
Tips: DerivativeDataSource实现了compareTo()方法,放入TreeSet中的DerivativeDataSource是有序的。按avgSizeBasedGranularity升序。
3、按查询中Fields,从所有物化视图TreeSet<DerivativeDataSource>中,筛选符合条件的物化视图Set<DerivativeDataSource> derivativesWithRequiredFields
4、拆分查询中的Intervals,每个Interval都从Set<DerivativeDataSource> derivativesWithRequiredFields匹配对应的物化视图。
这个过程需要遍历所有Intervals,如果有满足Interval的物化视图存在,则将其转化成sub-query保存到List中。未能匹配到物化视图的Interval,则直接从BaseDataSource查询
Tips:因为第2步骤,保存DerivativeDataSource有序,所以从derivativesWithRequiredFields匹配Interval的过程,avgSizeBasedGranularity小的优先用于该Interval查询,并转换成sub-query。
5、将所有sub-query放入List中返回
所以,解答开篇疑问一:一次查询,多个物化视图都可以满足查询条件,会选择哪个物化视图?
在查询过程中,如果多个物化视图符合查询条件(一个查询命中多个物化视图的情况),基于Granularity的平均大小。会命中avgSizeBasedGranularity最小的物化视图。这段逻辑中利用了TreeSet集合有序存储。
数据举例Demo
假设原始DataSource:base,其有两个物化视图,分别是derivative,derivative2。这些datasource覆盖Intervals, dims, size 分别如下json表示.
{
"dataSource":"base",
"intervals":[
"2011-04-01/2011-04-02",
"2011-04-02/2011-04-03",
"2011-04-03/2011-04-04",
"2011-04-04/2011-04-05",
"2011-04-05/2011-04-06"
],
"dims": ["dim1", "dim2", "dim3", "dim4"],
"size": "1024 * 1024"
}
{
"dataSource":"derivative",
"intervals":[
"2011-04-01/2011-04-02",
"2011-04-02/2011-04-03",
"2011-04-03/2011-04-04"
],
"dims": ["dim1", "dim2", "dim3"],
"size": "1024"
}
{
"dataSource":"derivative2",
"intervals":[
"2011-04-03/2011-04-04"
],
"dims": ["dim1", "dim2"],
"size": "1023"
}
现在有个查询
SELECT
dim1, SUM(cost)
FROM base_datasource
WHERE __time >=2011-04-01 AND __time <=2011-04-06
GROUP BY dim1
LIMIT 4;
1、这次查询
时间范围Intervals: [2011-04-01, 2011-04-06]
查询字段Fields [dim1, cost]
2、从元数据查找basedatasource的所有物化视图,有[derivative2, derivative],并放入TreeSet<DerivativeDataSource>中。
注意物化视图在TreeSet中的顺序是按照size从小到大升序。derivative2:1023, derivative:1024
3、通过查询字段Fields [dim1, cost],发现[derivative2, derivative]都满足条件,将其放入Set<DerivativeDataSource> derivativesWithRequiredFields中
4、按每个Interval从物化视图集合derivativesWithRequiredFields进行匹配
5、最终形成List<Query>
0 = {TopNQuery@6873} "TopNQuery{dataSource='derivative2', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-03T00:00:00.000Z/2011-04-04T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"
1 = {TopNQuery@6874} "TopNQuery{dataSource='derivative', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-01T00:00:00.000Z/2011-04-03T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"
2 = {TopNQuery@6875} "TopNQuery{dataSource='base', dimensionSpec=DefaultDimensionSpec{dimension='dim1', outputName='dim1', outputType='STRING'}, topNMetricSpec=NumericTopNMetricSpec{metric='cost'}, threshold=4, querySegmentSpec=MultipleIntervalSegmentSpec{intervals=[2011-04-04T00:00:00.000Z/2011-04-06T00:00:00.000Z]}, virtualColumns=[], dimFilter=null, granularity='AllGranularity', aggregatorSpecs=[LongSumAggregatorFactory{fieldName='cost', expression='null', name='cost'}], postAggregatorSpecs=[]}"
intervals=[2011-04-03T00:00:00.000Z/2011-04-04T00:00:00.000Z]从dataSource='derivative2'查询
intervals=[2011-04-01T00:00:00.000Z/2011-04-03T00:00:00.000Z]从dataSource='derivative'查询
intervals=[2011-04-04T00:00:00.000Z/2011-04-06T00:00:00.000Z]从dataSource='base'查询,base是原始DataSource,并不是物化视图