在短视频项目开发中曾经遇到过一个SQL优化的问题,SQL是实现多维报表的统计,有8个维度,有曝光、点击、播放几种行为的pv、uv统计,此类需求也比较常见。

         需求如下:

数据量大的表需要键索引吗 表数据过大怎么优化sql_oracle单表数据量大怎么优化

         需求有9个维度,报表是每天统计,日期等于统计周期,常量,可以不考虑,剩下还有8个维度(国家、系统、版本、用户类型、实验名称、实验分组、行为入口、是否点击push)。

         实现这个需求最常见的写法是groupby with cube,实现的伪代码如下:

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql优化常用的几种方法_02

这种写法在数据量小维度少的场景是行之有效的,但是数据量大维度多的时候就会跑的很吃力。尝试第一种写法,执行了4个小时30分钟后,SQL执行记录如下所示:

数据量大的表需要键索引吗 表数据过大怎么优化sql_数据量大的表需要键索引吗_03


为什么需要执行这么长时间,来看看数据量,符合统计需求的事件记录数是4.38 亿条,每个用户平均价会有8 ( 21486346/2658004=8.08 )个实验,大致估算基础数据量有35 亿,而维度有8 个,也就是需要256(2 的8 次方) 次group by 运算,每次group by 还有几个count distinct ,的确很吃力。

数据量大的表需要键索引吗 表数据过大怎么优化sql_oracle单表数据量大怎么优化_04

数据量大的表需要键索引吗 表数据过大怎么优化sql_oracle单表数据量大怎么优化_05

由于需要进行256次group by,成本比较高,尝试第二种写法,通过lateral view explode先把基础数据膨胀后直接进行group by,实现的伪代码如下:

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql sum 去重_06

Lateral view explode写法执行时间2小时10分钟,相比group bywithcube写法提升一倍,但是2小时还是比较长。SQL执行记录如下所示:

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql sum 去重_07

有一点特别要注意的是,明细数据膨胀完一定要数据落地,落地之后数据会重分布,如果数据没有落地,执行时间是7小时35分钟,比原来group by with cube写法还要缓慢。

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql优化常用的几种方法_08

由于之前有kylin的项目经验,当时做优化时就想是否可参考kylin的分层叠加的做法来实现需求,kylin通过逐层构建的方法实现整个cube的构建。

数据量大的表需要键索引吗 表数据过大怎么优化sql_oracle单表数据量大怎么优化_09

整个立方体称为1个cube,立方体中每个网格点称为1个cuboid,图中(time,item,location,supplier)和(time,supplier)都是cuboid,最多维度的组合(time,item,location,supplier)称为Basecuboid。cube的计算过程是逐层计算的,首先计算Base cuboid,然后计算维度数依次减少,逐层向上计算每层的cuboid。

现实遇到的2个问题:1、参照kylin的做法使用逐层叠加的方式会遇到去重的问题,由于kylin有bitmap实现精准去重,应该如何处理不同层的去重问题;2、由于8个层次,如果逐层叠加,SQL会非常长,可读性也是要考虑的问题。

改进后的方法是尝试先找到最小可叠加cube,最小可叠加cube以上的维度是用户数是可以直接进行求和(sum)计算的,执行结果和明细数据直接countdistinct的结果必须是一致的,最小可叠加维度以下的维度可以如果少可以通过groupingsets计算得到。

在这个需求中用户数会引发去重的问题,对于可叠加维度,合计的度量值等于各位维度的度量值相加,而不可叠加维度,合计的度量值小于各位维度的度量值相加。例如国家维度,一个用户只属于一个国家,那么在计算所有国家的用户就可以使用各个国家的用户直接相加;而入口维度,一个用户在多个入口都会有记录,在算总入口用户数时就无法直接累加。

分析之后,可叠加维度有:国家、系统、版本、用户类型(新、老)、是否点击push,而不可叠加维度有:实验名称、实验分组、行为入口,那么先实现不可叠加维度的分组汇总,再计算可叠加的统计值。

实现的伪代码如下所示:

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql优化常用的几种方法_10

实际的sql改造后执行时间仅为15分钟不到,效率有本质上的提升。SQL执行记录如下所示:

数据量大的表需要键索引吗 表数据过大怎么优化sql_sql sum 去重_11

在相同集群相同队列的情况,三种写法的执行时间如下所示:

写法

执行时间

groupByWithCube

4小时30分钟

lateralViewExplode

2小时10分钟

MiniSum

15分钟

groupByWithCube的写法对于数据量小维度少的情况是使用的,lateralViewExplode写法相比groupByWithCube性能有较大的提升,而最小可叠加cube的方法在数据量大维度多的情况是可以带来质的提升。

    最后附上三种写法的SQL,已经对库名和表名做了模糊处理:

--------------------------------------groupByWithCube--------------------------------------
set hive.new.job.grouping.set.cardinality=1024;
create temporary table t_device as
select 
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from test.dwf_act_device_info_d 
where dt = '2019-06-25' and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as 
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from ( select dt,hdid,test_info
       from test.dwf_act_device_info_d t
       where dt = '2019-06-25' )q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select  
        hdid
      ,event['page_source'] as page_source
      ,sum(if(event_id='10201-0006',1,0))                     as tab_show_pv    --子tab曝光次数
      ,sum(if(event_id='10201-0001',1,0))                     as cover_show_pv  --视频封面曝光次数 
      ,sum(if(event_id='10201-0002',1,0))                     as cover_click_pv --视频封面点击次数
      ,sum(if(event_id='10202-0001',1,0))                     as detail_show_pv --详情页曝光次数
      ,sum(if(event_id='10202-0002',1,0))                     as play_pv        --视频播放次数
      ,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0))    as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv   --视频完播次数
from test.dwv_consumer_event_d
where  dt = '2019-06-25'  
  and  event_id  in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002') 
  and  event['page_source'] in ('1','2','3')
group by hdid 
       ,event['page_source']
;
select '2019-06-25' as dt
      ,nvl(t1.country        ,'all') as country        
      ,nvl(t1.sys            ,'all') as sys            
      ,nvl(t1.ver            ,'all') as ver            
      ,nvl(t1.device_type    ,'all') as device_type    
      ,nvl(t2.layer_id_name  ,'all') as layer_id_name  
      ,nvl(t2.test_id_name   ,'all') as test_id_name   
      ,nvl(t3.page_source    ,'all') as page_source    
      ,nvl(t1.if_click_push  ,'all') as if_click_push 
      ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
      ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
      ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
      ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
      ,sum(play_pv                 ) as play_pv        --视频播放次数
      ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
      ,sum(distinct if(tab_show_pv   >0,t1.hdid,null)) as tab_show_uv    --子tab曝光用户数
      ,sum(distinct if(cover_show_pv >0,t1.hdid,null)) as cover_show_uv  --视频封面曝光用户数 
      ,sum(distinct if(cover_click_pv>0,t1.hdid,null)) as cover_click_uv --视频封面点击用户数
      ,sum(distinct if(detail_show_pv>0,t1.hdid,null)) as detail_show_uv --详情页曝光用户数
      ,sum(distinct if(play_pv       >0,t1.hdid,null)) as play_uv        --视频播放用户数
      ,sum(distinct if(first_play_pv >0,t1.hdid,null)) as first_play_uv  --视频首播用户数(剔除循环播放)
      ,sum(distinct if(full_play_pv  >0,t1.hdid,null)) as full_play_uv   --视频完播用户数
from t_device as t1 
join tmp_hdid_test_info   as t2 on t1.hdid=t2.hdid 
left join t_consume_event as t3 on t1.hdid=t3.hdid 
group by 
       t1.country
      ,t1.sys
      ,t1.ver
      ,t1.device_type
      ,t2.layer_id_name
      ,t2.test_id_name
      ,t3.page_source
      ,t1.if_click_push
with cube
--------------------------------------lateralViewExplode--------------------------------------
--v3
create temporary table t_device as
select 
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from 
test.dwf_act_device_info_d 
where dt = '2019-06-25'
and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as 
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from (
   select dt,hdid,test_info
   from test.dwf_act_device_info_d t
   where dt = '2019-06-25'
)q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select  
        hdid
      ,event['page_source'] as page_source
      ,sum(if(event_id='10201-0006',1,0))                     as tab_show_pv    --子tab曝光次数
      ,sum(if(event_id='10201-0001',1,0))                     as cover_show_pv  --视频封面曝光次数 
      ,sum(if(event_id='10201-0002',1,0))                     as cover_click_pv --视频封面点击次数
      ,sum(if(event_id='10202-0001',1,0))                     as detail_show_pv --详情页曝光次数
      ,sum(if(event_id='10202-0002',1,0))                     as play_pv        --视频播放次数
      ,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0))    as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv   --视频完播次数
from test.dwv_consumer_event_d
where  dt = '2019-06-25'  
  and  event_id  in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002') 
  and  event['page_source'] in ('1','2','3')
group by hdid 
       ,event['page_source']
;
create temporary table tmp_base as 
select 
       '2019-06-25' as dt
      ,t1.hdid
      ,nvl(t1.country        ,'unknown') as country        
      ,nvl(t1.sys            ,'unknown') as sys            
      ,nvl(t1.ver            ,'unknown') as ver            
      ,nvl(t1.device_type    ,'unknown') as device_type    
      ,nvl(t2.layer_id_name  ,'unknown') as layer_id_name  
      ,nvl(t2.test_id_name   ,'unknown') as test_id_name   
      ,nvl(t3.page_source    ,'unknown') as page_source    
      ,nvl(t1.if_click_push  ,'unknown') as if_click_push 
      ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
      ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
      ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
      ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
      ,sum(play_pv                 ) as play_pv        --视频播放次数
      ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
from t_device as t1 
join tmp_hdid_test_info   as t2 on t1.hdid=t2.hdid 
left join t_consume_event as t3 on t1.hdid=t3.hdid 
group by 
       t1.hdid
      ,nvl(t1.country        ,'unknown')
      ,nvl(t1.sys            ,'unknown')
      ,nvl(t1.ver            ,'unknown')
      ,nvl(t1.device_type    ,'unknown')
      ,nvl(t2.layer_id_name  ,'unknown')
      ,nvl(t2.test_id_name   ,'unknown')
      ,nvl(t3.page_source    ,'unknown')
      ,nvl(t1.if_click_push  ,'unknown')
;

--切记要把膨胀后的结果落地

create temporary table tmp_explode as 
select 
       hdid
      ,countrys      
      ,syss          
      ,vers          
      ,device_types  
      ,layer_id_names
      ,test_id_names 
      ,page_sources  
      ,if_click_pushs
      ,tab_show_pv   
      ,cover_show_pv 
      ,cover_click_pv
      ,detail_show_pv
      ,play_pv       
      ,first_play_pv 
      ,full_play_pv  
from tmp_base t1
lateral view explode(split(concat(country      ,'#all'),'#'))a as countrys      
lateral view explode(split(concat(sys          ,'#all'),'#'))a as syss          
lateral view explode(split(concat(ver          ,'#all'),'#'))a as vers          
lateral view explode(split(concat(device_type  ,'#all'),'#'))a as device_types  
lateral view explode(split(concat(layer_id_name,'#all'),'#'))a as layer_id_names
lateral view explode(split(concat(test_id_name ,'#all'),'#'))a as test_id_names 
lateral view explode(split(concat(page_source  ,'#all'),'#'))a as page_sources  
lateral view explode(split(concat(if_click_push,'#all'),'#'))a as if_click_pushs
;
select countrys      
      ,syss          
      ,vers          
      ,device_types  
      ,layer_id_names
      ,test_id_names 
      ,page_sources  
      ,if_click_pushs
      ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
      ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
      ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
      ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
      ,sum(play_pv                 ) as play_pv        --视频播放次数
      ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
      ,sum(distinct if(tab_show_pv   >0,hdid,null)) as tab_show_uv    --子tab曝光用户数
      ,sum(distinct if(cover_show_pv >0,hdid,null)) as cover_show_uv  --视频封面曝光用户数 
      ,sum(distinct if(cover_click_pv>0,hdid,null)) as cover_click_uv --视频封面点击用户数
      ,sum(distinct if(detail_show_pv>0,hdid,null)) as detail_show_uv --详情页曝光用户数
      ,sum(distinct if(play_pv       >0,hdid,null)) as play_uv        --视频播放用户数
      ,sum(distinct if(first_play_pv >0,hdid,null)) as first_play_uv  --视频首播用户数(剔除循环播放)
      ,sum(distinct if(full_play_pv  >0,hdid,null)) as full_play_uv   --视频完播用户数
from tmp_explode
group by
       countrys      
      ,syss          
      ,vers          
      ,device_types  
      ,layer_id_names
      ,test_id_names 
      ,page_sources  
      ,if_click_pushs
;
--------------------------------------MiniSum--------------------------------------
--v4
create temporary table t_device as
select 
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from 
test.dwf_act_device_info_d 
where dt = '2019-06-25'
and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as 
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from (
   select dt,hdid,test_info
   from test.dwf_act_device_info_d t
   where dt = '2019-06-25'
)q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select  
        hdid
      ,event['page_source'] as page_source
      ,sum(if(event_id='10201-0006',1,0))                     as tab_show_pv    --子tab曝光次数
      ,sum(if(event_id='10201-0001',1,0))                     as cover_show_pv  --视频封面曝光次数 
      ,sum(if(event_id='10201-0002',1,0))                     as cover_click_pv --视频封面点击次数
      ,sum(if(event_id='10202-0001',1,0))                     as detail_show_pv --详情页曝光次数
      ,sum(if(event_id='10202-0002',1,0))                     as play_pv        --视频播放次数
      ,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0))    as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv   --视频完播次数
from test.dwv_consumer_event_d
where  dt = '2019-06-25'  
  and  event_id  in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002') 
  and  event['page_source'] in ('1','2','3')
group by hdid 
       ,event['page_source']
;
with tmp_base as 
(
    select 
           '2019-06-25' as dt
          ,hdid
          ,country        
          ,sys            
          ,ver            
          ,device_type    
          ,nvl(layer_id_name  ,'all') as layer_id_name  
          ,nvl(test_id_name   ,'all') as test_id_name   
          ,nvl(page_source    ,'all') as page_source    
          ,if_click_push  
          ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
          ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
          ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
          ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
          ,sum(play_pv                 ) as play_pv        --视频播放次数
          ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
          ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
    from ( 
            select
                 t1.hdid
                ,nvl(t1.country        ,'unknown') as country        
                ,nvl(t1.sys            ,'unknown') as sys            
                ,nvl(t1.ver            ,'unknown') as ver            
                ,nvl(t1.device_type    ,'unknown') as device_type    
                ,nvl(t2.layer_id_name  ,'unknown') as layer_id_name  
                ,nvl(t2.test_id_name   ,'unknown') as test_id_name   
                ,nvl(t3.page_source    ,'unknown') as page_source    
                ,nvl(t1.if_click_push  ,'unknown') as if_click_push 
                ,tab_show_pv    
                ,cover_show_pv  
                ,cover_click_pv 
                ,detail_show_pv 
                ,play_pv        
                ,first_play_pv  
                ,full_play_pv 
            from t_device as t1 
            join tmp_hdid_test_info   as t2 on t1.hdid=t2.hdid 
            left join t_consume_event as t3 on t1.hdid=t3.hdid   
         ) as t
    group by
           hdid
          ,country        
          ,sys            
          ,ver            
          ,device_type    
          ,layer_id_name  
          ,test_id_name   
          ,page_source    
          ,if_click_push 
    grouping sets (
                         (hdid,country,sys,ver,device_type,if_click_push,layer_id_name,test_id_name,page_source)
                        ,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name,test_id_name)
                        ,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name,page_source)
                        ,(hdid,country,sys,ver,device_type,if_click_push,test_id_name,page_source)
                        ,(hdid,country,sys,ver,device_type,if_click_push,page_source)
                        ,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name)
                        ,(hdid,country,sys,ver,device_type,if_click_push,test_id_name)
                        ,(hdid,country,sys,ver,device_type,if_click_push)
                ) 
)
,tmp_explode as 
(
     select 
           '2019-06-25' as dt
          ,country        
          ,sys            
          ,ver            
          ,device_type    
          ,layer_id_name  
          ,test_id_name   
          ,page_source    
          ,if_click_push  
          ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
          ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
          ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
          ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
          ,sum(play_pv                 ) as play_pv        --视频播放次数
          ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
          ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
          ,sum(if(tab_show_pv   >0,1,0)) as tab_show_uv    --子tab曝光用户数
          ,sum(if(cover_show_pv >0,1,0)) as cover_show_uv  --视频封面曝光用户数 
          ,sum(if(cover_click_pv>0,1,0)) as cover_click_uv --视频封面点击用户数
          ,sum(if(detail_show_pv>0,1,0)) as detail_show_uv --详情页曝光用户数
          ,sum(if(play_pv       >0,1,0)) as play_uv        --视频播放用户数
          ,sum(if(first_play_pv >0,1,0)) as first_play_uv  --视频首播用户数(剔除循环播放)
          ,sum(if(full_play_pv  >0,1,0)) as full_play_uv   --视频完播用户数
    from tmp_base
    group by 
           country        
          ,sys            
          ,ver            
          ,device_type    
          ,layer_id_name  
          ,test_id_name   
          ,page_source    
          ,if_click_push 
)
select countrys      
      ,syss          
      ,vers          
      ,device_types  
      ,layer_id_name
      ,test_id_name 
      ,page_source  
      ,if_click_pushs
      ,sum(tab_show_pv             ) as tab_show_pv    --子tab曝光次数
      ,sum(cover_show_pv           ) as cover_show_pv  --视频封面曝光次数 
      ,sum(cover_click_pv          ) as cover_click_pv --视频封面点击次数
      ,sum(detail_show_pv          ) as detail_show_pv --详情页曝光次数
      ,sum(play_pv                 ) as play_pv        --视频播放次数
      ,sum(first_play_pv           ) as first_play_pv  --视频首播次数(剔除循环播放)
      ,sum(full_play_pv            ) as full_play_pv   --视频完播次数
      ,sum(tab_show_uv             ) as tab_show_uv    --子tab曝光用户数
      ,sum(cover_show_uv           ) as cover_show_uv  --视频封面曝光用户数 
      ,sum(cover_click_uv          ) as cover_click_uv --视频封面点击用户数
      ,sum(detail_show_uv          ) as detail_show_uv --详情页曝光用户数
      ,sum(play_uv                 ) as play_uv        --视频播放用户数
      ,sum(first_play_uv           ) as first_play_uv  --视频首播用户数(剔除循环播放)
      ,sum(full_play_uv            ) as full_play_uv   --视频完播用户数
from tmp_explode
lateral view explode(split(concat(country      ,'#all'),'#'))a as countrys      
lateral view explode(split(concat(sys          ,'#all'),'#'))a as syss          
lateral view explode(split(concat(ver          ,'#all'),'#'))a as vers          
lateral view explode(split(concat(device_type  ,'#all'),'#'))a as device_types  
lateral view explode(split(concat(if_click_push,'#all'),'#'))a as if_click_pushs
group by
       countrys      
      ,syss          
      ,vers          
      ,device_types  
      ,layer_id_name
      ,test_id_name 
      ,page_source  
      ,if_click_pushs
;