在短视频项目开发中曾经遇到过一个SQL优化的问题,SQL是实现多维报表的统计,有8个维度,有曝光、点击、播放几种行为的pv、uv统计,此类需求也比较常见。
需求如下:
需求有9个维度,报表是每天统计,日期等于统计周期,常量,可以不考虑,剩下还有8个维度(国家、系统、版本、用户类型、实验名称、实验分组、行为入口、是否点击push)。
实现这个需求最常见的写法是groupby with cube,实现的伪代码如下:
这种写法在数据量小维度少的场景是行之有效的,但是数据量大维度多的时候就会跑的很吃力。尝试第一种写法,执行了4个小时30分钟后,SQL执行记录如下所示:
为什么需要执行这么长时间,来看看数据量,符合统计需求的事件记录数是4.38
亿条,每个用户平均价会有8
(
21486346/2658004=8.08
)个实验,大致估算基础数据量有35
亿,而维度有8
个,也就是需要256(2
的8
次方)
次group by
运算,每次group by
还有几个count distinct
,的确很吃力。
由于需要进行256次group by,成本比较高,尝试第二种写法,通过lateral view explode先把基础数据膨胀后直接进行group by,实现的伪代码如下:
Lateral view explode写法执行时间2小时10分钟,相比group bywithcube写法提升一倍,但是2小时还是比较长。SQL执行记录如下所示:
有一点特别要注意的是,明细数据膨胀完一定要数据落地,落地之后数据会重分布,如果数据没有落地,执行时间是7小时35分钟,比原来group by with cube写法还要缓慢。
由于之前有kylin的项目经验,当时做优化时就想是否可参考kylin的分层叠加的做法来实现需求,kylin通过逐层构建的方法实现整个cube的构建。
整个立方体称为1个cube,立方体中每个网格点称为1个cuboid,图中(time,item,location,supplier)和(time,supplier)都是cuboid,最多维度的组合(time,item,location,supplier)称为Basecuboid。cube的计算过程是逐层计算的,首先计算Base cuboid,然后计算维度数依次减少,逐层向上计算每层的cuboid。
现实遇到的2个问题:1、参照kylin的做法使用逐层叠加的方式会遇到去重的问题,由于kylin有bitmap实现精准去重,应该如何处理不同层的去重问题;2、由于8个层次,如果逐层叠加,SQL会非常长,可读性也是要考虑的问题。
改进后的方法是尝试先找到最小可叠加cube,最小可叠加cube以上的维度是用户数是可以直接进行求和(sum)计算的,执行结果和明细数据直接countdistinct的结果必须是一致的,最小可叠加维度以下的维度可以如果少可以通过groupingsets计算得到。
在这个需求中用户数会引发去重的问题,对于可叠加维度,合计的度量值等于各位维度的度量值相加,而不可叠加维度,合计的度量值小于各位维度的度量值相加。例如国家维度,一个用户只属于一个国家,那么在计算所有国家的用户就可以使用各个国家的用户直接相加;而入口维度,一个用户在多个入口都会有记录,在算总入口用户数时就无法直接累加。
分析之后,可叠加维度有:国家、系统、版本、用户类型(新、老)、是否点击push,而不可叠加维度有:实验名称、实验分组、行为入口,那么先实现不可叠加维度的分组汇总,再计算可叠加的统计值。
实现的伪代码如下所示:
实际的sql改造后执行时间仅为15分钟不到,效率有本质上的提升。SQL执行记录如下所示:
在相同集群相同队列的情况,三种写法的执行时间如下所示:
写法 | 执行时间 |
groupByWithCube | 4小时30分钟 |
lateralViewExplode | 2小时10分钟 |
MiniSum | 15分钟 |
groupByWithCube的写法对于数据量小维度少的情况是使用的,lateralViewExplode写法相比groupByWithCube性能有较大的提升,而最小可叠加cube的方法在数据量大维度多的情况是可以带来质的提升。
最后附上三种写法的SQL,已经对库名和表名做了模糊处理:
--------------------------------------groupByWithCube--------------------------------------
set hive.new.job.grouping.set.cardinality=1024;
create temporary table t_device as
select
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from test.dwf_act_device_info_d
where dt = '2019-06-25' and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from ( select dt,hdid,test_info
from test.dwf_act_device_info_d t
where dt = '2019-06-25' )q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select
hdid
,event['page_source'] as page_source
,sum(if(event_id='10201-0006',1,0)) as tab_show_pv --子tab曝光次数
,sum(if(event_id='10201-0001',1,0)) as cover_show_pv --视频封面曝光次数
,sum(if(event_id='10201-0002',1,0)) as cover_click_pv --视频封面点击次数
,sum(if(event_id='10202-0001',1,0)) as detail_show_pv --详情页曝光次数
,sum(if(event_id='10202-0002',1,0)) as play_pv --视频播放次数
,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0)) as first_play_pv --视频首播次数(剔除循环播放)
,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv --视频完播次数
from test.dwv_consumer_event_d
where dt = '2019-06-25'
and event_id in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002')
and event['page_source'] in ('1','2','3')
group by hdid
,event['page_source']
;
select '2019-06-25' as dt
,nvl(t1.country ,'all') as country
,nvl(t1.sys ,'all') as sys
,nvl(t1.ver ,'all') as ver
,nvl(t1.device_type ,'all') as device_type
,nvl(t2.layer_id_name ,'all') as layer_id_name
,nvl(t2.test_id_name ,'all') as test_id_name
,nvl(t3.page_source ,'all') as page_source
,nvl(t1.if_click_push ,'all') as if_click_push
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
,sum(distinct if(tab_show_pv >0,t1.hdid,null)) as tab_show_uv --子tab曝光用户数
,sum(distinct if(cover_show_pv >0,t1.hdid,null)) as cover_show_uv --视频封面曝光用户数
,sum(distinct if(cover_click_pv>0,t1.hdid,null)) as cover_click_uv --视频封面点击用户数
,sum(distinct if(detail_show_pv>0,t1.hdid,null)) as detail_show_uv --详情页曝光用户数
,sum(distinct if(play_pv >0,t1.hdid,null)) as play_uv --视频播放用户数
,sum(distinct if(first_play_pv >0,t1.hdid,null)) as first_play_uv --视频首播用户数(剔除循环播放)
,sum(distinct if(full_play_pv >0,t1.hdid,null)) as full_play_uv --视频完播用户数
from t_device as t1
join tmp_hdid_test_info as t2 on t1.hdid=t2.hdid
left join t_consume_event as t3 on t1.hdid=t3.hdid
group by
t1.country
,t1.sys
,t1.ver
,t1.device_type
,t2.layer_id_name
,t2.test_id_name
,t3.page_source
,t1.if_click_push
with cube
--------------------------------------lateralViewExplode--------------------------------------
--v3
create temporary table t_device as
select
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from
test.dwf_act_device_info_d
where dt = '2019-06-25'
and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from (
select dt,hdid,test_info
from test.dwf_act_device_info_d t
where dt = '2019-06-25'
)q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select
hdid
,event['page_source'] as page_source
,sum(if(event_id='10201-0006',1,0)) as tab_show_pv --子tab曝光次数
,sum(if(event_id='10201-0001',1,0)) as cover_show_pv --视频封面曝光次数
,sum(if(event_id='10201-0002',1,0)) as cover_click_pv --视频封面点击次数
,sum(if(event_id='10202-0001',1,0)) as detail_show_pv --详情页曝光次数
,sum(if(event_id='10202-0002',1,0)) as play_pv --视频播放次数
,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0)) as first_play_pv --视频首播次数(剔除循环播放)
,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv --视频完播次数
from test.dwv_consumer_event_d
where dt = '2019-06-25'
and event_id in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002')
and event['page_source'] in ('1','2','3')
group by hdid
,event['page_source']
;
create temporary table tmp_base as
select
'2019-06-25' as dt
,t1.hdid
,nvl(t1.country ,'unknown') as country
,nvl(t1.sys ,'unknown') as sys
,nvl(t1.ver ,'unknown') as ver
,nvl(t1.device_type ,'unknown') as device_type
,nvl(t2.layer_id_name ,'unknown') as layer_id_name
,nvl(t2.test_id_name ,'unknown') as test_id_name
,nvl(t3.page_source ,'unknown') as page_source
,nvl(t1.if_click_push ,'unknown') as if_click_push
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
from t_device as t1
join tmp_hdid_test_info as t2 on t1.hdid=t2.hdid
left join t_consume_event as t3 on t1.hdid=t3.hdid
group by
t1.hdid
,nvl(t1.country ,'unknown')
,nvl(t1.sys ,'unknown')
,nvl(t1.ver ,'unknown')
,nvl(t1.device_type ,'unknown')
,nvl(t2.layer_id_name ,'unknown')
,nvl(t2.test_id_name ,'unknown')
,nvl(t3.page_source ,'unknown')
,nvl(t1.if_click_push ,'unknown')
;
--切记要把膨胀后的结果落地
create temporary table tmp_explode as
select
hdid
,countrys
,syss
,vers
,device_types
,layer_id_names
,test_id_names
,page_sources
,if_click_pushs
,tab_show_pv
,cover_show_pv
,cover_click_pv
,detail_show_pv
,play_pv
,first_play_pv
,full_play_pv
from tmp_base t1
lateral view explode(split(concat(country ,'#all'),'#'))a as countrys
lateral view explode(split(concat(sys ,'#all'),'#'))a as syss
lateral view explode(split(concat(ver ,'#all'),'#'))a as vers
lateral view explode(split(concat(device_type ,'#all'),'#'))a as device_types
lateral view explode(split(concat(layer_id_name,'#all'),'#'))a as layer_id_names
lateral view explode(split(concat(test_id_name ,'#all'),'#'))a as test_id_names
lateral view explode(split(concat(page_source ,'#all'),'#'))a as page_sources
lateral view explode(split(concat(if_click_push,'#all'),'#'))a as if_click_pushs
;
select countrys
,syss
,vers
,device_types
,layer_id_names
,test_id_names
,page_sources
,if_click_pushs
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
,sum(distinct if(tab_show_pv >0,hdid,null)) as tab_show_uv --子tab曝光用户数
,sum(distinct if(cover_show_pv >0,hdid,null)) as cover_show_uv --视频封面曝光用户数
,sum(distinct if(cover_click_pv>0,hdid,null)) as cover_click_uv --视频封面点击用户数
,sum(distinct if(detail_show_pv>0,hdid,null)) as detail_show_uv --详情页曝光用户数
,sum(distinct if(play_pv >0,hdid,null)) as play_uv --视频播放用户数
,sum(distinct if(first_play_pv >0,hdid,null)) as first_play_uv --视频首播用户数(剔除循环播放)
,sum(distinct if(full_play_pv >0,hdid,null)) as full_play_uv --视频完播用户数
from tmp_explode
group by
countrys
,syss
,vers
,device_types
,layer_id_names
,test_id_names
,page_sources
,if_click_pushs
;
--------------------------------------MiniSum--------------------------------------
--v4
create temporary table t_device as
select
dt,
hdid,
sys,
ver,
if(country in ('印度','印度尼西亚','巴西'),country,'其它') as country,
device_type,
if_click_push
from
test.dwf_act_device_info_d
where dt = '2019-06-25'
and regexp_extract(ver,'(\\d+\\.){1,2}(\\d+)',0)>= '1.4';
create temporary table tmp_hdid_test_info as
select dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name
from (
select dt,hdid,test_info
from test.dwf_act_device_info_d t
where dt = '2019-06-25'
)q
lateral view explode(test_info) test_info_tb as test
group by dt,hdid,test.layer_id,test.test_id,test.layer_id_name,test.test_id_name;
create temporary table t_consume_event as
select
hdid
,event['page_source'] as page_source
,sum(if(event_id='10201-0006',1,0)) as tab_show_pv --子tab曝光次数
,sum(if(event_id='10201-0001',1,0)) as cover_show_pv --视频封面曝光次数
,sum(if(event_id='10201-0002',1,0)) as cover_click_pv --视频封面点击次数
,sum(if(event_id='10202-0001',1,0)) as detail_show_pv --详情页曝光次数
,sum(if(event_id='10202-0002',1,0)) as play_pv --视频播放次数
,sum(if(event_id='10202-0002' and event['loop_cnt']='1',1,0)) as first_play_pv --视频首播次数(剔除循环播放)
,sum(if(event_id='10202-0002' and event['report_type']='1',1,0)) as full_play_pv --视频完播次数
from test.dwv_consumer_event_d
where dt = '2019-06-25'
and event_id in ('10201-0006','10201-0001','10201-0002','10202-0001','10202-0002')
and event['page_source'] in ('1','2','3')
group by hdid
,event['page_source']
;
with tmp_base as
(
select
'2019-06-25' as dt
,hdid
,country
,sys
,ver
,device_type
,nvl(layer_id_name ,'all') as layer_id_name
,nvl(test_id_name ,'all') as test_id_name
,nvl(page_source ,'all') as page_source
,if_click_push
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
from (
select
t1.hdid
,nvl(t1.country ,'unknown') as country
,nvl(t1.sys ,'unknown') as sys
,nvl(t1.ver ,'unknown') as ver
,nvl(t1.device_type ,'unknown') as device_type
,nvl(t2.layer_id_name ,'unknown') as layer_id_name
,nvl(t2.test_id_name ,'unknown') as test_id_name
,nvl(t3.page_source ,'unknown') as page_source
,nvl(t1.if_click_push ,'unknown') as if_click_push
,tab_show_pv
,cover_show_pv
,cover_click_pv
,detail_show_pv
,play_pv
,first_play_pv
,full_play_pv
from t_device as t1
join tmp_hdid_test_info as t2 on t1.hdid=t2.hdid
left join t_consume_event as t3 on t1.hdid=t3.hdid
) as t
group by
hdid
,country
,sys
,ver
,device_type
,layer_id_name
,test_id_name
,page_source
,if_click_push
grouping sets (
(hdid,country,sys,ver,device_type,if_click_push,layer_id_name,test_id_name,page_source)
,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name,test_id_name)
,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name,page_source)
,(hdid,country,sys,ver,device_type,if_click_push,test_id_name,page_source)
,(hdid,country,sys,ver,device_type,if_click_push,page_source)
,(hdid,country,sys,ver,device_type,if_click_push,layer_id_name)
,(hdid,country,sys,ver,device_type,if_click_push,test_id_name)
,(hdid,country,sys,ver,device_type,if_click_push)
)
)
,tmp_explode as
(
select
'2019-06-25' as dt
,country
,sys
,ver
,device_type
,layer_id_name
,test_id_name
,page_source
,if_click_push
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
,sum(if(tab_show_pv >0,1,0)) as tab_show_uv --子tab曝光用户数
,sum(if(cover_show_pv >0,1,0)) as cover_show_uv --视频封面曝光用户数
,sum(if(cover_click_pv>0,1,0)) as cover_click_uv --视频封面点击用户数
,sum(if(detail_show_pv>0,1,0)) as detail_show_uv --详情页曝光用户数
,sum(if(play_pv >0,1,0)) as play_uv --视频播放用户数
,sum(if(first_play_pv >0,1,0)) as first_play_uv --视频首播用户数(剔除循环播放)
,sum(if(full_play_pv >0,1,0)) as full_play_uv --视频完播用户数
from tmp_base
group by
country
,sys
,ver
,device_type
,layer_id_name
,test_id_name
,page_source
,if_click_push
)
select countrys
,syss
,vers
,device_types
,layer_id_name
,test_id_name
,page_source
,if_click_pushs
,sum(tab_show_pv ) as tab_show_pv --子tab曝光次数
,sum(cover_show_pv ) as cover_show_pv --视频封面曝光次数
,sum(cover_click_pv ) as cover_click_pv --视频封面点击次数
,sum(detail_show_pv ) as detail_show_pv --详情页曝光次数
,sum(play_pv ) as play_pv --视频播放次数
,sum(first_play_pv ) as first_play_pv --视频首播次数(剔除循环播放)
,sum(full_play_pv ) as full_play_pv --视频完播次数
,sum(tab_show_uv ) as tab_show_uv --子tab曝光用户数
,sum(cover_show_uv ) as cover_show_uv --视频封面曝光用户数
,sum(cover_click_uv ) as cover_click_uv --视频封面点击用户数
,sum(detail_show_uv ) as detail_show_uv --详情页曝光用户数
,sum(play_uv ) as play_uv --视频播放用户数
,sum(first_play_uv ) as first_play_uv --视频首播用户数(剔除循环播放)
,sum(full_play_uv ) as full_play_uv --视频完播用户数
from tmp_explode
lateral view explode(split(concat(country ,'#all'),'#'))a as countrys
lateral view explode(split(concat(sys ,'#all'),'#'))a as syss
lateral view explode(split(concat(ver ,'#all'),'#'))a as vers
lateral view explode(split(concat(device_type ,'#all'),'#'))a as device_types
lateral view explode(split(concat(if_click_push,'#all'),'#'))a as if_click_pushs
group by
countrys
,syss
,vers
,device_types
,layer_id_name
,test_id_name
,page_source
,if_click_pushs
;