1 问题描述
有一张日志表log表(memberid,pvtime),和会员表memberid(memberid),统计每一个会员总浏览时长。
2 问题分析
求解该问题时一般先用日志表与会员表进行join过滤出会员的日志信息,但是在过滤日志时进行join时候,由于每个会员活跃程度不一样,出现部分会员非常活跃,导致关联时key分布不均出现数据倾斜。此时呢,会员表相对于日志表来说比较小,不是很大,但是走mapjoin的话资源又不足,此时我们给出一种比较通用的解法倍数小表法(将每个memberid数据扩容,相当于表扩容)。表扩容(key扩容)我们采用lateral view explode() 的方法,如下将小表扩容10倍
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd --生成随机数用于均匀发往reduce中
第一步生成倍数表
select memberid, rd
from members d
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd
第二步:日志表与会员表关联。
日志表与会员表关联时候需要按照随机数均匀进行分散,发往各个redcue中,所以关联条件中需要加入随机数条件。
select t1.memberid memberid
,t2.pvtime pvtime
,t1.rd rd
from
(select memberid,pvtime,cast(rand()*10 as int) + 1 as rd
from log
) t2
join
(select memberid, rd
from members d
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd
) t1
on t1.memberid = t2.memberid and t1.rd = t2.rd
或者
select t1.memberid memberid
,t2.pvtime pvtime
,t1.rd rd
from
log t2
join
(select memberid, rd
from members d
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd
) t1
on t1.memberid = t2.memberid and t1.rd = pmod(t2.pvtime,10)+1 --思考pmod()函数用法
第三步:按照随机数部分分组汇总(优化group by)
select rd,memberid,sum(pvtime) pvtime
from(
select t1.memberid memberid
,t2.pvtime pvtime
,t1.rd rd
from
(select memberid,pvtime,cast(rand()*10 as int) + 1 as rd
from log
) t2
join
(select memberid, rd
from members d
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd
) t1
on t1.memberid = t2.memberid and t1.rd = t2.rd
) t
group by rd,memberid --一定注意rd随机数在前,memberid在后,思考为什么?
第四步:按会员id整体汇总
select memberid,sum(pvtime) pvtime
from
(
select rd,memberid,sum(pvtime) pvtime
from(
select t1.memberid memberid
,t2.pvtime pvtime
,t1.rd rd
from
(select memberid,pvtime,cast(rand()*10 as int) + 1 as rd
from log
) t2
join
(select memberid, rd
from members d
lateral view explode(array(1,2,3,4,5,6,7,8,9,10)) mytable as rd
) t1
on t1.memberid = t2.memberid and t1.rd = t2.rd
) t
group by rd,memberid --一定注意rd随机数在前,memberid在后,思考为什么?
) t
group by memberid
2 小结
本文给出了一种利用倍数小表优化数据倾斜的一种通用方法,该方法适用场景为数据倾斜时不能用mapjoin的时候,也就是集群资源不足时候,通过该方法能够有效缓解数据倾斜,但不能根除数据倾斜。处理数据倾斜最好的方法还是采用分治思想,利用mapjoin一分为二的处理,倾斜key单独走mapjoin,非倾斜key走reduce join,最终将数据union all起来,采用mapjoin的方法切断了shuffle过程,也就没有数据倾斜这一说,缺点耗资源,前提是集群资源足够条件下,且满足mapjoin的条件。