HiveSQL常用技巧
- 一、去重技巧 group by <-- distinct
- 在2019年购买后又退款的用户
- 二、聚合技巧--利用窗口函数grouping sets/cube/rollup
- 1. grouping sets
- 用户性别分布及每个性格的城市分布
- 2. cube
- 性别、城市、等级的各种组合的用户分布
- 3. rollup
- 同时计算出每个月的支付金额,以及每年的支付金额
- 三、union all 时可以开启并发执行
- 每个用户的支付和退款金额汇总
- 四、使用lateral view 进行行列转换
- 每个品类的购买用户数
- 五、表连接优化
- 六、解决数据倾斜
- 数据倾斜表现
- 数据倾斜产生原因和解决办法
- 七、如何计算按月累计去重
- 用sum() over() 计算按一定周期进行累计求和,如何计算按月累计去重?
一、去重技巧 group by <-- distinct
###取出user_trade表中全部支付用户
原来写法:
select distinct user_name
from user_trade
where dt >'0';
优化写法:
select user_name
from user_trade
where dt >'0'
group by user_name;
执行时长对比:
原有写法:
优化写法:
在2019年购买后又退款的用户
原有写法:
select a.user_name
from
(select distinct user_name
from user_trade
where year(dt)=2019) a
join
(select distict user_name
from user_refund
where year(dt)=2019 ) b on a.user_name= b.name;
优化写法:
select a.user_name
from
(select user_name
from user_trade
where year(dt)=2019
group by user_name ) a
join
(select user_name
from user_refund
where year(dt)=2019
group by user_name ) b on a.user_name= b.name;
注意:使用场景仅限于去重,不可以用在去重计算中count(distinct ** )
当数据量比较大且重复值比较多的时候,使用group by 去重,再count() 计数,比count(distinct ** )效率高
二、聚合技巧–利用窗口函数grouping sets/cube/rollup
1. grouping sets
用户的性别分布、城市分布、等级分布
常规写法:
性别分布:select sex, count(distinct user_id) from user_info group by sex;
城市分布:
```
select city,
count(distinct user_id)
from user_info
group by city;
```
等级分布:
```
select level,
count(distinct user_id)
from user_info
group by level;
```
缺点:分别写三次SQL,要执行三次,重复工作费时
优化写法:
select sex,
city,
level,
count(distinct user_id)
from user_info
grouping sets(sex,city,level);
等价于将不同维度的group by 结果进行union all
用户性别分布及每个性格的城市分布
常规写法:
性别分布
select sex,
count(distinct user_id)
from user_info
group by sex;
每个性别的城市分布:
select sex,
city,
count(distinct user_id)
from user_info
group by sex, city;
优化写法:
select sex,
city,
count(distinct user_id)
from user_info
group by sex, city
grouping sets (set,(set,city));
注意:第二列为空即为性别的分布;
2. cube
cube:根据group by维度的所有组合进行聚合
性别、城市、等级的各种组合的用户分布
常规写法:
select sex,
city,
level,
count(distinct user_id)
from user_info
group by sex, city, level
grouping sets (set,city,level, (set,city),(sex,level),(city,level),(sex,city,level));
优化写法:
select sex,
city,
level,
count(distinct user_id)
from user_info
group by sex, city, level
with cube;
3. rollup
rollup: 以最左侧的维度为主,进行层级聚合,是cube的子集
同时计算出每个月的支付金额,以及每年的支付金额
常规写法:
select a.dt,
sum(a.year_amount),
sum(a.month_amount)
from
(select substr(dt,1,4) as dt,
sum(pay_amount) as year_amount,
0 as month_amount
from user_trade
where dt>'0'
gourp by substr(dt,1,4)
union all
select substr(dt,1,7) as dt,
0 as year_amount,
sum(pay_amount) as month_amount
from user_trade
where dt>'0'
group by substr(dt,1,7) ) a
group by a.dt;
优化写法:
select year(dt) as year,
month(dt) as month,
sum(pay_amount)
from user_trade
where dt>'0'
group by year(dt),month(dt)
with rollup;
三、union all 时可以开启并发执行
参数设置:set hive.exec.parallel=true 开启并发执行,提高执行效率
每个用户的支付和退款金额汇总
select a.user_name,
a.pay_amount,
a.refund_amount
from
(select user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
from user_trade
where dt>'0'
group by user_name
union all
select user_name,
0 as pay_amound,
sum(refund_amount) as refund_amount
from user_refund
where dt>'0'
group by user_name) a
group by a.user_name;
时间对比:
未开启并发:
开启并发:
四、使用lateral view 进行行列转换
每个品类的购买用户数
select b.category,
count(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail,',')) b as category
group by b.category;
explode:行转列函数
concat_ws(’,’ , collect_set(column)) :列转行函数
五、表连接优化
- 小表在前,大表在后
hive 默认查询最后一个表是大表,会将其他表缓存起来,然后扫描最后那个表 - 使用相同的连接键
当对3个或以上表进行join连接时,如果每个on 都使用相同的连接键,那么只会产生一个MapReduce job - 尽早过滤数据
- 逻辑复杂时 引入中间表
六、解决数据倾斜
数据倾斜表现
任务进度长时间维持在99%(或者100%),任务监控页面,只有少量(1个或者几个)reduce子任务未完成。因为其处理的数据量和其他reduce差异比较大
数据倾斜产生原因和解决办法
- 空值产生的数据倾斜
解决:如果两个表连接时,使用 的连接条件有很多空值时,建议在连接条件中增加过滤
列如:on a.user_id=b.user_id and a.user_id is not null - 大小表连接(一张表很大,另一张表很小)
解决:将小表放在内存中,在map端做join
select b.*
from a join b on a.*=b.*;
- 两个表连接条件的字段数据类型不一致
解决:两个表连接条件的字段数据类型转换成一致的
列如:on a.user_id = cast(b.user_id as string)
七、如何计算按月累计去重
用sum() over() 计算按一定周期进行累计求和,如何计算按月累计去重?
2017年、2018年按月累计去重的购买用户数
select b.year,
b.month,
sum(b.user_nem) over (partition by b.year order by b.month)
from
(select a.year,
a.month,
count(distinct a.user_name) user_num
from
(select year(dt) as dt,
user_name,
min(month(dt)) as month
from user_trade
where year(dt) in (2017,2018)
group by year(dt), user_name ) a
group by a.year, a.month ) b
order by b.year, b.month
limit 24;
第二种写法:
set hive.mapred.mode=nonstrict;
select
from
(select substr(dt,1,7 ) as month,
user_name
from user_trade
where year(dt) in (2017,2018)
group by substr(dt,1,7 ) as month,
user_name) a
cross join
(select
from user_trade
where )