hive sql 原理 hive sql教程

转载

lazihuman 2023-07-12 21:24:38

文章标签 hive sql 原理 hive 大数据 ci 数据倾斜 文章分类 Hive 大数据

HiveSQL常用技巧

一、去重技巧 group by <-- distinct

在2019年购买后又退款的用户

二、聚合技巧--利用窗口函数grouping sets/cube/rollup

1. grouping sets
用户性别分布及每个性格的城市分布
2. cube
性别、城市、等级的各种组合的用户分布
3. rollup
同时计算出每个月的支付金额，以及每年的支付金额

三、union all 时可以开启并发执行

每个用户的支付和退款金额汇总

四、使用lateral view 进行行列转换

每个品类的购买用户数

五、表连接优化
六、解决数据倾斜

数据倾斜表现
数据倾斜产生原因和解决办法

七、如何计算按月累计去重

用sum() over() 计算按一定周期进行累计求和，如何计算按月累计去重？

一、去重技巧 group by <-- distinct

###取出user_trade表中全部支付用户
原来写法：

select distinct user_name
from user_trade
where dt >'0';

优化写法：

select user_name
from user_trade
where dt >'0'
group by user_name;

执行时长对比：

原有写法：

hive sql 原理 hive sql教程_数据倾斜

优化写法：

hive sql 原理 hive sql教程_大数据_02

在2019年购买后又退款的用户

原有写法：

select a.user_name
from
	(select distinct  user_name
	from user_trade
	where year(dt)=2019) a 
	join 
	(select distict user_name
	from user_refund 
	where year(dt)=2019 ) b on a.user_name= b.name;

优化写法：

select a.user_name
from
	(select   user_name
	from user_trade
	where year(dt)=2019
	group by user_name ) a 
	join 
	(select  user_name
	from user_refund 
	where year(dt)=2019
	group by user_name ) b on a.user_name= b.name;

注意：使用场景仅限于去重，不可以用在去重计算中count(distinct ** )
当数据量比较大且重复值比较多的时候，使用group by 去重，再count() 计数，比count(distinct ** )效率高

二、聚合技巧–利用窗口函数grouping sets/cube/rollup

1. grouping sets

用户的性别分布、城市分布、等级分布
常规写法：
性别分布：
select sex, count(distinct user_id) from user_info group by sex; 城市分布：

```
	select city,
			count(distinct user_id)
	from user_info
	group by city;
	```
	
	等级分布：
	
	```
	select level,
			count(distinct user_id)
	from user_info
	group by level;
	```
	缺点：分别写三次SQL,要执行三次，重复工作费时

优化写法：

select sex,
		city, 
		level,
		count(distinct user_id)
from user_info
grouping sets(sex,city,level);

hive sql 原理 hive sql教程_hive_03

等价于将不同维度的group by 结果进行union all

用户性别分布及每个性格的城市分布

常规写法：

性别分布

select sex,
		count(distinct user_id)
from user_info
group by sex;

每个性别的城市分布：

select sex,
		city,
		count(distinct user_id)
from user_info
group by sex, city;

优化写法：

select sex,
		city,
		count(distinct user_id)
from user_info
group by sex, city
grouping sets (set,(set,city));

hive sql 原理 hive sql教程_大数据_04

注意：第二列为空即为性别的分布；

2. cube

cube：根据group by维度的所有组合进行聚合

性别、城市、等级的各种组合的用户分布

常规写法：

select sex,
		city,
		level,
		count(distinct user_id)
from user_info
group by sex, city, level
grouping sets (set,city,level, (set,city),(sex,level),(city,level),(sex,city,level));

优化写法：

select sex,
		city,
		level,
		count(distinct user_id)
from user_info
group by sex, city, level
with cube;

3. rollup

rollup: 以最左侧的维度为主，进行层级聚合，是cube的子集

同时计算出每个月的支付金额，以及每年的支付金额

常规写法：

select a.dt, 
		sum(a.year_amount),
		sum(a.month_amount)
from
	(select substr(dt,1,4) as dt,
			sum(pay_amount) as year_amount,
			0 as month_amount
	from user_trade
	where dt>'0'
	gourp by substr(dt,1,4)
	union all
	select substr(dt,1,7) as dt,
			0 as year_amount,
			sum(pay_amount) as month_amount
	from user_trade
	where dt>'0'
	group by substr(dt,1,7) )  a
group by a.dt;

hive sql 原理 hive sql教程_大数据_05

优化写法：

select  year(dt) as year,
		month(dt) as month,
		sum(pay_amount)
from user_trade
where dt>'0'
group by year(dt),month(dt)
with rollup;

hive sql 原理 hive sql教程_大数据_06

三、union all 时可以开启并发执行

参数设置：set hive.exec.parallel=true 开启并发执行，提高执行效率

每个用户的支付和退款金额汇总

select  a.user_name,
		a.pay_amount,
		a.refund_amount
from
	(select  user_name,
	 		sum(pay_amount) as pay_amount,
	 		0 as refund_amount
	from user_trade
	where dt>'0'
	group by user_name
	union all
	select user_name,
			0 as pay_amound,
			sum(refund_amount) as refund_amount
	from user_refund
	where dt>'0'
	group by user_name) a 
group by a.user_name;

时间对比：

未开启并发：

hive sql 原理 hive sql教程_hive_07

hive sql 原理 hive sql教程_数据倾斜_08

开启并发：

hive sql 原理 hive sql教程_大数据_09

hive sql 原理 hive sql教程_数据倾斜_10

四、使用lateral view 进行行列转换

每个品类的购买用户数

select b.category,
		count(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail,',')) b as category
group by b.category;

hive sql 原理 hive sql教程_ci_11

explode:行转列函数

concat_ws(’,’ , collect_set(column)) :列转行函数

五、表连接优化

小表在前，大表在后
hive 默认查询最后一个表是大表，会将其他表缓存起来，然后扫描最后那个表
使用相同的连接键
当对3个或以上表进行join连接时，如果每个on 都使用相同的连接键，那么只会产生一个MapReduce job
尽早过滤数据
逻辑复杂时引入中间表

六、解决数据倾斜

数据倾斜表现

任务进度长时间维持在99%（或者100%），任务监控页面，只有少量（1个或者几个）reduce子任务未完成。因为其处理的数据量和其他reduce差异比较大

数据倾斜产生原因和解决办法

空值产生的数据倾斜
解决：如果两个表连接时，使用的连接条件有很多空值时，建议在连接条件中增加过滤
列如：on a.user_id=b.user_id and a.user_id is not null
大小表连接（一张表很大，另一张表很小）
解决：将小表放在内存中，在map端做join

select b.*
from a join b on a.*=ｂ.*;

两个表连接条件的字段数据类型不一致
解决：两个表连接条件的字段数据类型转换成一致的
列如：on a.user_id = cast(b.user_id as string)

七、如何计算按月累计去重

用sum() over() 计算按一定周期进行累计求和，如何计算按月累计去重？

2017年、2018年按月累计去重的购买用户数

select b.year,
		b.month,
		sum(b.user_nem) over (partition by b.year order by b.month)
from
		(select a.year,
				a.month,
				count(distinct a.user_name) user_num
		from
				(select  year(dt) as dt,
						user_name,
						min(month(dt)) as month
				from user_trade
				where year(dt) in (2017,2018)
				group by year(dt), user_name ) a
		group by a.year, a.month ) b 
order by b.year, b.month
limit 24;

hive sql 原理 hive sql教程_hive_12

第二种写法:

set hive.mapred.mode=nonstrict;
select 
from 
	(select  substr(dt,1,7 ) as month,
			user_name
	from user_trade
	where year(dt) in (2017,2018)
	group by substr(dt,1,7 ) as month,
			user_name) a
	cross join
	(select 
	from user_trade
	where )

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：hive sql修数据里面的内容 hive数据库

下一篇：hive可以写hbase hbase和hive怎么配合使用

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯