Hive map type to string / hive map aggr

Reposted

技术博客达人 2023-08-18 22:27:40


How Join Is Implemented

The following SQL is used as an example to explain how join is implemented:

select u.name, o.orderid from order o join user u on o.uid = u.uid;

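In the map output value, rows from different tables are marked with a tag; in the reduce phase the tag is used to determine which table each row came from. The MapReduce process is as follows: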


Figure: MapReduce CommonJoin implementation
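The tagging idea can be sketched in plain Python. This is only a simulation of the map, shuffle, and reduce steps, not Hive's actual operator code; the sample rows, the `name` column on the user table, and the helper names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the column layout is an assumption for illustration.
orders = [(1, "o1"), (2, "o2"), (1, "o3")]   # (uid, orderid)
users = [(1, "alice"), (2, "bob")]           # (uid, name)

ORDER_TAG, USER_TAG = 0, 1

def map_phase(orders, users):
    """Emit (join key, (tag, payload)) so reduce can tell the tables apart."""
    for uid, orderid in orders:
        yield uid, (ORDER_TAG, orderid)
    for uid, name in users:
        yield uid, (USER_TAG, name)

def shuffle(pairs):
    """Group map output by join key, standing in for the shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Split each key's values by tag, then pair every user row with every order row."""
    for uid in sorted(groups):
        order_ids = [v for tag, v in groups[uid] if tag == ORDER_TAG]
        names = [v for tag, v in groups[uid] if tag == USER_TAG]
        for name in names:
            for orderid in order_ids:
                yield name, orderid

joined = list(reduce_phase(shuffle(map_phase(orders, users))))
print(joined)
```

Because both tables emit the same join key (`uid`), matching rows meet in the same reduce call, and the tag is the only thing that tells the reducer which side a value belongs to.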

Group By

Principle
The group by columns are used as the key, i.e. groupBy_key. In the shuffle phase, groupBy_key is the shuffle key and rows are hash-partitioned on it.

select 
	gender,
	count(1) 
from user 
group by gender;
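A minimal Python sketch of this flow follows: each row becomes a (groupBy_key, 1) record, the record is routed to a reduce task by hashing the key, and the reduce side sums the values. `zlib.crc32` is only a deterministic stand-in for the partitioner's hash function, and the helper names and sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict
import zlib

def partition(key, reduce_task_num):
    # Deterministic stand-in for the hash partitioner (not Hive's actual hash).
    return zlib.crc32(key.encode("utf-8")) % reduce_task_num

def group_by_count(genders, reduce_task_num):
    # Shuffle: route every (groupBy_key, 1) record to a reduce task by hash.
    buckets = defaultdict(list)
    for gender in genders:
        buckets[partition(gender, reduce_task_num)].append((gender, 1))
    # Reduce: sum the values for the keys each task received.
    result = Counter()
    for records in buckets.values():
        for key, one in records:
            result[key] += one
    # Also report how many records each reduce task had to process.
    load = {task: len(records) for task, records in buckets.items()}
    return dict(result), load

genders = ["male", "female", "male", "male", "female"]
counts, load = group_by_count(genders, reduce_task_num=4)
print(counts)  # count(1) per gender
print(load)    # records per reduce task
```

With only two distinct keys, at most two of the four reduce tasks receive any data, which is why the distribution of groupBy_key matters for balance.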


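Data skew caused by group by
When hash(groupBy_key) % reduceTaskNum is unevenly distributed, some reduce tasks receive far more data than others, which causes data skew.
Optimization 1: pre-aggregation
Setting: hive.map.aggr=true
This parameter controls whether group by performs partial aggregation on the map side. It is enabled by default.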

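Applicable scenarios:
The aggregation can be computed partially without changing the business result (count, sum, min, max). count distinct switches pre-aggregation off.
The more often groupBy_key values repeat, the better the effect. If groupBy_key is a unique key, enabling this parameter is pointless and wastes compute resources.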
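The effect of map-side pre-aggregation can be illustrated with a small Python simulation. It compares how many records leave the map tasks with and without a local hash aggregation; the input splits and function names are hypothetical, and the sketch does not model Hive's memory limits or flushing.

```python
from collections import Counter

def map_without_aggr(split):
    # One output record per input row.
    return [(key, 1) for key in split]

def map_with_aggr(split):
    # Hash aggregation inside the map task: one output record per distinct key.
    return list(Counter(split).items())

def reduce_sum(map_outputs):
    # The reduce side adds up partial counts; the logic is the same either way.
    total = Counter()
    for output in map_outputs:
        for key, partial in output:
            total[key] += partial
    return dict(total)

# Two hypothetical input splits with heavily repeated keys.
splits = [["male"] * 5 + ["female"], ["male"] * 4 + ["female"] * 2]

plain = [map_without_aggr(s) for s in splits]
combined = [map_with_aggr(s) for s in splits]

shuffled_plain = sum(len(o) for o in plain)
shuffled_combined = sum(len(o) for o in combined)
print(shuffled_plain, shuffled_combined)           # records sent through shuffle
print(reduce_sum(plain) == reduce_sum(combined))   # final result is unchanged
```

In this toy input the shuffle volume drops from 12 records to 4 while the final counts stay the same. With a unique key, `map_with_aggr` would emit as many records as it reads, so the hash table would be pure overhead.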
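Related parameters:
hive.groupby.mapaggr.checkinterval = 100000
hive.map.aggr.hash.min.reduction = 0.5
These two parameters control when map-side aggregation is switched off. Each map task first tries hash aggregation on its first 100000 records. If (number of records after aggregation) / 100000 > 0.5, the groupBy_key has few duplicates and continuing the partial aggregation is pointless, so after those 100000 records the pre-aggregation is switched off automatically and the task falls back to ordinary aggregation.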
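The switch-off rule described above can be written as a short Python check. This is a sketch of the decision only, assuming the ratio is computed as distinct keys over sampled rows; the function name `should_keep_map_aggr` is hypothetical, and the demo uses a small check interval so it runs instantly.

```python
def should_keep_map_aggr(keys, check_interval=100000, min_reduction=0.5):
    """Return True if hash aggregation on the first rows shrinks them enough."""
    sample = keys[:check_interval]
    if not sample:
        return True
    # Records left after aggregation equals the number of distinct keys seen.
    ratio = len(set(sample)) / len(sample)
    # A ratio above min_reduction means too few duplicates: stop pre-aggregating.
    return ratio <= min_reduction

repeated_keys = ["a", "b"] * 50      # 100 rows, 2 distinct keys
unique_keys = list(range(100))       # 100 rows, 100 distinct keys

keep_repeated = should_keep_map_aggr(repeated_keys, check_interval=100)
keep_unique = should_keep_map_aggr(unique_keys, check_interval=100)
print(keep_repeated, keep_unique)
```

The repeated keys give a ratio of 0.02 and keep pre-aggregation on, while the unique keys give a ratio of 1.0 and turn it off.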

How Distinct Is Implemented

distinct needs global deduplication, so pre-aggregation cannot be used

select 
	gender,
	count(distinct id) 
from user
group by gender

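Because the map side would have to hold every user id, the map aggregation switch is turned off automatically. The computation then becomes unbalanced: in this example only 2 reduce tasks do the aggregation, and each of them processes 10 billion records.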

1. A single distinct column

select 
	dealid, 
	count(distinct uid) num 
from order 
group by dealid;


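Rows are partitioned by the GroupBy column, which serves as the shuffle key.
The GroupBy column and the Distinct column together form the key that the reduce input is sorted by, so remembering only the LastKey in the reduce phase is enough to deduplicate.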
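The LastKey technique can be sketched in Python as follows. Sorting the (dealid, uid) pairs stands in for the shuffle sort, and a single loop plays the role of one reduce task; partitioning across several reduce tasks is not modeled, and the function name and sample rows are assumptions for illustration.

```python
def count_distinct_by_group(rows):
    """rows: iterable of (dealid, uid). Returns {dealid: count(distinct uid)}."""
    # The composite key is (dealid, uid); sorting stands in for the shuffle sort.
    sorted_keys = sorted(rows)
    counts = {}
    last_key = None
    for key in sorted_keys:
        dealid, _uid = key
        counts.setdefault(dealid, 0)
        # Equal keys are adjacent after sorting, so comparing with the
        # previous key is enough to detect a duplicate.
        if key != last_key:
            counts[dealid] += 1
            last_key = key
    return counts

rows = [(1, "u1"), (1, "u2"), (1, "u1"), (2, "u3"), (2, "u3")]
result = count_distinct_by_group(rows)
print(result)
```

The reducer keeps one previous key in memory instead of a set of every uid it has seen, which is what makes the approach workable for large groups.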

2. Multiple distinct columns

select 
	dealid, 
	count(distinct uid), 
	count(distinct date) 
from order 
group by dealid;

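In essence this works the same way as a single distinct column, but some preprocessing is needed to reach the same effect:

Number all the distinct columns, so that each input row produces n rows, one per distinct column.
Use groupByKey + number + distinctKey as the map output key, so that the values of each distinct column are sorted separately.
Use groupByKey + number + distinctKey as the reduce key as well, so that recording the LastKey in the reduce phase is enough to deduplicate.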
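The steps above can be sketched in Python. Each input row is expanded into one key per distinct column, the keys are sorted, and a LastKey comparison counts each (group, column number, value) combination once. The helper names and sample rows are hypothetical, and the sketch runs as a single simulated reduce task.

```python
from collections import defaultdict

def expand(row, distinct_columns):
    """Emit one (groupByKey, column number, distinct value) key per distinct column."""
    dealid = row["dealid"]
    return [(dealid, idx, row[col]) for idx, col in enumerate(distinct_columns)]

def count_multi_distinct(rows, distinct_columns):
    # Each input row yields n keys; sorting stands in for the shuffle sort.
    keys = sorted(k for row in rows for k in expand(row, distinct_columns))
    counts = defaultdict(lambda: [0] * len(distinct_columns))
    last_key = None
    for key in keys:
        dealid, idx, _value = key
        # Keys for the same column number are adjacent, so LastKey still works.
        if key != last_key:
            counts[dealid][idx] += 1
            last_key = key
    return {dealid: tuple(c) for dealid, c in counts.items()}

rows = [
    {"dealid": 1, "uid": "u1", "date": "2023-01-01"},
    {"dealid": 1, "uid": "u2", "date": "2023-01-01"},
    {"dealid": 1, "uid": "u1", "date": "2023-01-02"},
    {"dealid": 2, "uid": "u3", "date": "2023-01-01"},
]
result = count_multi_distinct(rows, ["uid", "date"])
print(result)  # {dealid: (count(distinct uid), count(distinct date))}
```

The column number in the middle of the key keeps uid values and date values from interleaving, at the cost of sending n records through the shuffle for every input row.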


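The following SQL is used as an example to explain how distinct is implemented:

select dealid, count(distinct uid) num from order group by dealid;

When there is only one distinct column, and the map-side Hash GroupBy is left aside, it is enough to combine the GroupBy column and the Distinct column into the map output key and rely on the MapReduce sort, while using the GroupBy column as the reduce key. Keeping the LastKey in the reduce phase then completes the deduplication.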
