hive 对列去重 hive中哪些命令可以去重

转载

mob6454cc623087 2023-07-12 11:14:57

文章标签 hive 对列去重 Hive 数据 ide 文章分类 Hive 大数据

一、Hive中collect_list和collect_set的区别

二、Hive中四种排序(order by、sort by、distribute by、cluster by)

2.1 order by
2.1 sort by
2.1 distribute by
2.1 cluster by

三、Hive的累加操作

四、Hive分析窗口函数

一、Hive中collect_list和collect_set的区别

Hive中collect相关的函数有collect_list和collect_set.

它们都是将分组中的某列转换为一个数组返回，不同的是collect_list不重而collect去重。

做简单的实验加深理解，创建一张实验用表，存放用户每天点播视频的记录：

create table t_visit_viedo (
	username string,
	video_name string
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

我的本地数据目录路径：/home/hadoop/data，vi info.log：

john,PGD-585
sail,CJOD-160
john,PGD-876
sail,MIDD-799
sail,CJOD-160
fox,IPZ-218
fox,IPX-030
fox,KAWD-445

将数据加载到Hive表：

load data local inpath '/home/hadoop/data/info.log' overwrite into table t_visit_video partition(day='2019-09-18');

需求一：

按照用户分组，取出每个用户每天看过的所有视频的名字：
记住一句话：group by后面跟的词一定要在select中出现。

select username, collect_list(video_name) from t_visit_video
group by username;

这个SQL是要跑MapReduce作业的：

结果如下：

fox     ["IPZ-218","IPX-030","KAWD-445"]
john    ["PGD-585","PGD-876"]
sail    ["CJOD-160","MIDD-799","CJOD-160"]
Time taken: 224.447 seconds, Fetched: 3 row(s)

我们观察到上面的结果中，sail看的电影已经重复了；所以应该增加去重，使用collect_set，它与collect_list的区别就是增加了去重功能：

select username, collect_set(video_name) from t_visit_video
group by username;

输出结果：

fox     ["IPZ-218","IPX-030","KAWD-445"]
john    ["PGD-585","PGD-876"]
sail    ["CJOD-160","MIDD-799"]

sail的观看记录中CJOD-160只出现了一次，实现了去重效果。

突破group by限制：

我们还可以利用collect来突破group by的限制，Hive中在group by查询的时候要求出现在select后面的列都必须是出现在group by后面的，即select列必须是作为分组依据的列，但是有的时候我们想根据A进行分组然后随便取出每个分组中的一个B，代入到这个测试中就是按照用户进行分组，然后随便拿出一个他看过的视频名称即可：
video_name不是分组列，但我们以然能够取出这列中的数据

select username, collect_list(video_name)[0]
from t_visit_video
group by username;

取出分组中的第一列（下标也是从0开始的）：

fox     IPZ-218
john    PGD-585
sail    CJOD-160

二、Hive中四种排序(order by、sort by、distribute by、cluster by)

2.1 order by

语法：

order by //可以指定desc(降序)、asc(升序)

order by会对输入做全局排序，因此只有一个Reducer（多个Reducer无法保证全局有序），然而只有一个Reducer时，会导致输入规模较大时，消耗较长的计算时间。

实际举例：

1、Hive中创建表：

create table temperature (
year int,
temper float
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2、本地目录：/home/hadoop/data/下创建文件：

2019-09-09	32
2019-09-10	34
2019-09-11	28
2019-09-12	24
2019-09-13	18
2019-09-14	29
2019-09-15	32
2019-09-16	35
2019-09-17	27
2019-09-18	25
2019-09-19	30
2019-09-20	25

2.2 sort by

sort by 不是全局排序，其在数据进入reducer前完成排序，因此，如果使用sort by进行排序，并且设置mapred.reduce.tsaks>1,则sort by只会保证每个reducer的输出有序，并不能保证全局有序。
sort by不同于order by，它不受Hive.mapred.mode属性的影响，sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序。使用sort by你可以指定执行的reduce个数（通过set.mapred.reduce.tasks=n来指定），对输出的数据再执行归并排序，即可得到全部结果。

//设置reduce的个数为3；
set mapred.reduce.tasks
//查询此次任务中reduce的个数：
set mapred.reduce.tasks;

SQL语句：

select * from temperature sort by year;

2.3 distribute by

distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列，对应reduce的个数进行分发，默认采用hash算法。sort by为每一个reduce产生一个排序文件。在有些情况下，你需要控制某个特定的行应该到哪个reducer，这通常是为了进行后续的聚集操作。distribute by刚好可以做这件事情。因此，distribute by经常和sort by配合使用。

//根据年份和气温对气象数据进行排序，以确保所有具有相同年份的行最终都在一个分区reducer中

注意：DIstribute by和sort by的使用场景
1、Map输出的文件和大小不均
2、reduce输出文件大小不均
3、小文件多
4、文件大

三、Hive的累加解法

需求一：

给出一批数据，现要求出：

每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数。

数据准备：三个字段的意思分别是：用户名，月份，访问次数

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11

最后结果展示：

用户	月份		最大访问次数	总访问次数		当月访问次数
A	2015-01		33			33		  33
A	2015-02		33			43		  10
A	2015-03		38			81		  38
B	2015-01		30			30		  30
B	2015-02		30			45              15
B	2015-03		44			89		  44

解答：

1、在Hive中创建一个user_visit表：

create table user_visit (
name string,
month string,
num int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2、从本地导入数据到Hive中：

load data local inpath ‘/home/hadoop/data/test.txt’ into table user_visit;

3、求得当月访问次数

select name,month,sum(num) sn from user_visit
group by name,month;

输出结果：

+-------+----------+-----+--+
| name  |  month   | sn  |
+-------+----------+-----+--+
| A     | 2015-01  | 33  |
| A     | 2015-02  | 10  |
| A     | 2015-03  | 38  |
| B     | 2015-01  | 30  |
| B     | 2015-02  | 15  |
| B     | 2015-03  | 44  |
+-------+----------+-----+--+

hive 对列去重 hive中哪些命令可以去重_ide

4、创建一张临时表tmp_user_visit用于保存3中的数据：

create table tmp_user_visit 
as
select name,month,sum(num) sn from user_visit
group by name,month;

hive 对列去重 hive中哪些命令可以去重_Hive_02

5、根据tmp_user_visit表求出访问次数，将其连同临时表中的数据放入新表tmp_user_visitb表中

create table tmp_user_visitb 
as
select name,month,sn
sum(sn) over (partition by name order by month) as zcs
from tmp_user_visit;

6、根据临时表b求出最大访问次数，将其连同tmp_user_visitb的数据放入最终表formal_user_visit中：

create table formal_user_visit
as
select name,month,max(sn) over (partition by name order by month) as maxfw,zcs,sn
from tmp_user_visitb;

四、Hive分析窗口函数（Hive做累计、分组、排序、层次等计算）

应用场景：

1）、用于分区排序
2）、动态Group By
3）、Top N
4）、累计计算
5）、层次查询

Hive中提供了越来越多的分析函数，用于完成负责的统计分析。大致可以分为以下四类：

Hive分析窗口函数（一）：SUM,AVG,MIN,MAX

This section introdeces the Hive QL enhancements for windowing and analytics functions. See “Windowing Specifications in HQL” for details. Hive-896 has more information,including links to earlier documentation in the initial comments.

All of the windowing and analytics functions operator as per the SQL standard.

The current release supports the following functions for windowing and analytics:
1、Windowing functions（窗口函数）

LEAD
- The number of rows to lead can optionally be specified.If the number of rows to lead is not specified, the lead is one row.
- Returns null when the lead for the current row extends beyond the end of the window
LAG
- The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
- Returns null when the lag for the current row extends before the beginning of the window.
FIRST_VALUE
- This takes at most two parameters.The fist parameter is the column for which you want the first value, the second (optioned) parameter must be a boolean which is false by default. If set to true it skips null values.

LAST_VALUE

This takes at most two parameters, The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：android nexus 刷机工具包 nexus7 刷机包

下一篇：hash表是什么 java hash表优缺点

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯