Hive sort_array usage: sort by in Hive

Reposted

mob64ca140e4022 2023-09-07 18:42:34

All rows with the same Distribute By columns go to the same reducer.
https://www.docs4dev.com/docs/zh/apache-hive/3.1.1/reference/LanguageManual_SortBy.html

---------------

1、order by

order by in Hive performs a global sort of the query result set, which means all of the data passes through a single reducer. For large datasets, this step can take a very long time.

2、sort by

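sort by in Hive performs a local sort. It guarantees that the output of each reducer is sorted, though the combined output is not globally ordered. This can make a later global sort more efficient. Syntactically, the two differ only in the keyword: order in one case, sort in the other. You can sort on any columns you like, appending asc (the default) for ascending order or desc for descending order.

Before using sort by, set the number of reducers to more than 1 so that a local sort actually takes place. With only 1 reducer, sort by behaves like order by and produces a global sort.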
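The difference between the two can be sketched with a toy Python model. This is only an illustration of the semantics described above, not Hive's implementation; the round-robin placement of rows on reducers is an assumption made to keep the example short.

```python
# Toy model of ORDER BY vs SORT BY; not Hive's implementation.

def order_by(rows):
    """Global sort: every row passes through one reducer."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """Local sort: rows are spread over reducers, each sorted on its own."""
    # Round-robin placement is an assumption made only for illustration.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]

values = [5, 1, 4, 2, 3, 6]
print(order_by(values))    # [1, 2, 3, 4, 5, 6]
print(sort_by(values, 2))  # [[3, 4, 5], [1, 2, 6]]
print(sort_by(values, 1))  # [[1, 2, 3, 4, 5, 6]]
```

With two reducers each list is sorted, but the two lists overlap in range; with one reducer the result matches the global sort.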

3、distribute by

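distribute by controls how map output is divided among the reducers. All data moving through a MapReduce job is organized as key-value pairs, so Hive must use this mechanism internally when it converts a query into a MapReduce job. By default, the MapReduce framework computes a hash of each key emitted by the mappers and uses the hash values to spread the key-value pairs evenly across the reducers. Unfortunately, this means that with sort by alone, the outputs of different reducers overlap noticeably, at least as far as sort order is concerned, even though each reducer's output is itself sorted. If we want all rows for the same year to be processed together, we can use distribute by to guarantee that rows with the same year are sent to the same reducer, and then use sort by to order the rows as we wish.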
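A minimal Python sketch of "distribute by year, then sort by year and a second column" might look like the following. The `(year, temperature)` rows and the function name `distribute_sort` are hypothetical, and a plain modulo on the key stands in for Hive's hash partitioner.

```python
from collections import defaultdict

def distribute_sort(rows, num_reducers, dist_key, sort_key):
    """Route rows to reducers by dist_key, then sort inside each reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # Modulo on the key stands in for Hive's hash partitioner (assumption).
        reducers[dist_key(row) % num_reducers].append(row)
    return {r: sorted(rs, key=sort_key) for r, rs in sorted(reducers.items())}

# Hypothetical (year, temperature) rows.
rows = [(2009, 30), (2008, 25), (2009, 12), (2008, 31), (2009, 21)]
result = distribute_sort(rows, 2, dist_key=lambda r: r[0], sort_key=lambda r: r)
for reducer, out in result.items():
    print(reducer, out)
# 0 [(2008, 25), (2008, 31)]
# 1 [(2009, 12), (2009, 21), (2009, 30)]
```

Every row of a given year ends up on one reducer, and each reducer's output is sorted by year and then temperature.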

4、cluster by

In addition to doing what distribute by does, cluster by also sorts on the same column, so cluster by = distribute by + sort by.

e.g.: select * from table cluster by year;

is equivalent to: select * from table distribute by year sort by year;
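The equivalence can be modelled in a few lines of Python. This is a sketch under the same simplifying assumption as before (a modulo stands in for the hash partitioner), and the helper names are hypothetical. It also shows that one reducer may receive several distinct years, which the sort then groups together.

```python
def distribute_then_sort(rows, num_reducers, dist_key, sort_key):
    """distribute by dist_key, then sort by sort_key inside each reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        # Modulo stands in for the hash partitioner (assumption).
        buckets[dist_key(row) % num_reducers].append(row)
    return [sorted(bucket, key=sort_key) for bucket in buckets]

def cluster_by(rows, num_reducers, key):
    """cluster by key: the same key is used to distribute and to sort."""
    return distribute_then_sort(rows, num_reducers, key, key)

years = [2010, 2007, 2008, 2009, 2008, 2007]
print(cluster_by(years, 2, key=lambda y: y))
# [[2008, 2008, 2010], [2007, 2007, 2009]]
```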

----------------

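distribute by in Hive controls how data is split on the map side and handed to the reduce side.
Hive distributes rows according to the columns that follow distribute by and the number of reducers, using a hash algorithm by default.

When testing distribute by, be sure to run with multiple reducers; otherwise its effect cannot be seen.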

hive> select * from test09;
 OK
 100 tom
 200 mary
 300 kate
 400 tim
 Time taken: 0.061 seconds
hive> insert overwrite local directory '/home/hjl/sunwg/ooo' select * from test09 distribute by id;
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks not specified. Defaulting to jobconf value of: 2
 In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
 set mapred.reduce.tasks=<number>
 Starting Job = job_201105020924_0070, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0070
 Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0070
 2011-05-03 06:12:36,644 Stage-1 map = 0%, reduce = 0%
 2011-05-03 06:12:37,656 Stage-1 map = 50%, reduce = 0%
 2011-05-03 06:12:39,673 Stage-1 map = 100%, reduce = 0%
 2011-05-03 06:12:44,713 Stage-1 map = 100%, reduce = 50%
 2011-05-03 06:12:46,733 Stage-1 map = 100%, reduce = 100%
 Ended Job = job_201105020924_0070
 Copying data to local directory /home/hjl/sunwg/ooo
 Copying data to local directory /home/hjl/sunwg/ooo
 4 Rows loaded to /home/hjl/sunwg/ooo
 OK
 Time taken: 17.663 seconds

The first run distributes the rows by the id column. The result is as follows:

[hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000000_0
 400tim
 200mary
 [hjl@sunwg src]$ cat /home/hjl/sunwg/ooo/attempt_201105020924_0070_r_000001_0
 300kate
 100tom
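The placement above can be reproduced with a small Python sketch, assuming that id is stored as a string and that the partitioner applies a Java-style 31-multiplier hash followed by a modulo on the reducer count. Both points are assumptions that happen to match this transcript; they are not a statement about how every Hive version hashes its keys, and the row order inside each file is not modelled.

```python
def java_style_hash(text):
    """31-multiplier rolling hash over the bytes, kept within 32 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h

def reducer_for(text, num_reducers):
    # Clear the sign bit, then take the remainder by the reducer count.
    return (java_style_hash(text) & 0x7FFFFFFF) % num_reducers

for row_id in ["100", "200", "300", "400"]:
    print(row_id, reducer_for(row_id, 2))
# 100 1
# 200 0
# 300 1
# 400 0
```

Under these assumptions, 200 and 400 go to reducer 0 and 100 and 300 go to reducer 1, matching the two output files.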

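This time we distribute differently, using the result of length(id). Because the id values of these records all have the same length, the records should all be sent to the same reducer.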

hive> insert overwrite local directory '/home/hjl/sunwg/lll' select * from test09 distribute by length(id);
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks not specified. Defaulting to jobconf value of: 2
 In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
 set mapred.reduce.tasks=<number>
 Starting Job = job_201105020924_0071, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0071
 Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0071
 2011-05-03 06:15:21,430 Stage-1 map = 0%, reduce = 0%
 2011-05-03 06:15:24,454 Stage-1 map = 100%, reduce = 0%
 2011-05-03 06:15:31,509 Stage-1 map = 100%, reduce = 50%
 2011-05-03 06:15:34,539 Stage-1 map = 100%, reduce = 100%
 Ended Job = job_201105020924_0071
 Copying data to local directory /home/hjl/sunwg/lll
 Copying data to local directory /home/hjl/sunwg/lll
 4 Rows loaded to /home/hjl/sunwg/lll
 OK
 Time taken: 20.632 seconds

Now let's check whether the result matches our expectation:

[hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000000_0
 [hjl@sunwg src]$ cat /home/hjl/sunwg/lll/attempt_201105020924_0071_r_000001_0
 100tom
 200mary
 300kate
 400tim

The file attempt_201105020924_0071_r_000000_0 contains no records; all of the records are in attempt_201105020924_0071_r_000001_0.
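Why everything lands on reducer 1 rather than reducer 0 can be sketched in Python, assuming that an integer key hashes to its own value before the modulo is taken. That is an assumption consistent with this output, not a guarantee about Hive internals.

```python
def reducer_for_int(value, num_reducers):
    # Assumption: an integer key hashes to itself before the modulo.
    return (value & 0x7FFFFFFF) % num_reducers

ids = ["100", "200", "300", "400"]
# length(id) is 3 for every row, so every row gets the same reducer.
placement = {row_id: reducer_for_int(len(row_id), 2) for row_id in ids}
print(placement)  # {'100': 1, '200': 1, '300': 1, '400': 1}
```

Since 3 % 2 is 1, all four rows map to reducer 1 and reducer 0 receives nothing.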
