by hive插入使用order hive order by desc

转载

代码工匠大师 2023-07-04 19:42:38

文章标签 by hive插入使用order 数据 sed hive 文章分类 Hive 大数据

一、order by：

order by会对输入做全局排序，因此只有一个Reducer(多个Reducer无法保证全局有序)，然而只有一个Reducer，会导致当输入规模较大时，消耗较长的计算时间。关于order by的详细介绍请参考这篇文章：Hive Order by操作。
DESC 是descend 降序，asc 是ascend 升序

by hive插入使用order hive order by desc_sed

二：sort by：

sort by不是全局排序，其在数据进入reducer前完成排序，是一个局部排序的过程，可以保证每个reduce的输出数据都是有序的(但并非全局有效)。这样就可以提高后面进行的全局排序的效率了。
如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只会保证每个reducer的输出有序，并不保证全局有序。sort by不同于order by，它不受hive.mapred.mode属性的影响，sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序。使用sort by你可以指定执行的reduce个数(通过set mapred.reduce.tasks=n来指定)，对输出的数据再执行归并排序，即可得到全部结果。

by hive插入使用order hive order by desc_hive_02

三：distribute by：

distribute by 控制 map的输出在reduer中是如何划分的，mapreduce job 中传输的所有数据都是按照键-值对的方式进行组织的，因此hive在将用户的查询语句转换成mapreduce job时，其必须在内部使用这个功能。默认情况下，MapReduce计算框架会依据map输入的键计算相应的哈希值，然后按照得到的哈希值将键-值对均匀分发到多个reducer中去，不过不幸的是，这也是意味着当我们使用sort by 时，不同reducer的输出内容会有明显的重叠，至少对于排序顺序而已只这样，即使每个reducer的输出的数据都有序的。如果我们想让同一年的数据一起处理，那么就可以使用distribute by 来保证具有相同年份的数据分发到同一个reducer中进行处理，然后使用sort by 来安装我们的期望对数据进行排序:
distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列，对应reduce的个数进行分发，默认是采用hash算法。sort by为每个reduce产生一个排序文件。在有些情况下，你需要控制某个特定行应该到哪个reducer，这通常是为了进行后续的聚集操作。distribute by刚好可以做这件事。因此，distribute by经常和sort by配合使用。

注：Distribute by和sort by的使用场景
1.Map输出的文件大小不均。
2.Reduce输出文件大小不均。
3.小文件过多。
4.文件超大。

by hive插入使用order hive order by desc_hive_03

四：cluster by：

cluster by除了有distribute by的功能外还会对该字段进行排序兼具sort by的功能，相当于distribute by+sort by。但是排序只能是倒序排序，不能指定排序规则为ASC或者DESC。
eg：select * from table cluster by year;
等价于：select * from table distribute by year sort by year

by hive插入使用order hive order by desc_sed_04

Hive SQL 执行顺序：

1. FROM:A Cartesian product (cross join) is performed between the first two tables in the FROM clause, and as a result, virtual table VT1 is generated.
2. ON: The ON filter is applied to VT1. Only rows for which the <join_condition>is TRUE are inserted to VT2.
3. OUTER(join): If an OUTER JOIN is specified(as opposed to a CROSS JOIN or an INNER JOIN), rows from the preserved table or tables for which a match was not found are added to the rows from VT2 as outer rows, generating VT3. If more than two tables appear in the FROM clause, steps 1 through 3 are applied repeatedly between the result of the last join and the next table in the FROM clause until all tables are processed.
4. WHERE: The WHERE filter is applied to VT3. Only rows for which the <where _condition>is TRUE are inserted to VT4.
5. GROUP BY: The rows from VT4 are arranged in groups based on the column list specified in the GROUP BY clause. VT5 is generated.
6. CUBE | ROLLUP: Supergroups(groups of groups) are added to the rows from VT5, generating VT6.
7. HAVING: The HAVING filter is applied to VT6. Only groups for which the <having_condition>is TRUE are inserted to VT7.
8. SELECT: The SELECT list is processed, generating VT8.
9. DISTINCT: Duplicate rows are removed from VT8. VT9 is generated.
10.ORDER BY: The rows from VT9 are sorted according to the column list specified in the ORDER BY clause.A cursor is generated(VC10).
11.TOP: The specified number or percentage of rows is selected from the beginning of VC10. Table VT11 is generated and returned to the caller.