by hive插入使用order hive order by desc
转载
- order by会对输入做全局排序,因此只有一个Reducer(多个Reducer无法保证全局有序),然而只有一个Reducer,会导致当输入规模较大时,消耗较长的计算时间。关于order by的详细介绍请参考这篇文章:Hive Order by操作。
- DESC 是descend 降序,asc 是ascend 升序
- sort by不是全局排序,其在数据进入reducer前完成排序,是一个局部排序的过程,可以保证每个reduce的输出数据都是有序的(但并非全局有效)。这样就可以提高后面进行的全局排序的效率了。
- 如果用sort by进行排序,并且设置mapred.reduce.tasks>1,则sort by只会保证每个reducer的输出有序,并不保证全局有序。sort by不同于order by,它不受hive.mapred.mode属性的影响,sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序。使用sort by你可以指定执行的reduce个数(通过set mapred.reduce.tasks=n来指定),对输出的数据再执行归并排序,即可得到全部结果。
- distribute by 控制 map的输出在reduer中是如何划分的,mapreduce job 中传输的所有数据都是按照键-值对的方式进行组织的,因此hive在将用户的查询语句转换成mapreduce job时,其必须在内部使用这个功能。默认情况下,MapReduce计算框架会依据map输入的键计算相应的哈希值,然后按照得到的哈希值将键-值对均匀分发到多个reducer中去,不过不幸的是,这也是意味着当我们使用sort by 时,不同reducer的输出内容会有明显的重叠,至少对于排序顺序而已只这样,即使每个reducer的输出的数据都有序的。如果我们想让同一年的数据一起处理,那么就可以使用distribute by 来保证具有相同年份的数据分发到同一个reducer中进行处理,然后使用sort by 来安装我们的期望对数据进行排序:
- distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列,对应reduce的个数进行分发,默认是采用hash算法。sort by为每个reduce产生一个排序文件。在有些情况下,你需要控制某个特定行应该到哪个reducer,这通常是为了进行后续的聚集操作。distribute by刚好可以做这件事。因此,distribute by经常和sort by配合使用。
- 注:Distribute by和sort by的使用场景
- 1.Map输出的文件大小不均。
- 2.Reduce输出文件大小不均。
- 3.小文件过多。
- 4.文件超大。
- cluster by除了有distribute by的功能外还会对该字段进行排序兼具sort by的功能,相当于distribute by+sort by。但是排序只能是倒序排序,不能指定排序规则为ASC或者DESC。
- eg:select * from table cluster by year;
等价于:select * from table distribute by year sort by year
Hive SQL 执行顺序:
- 1. FROM:A Cartesian product (cross join) is performed between the first two tables in the FROM clause, and as a result, virtual table VT1 is generated.
- 2. ON: The ON filter is applied to VT1. Only rows for which the <join_condition>is TRUE are inserted to VT2.
- 3. OUTER(join): If an OUTER JOIN is specified(as opposed to a CROSS JOIN or an INNER JOIN), rows from the preserved table or tables for which a match was not found are added to the rows from VT2 as outer rows, generating VT3. If more than two tables appear in the FROM clause, steps 1 through 3 are applied repeatedly between the result of the last join and the next table in the FROM clause until all tables are processed.
- 4. WHERE: The WHERE filter is applied to VT3. Only rows for which the <where _condition>is TRUE are inserted to VT4.
- 5. GROUP BY: The rows from VT4 are arranged in groups based on the column list specified in the GROUP BY clause. VT5 is generated.
- 6. CUBE | ROLLUP: Supergroups(groups of groups) are added to the rows from VT5, generating VT6.
- 7. HAVING: The HAVING filter is applied to VT6. Only groups for which the <having_condition>is TRUE are inserted to VT7.
- 8. SELECT: The SELECT list is processed, generating VT8.
- 9. DISTINCT: Duplicate rows are removed from VT8. VT9 is generated.
- 10.ORDER BY: The rows from VT9 are sorted according to the column list specified in the ORDER BY clause.A cursor is generated(VC10).
- 11.TOP: The specified number or percentage of rows is selected from the beginning of VC10. Table VT11 is generated and returned to the caller.
本文章为转载内容,我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题,欢迎原作者联系我们进行内容更正或删除文章。