不同类型比较
不同类型的数字float double做比较,要注意0.2float大于0.2double
可以cat(0.2 as float)
order by and sort by
hive的order by是全部数据的排序,在一个reduce中处理排序,默认升序。效率比较低,通常跟limit一起用
可以用hive.mapred.mode=strict来强制,order后跟着limit
sort by是在每个reduce中,进行排序,
(是否在一个reduce中,由groupcomparator决定,如果没有就是key的compare,在hive中UDAF在控制mapper的key输出到reducer上,普通的compact是hash分布到key上,或是自定义compact的hash均匀分布ketama算法)
distribut by是让相同的key归到同一个reducer中,这样sort可以进行reducer中的排序
(注意 sort需要放在distribute前边)
cluster
cluster by是一种distribute和sort的简写,让按照clauses中语句分组,并且按照其他字段排序
cast函数
cast类型转换函数,当string不符合条件,则为unknown data为null
可以嵌套cast(cast(cast(binary as string)as double))
将float转换为int 可以用round floor
取样查询 sample
rand随机
select * from numbers tablesample(bucket 3 out of 10 on rand()) s;
column随机:以一个column,这样会在多个线程里跑multiple runs
select * from numbers tablesample(bucket 3 out of 10 on number) s;
block随机: 另一个取样函数(block sampling 当表小于block size128mb,则全部rows返回)
有个hive.sample.seednumber来控制seed information for block based
select * from numbersflat tablesample(0.1 percent) s;
使用number做sample,用以下语句,让sample按照sample存储,可以只访问hash file中需要的bucket
create table numbers_bucketed(number int) clustered by (number) into 3 buckets
set hive.enforce.bucketing = true;
union
select * from table1
union all
select * from table2
from(
from src select src.key , src.value where src.key < 100
union all
from src select src.* where src.key >100
) unioninput
insert overwrite directory '/tmp/union.out' select unioninput.*