文章目录
- 前言
- 一、order by(全局排序)
- 1.1 应用order by
- 1.2 按照位置编号
- 二、sort by (对reducer中数据排序)
- 2.1 查看reduce个数
- 2.2 案例
- 三、distribute by
- 四、cluster by
- 五、group by
- 六、having与where不同点
- 七、聚合函数
- 八、窗口函数
- 九、聚合函数
- 9.1 count
- 9.2 sum
- 9.3 avg
- 十、窗口函数-分析
- 10.1 cume_dist
- 十一、窗口字句
前言
扩展:
--设置日志
set hive.server2.logging.operation.level=NONE ---无日志
set hive.server2.logging.operation.level=EXECUTION --有日志
一、order by(全局排序)
- order by 从英文里理解就是行的排序方式,默认的为升序。 order by 后面必须列出排序的字段名,可以是多个字段名。
- group by 从英文里理解就是分组。必须有“聚合函数”来配合才能使用,使用时至少需要一个分组标志字段。
- 只使用一个Reducer执行全局数据排序
- 速度慢,应提前做好数据过滤
- 支持使用CASE WHEN或表达式
- 支持按位置编号排序 set hive.groupby.orderby.position.alias=true;
1.1 应用order by
select name,dept_num,employee_id,salary
from employee_contract
order by salary desc limit 5;
1.2 按照位置编号
使用之前我们可以查看按编号排序开没开启
set hive.groupby.orderby.position.alias;
编号排序案例:
set hive.groupby.orderby.position.alias=true;
set hive.groupby.orderby.position.alias=true;
select
name,
dept_num,
employee_id,
salary
from employee_contract
order by 4 desc;
--这里的4指的是按照第4列排序
二、sort by (对reducer中数据排序)
SORT BY对每个Reducer中的数据进行排序
当Reducer数量设置为1时,等于ORDER BY
sort by 对每一个reduce 内部数据进行排序,全局结果集来说不是排序
set mapreduce.job.reuces = 3 ;
会生成3个文件,每个文件内都是排好序的。
2.1 查看reduce个数
使用之前,可以查看reduce有几个
默认-1,只会开一个
set mapreduce.job.reduces;
2.2 案例
select
name,
dept_num,
employee_id,
salary
from
employee_contract
sort by salary desc;
设置reduce=2的时候
set mapreduce.job.reduces=2;
select
name,
dept_num,
employee_id,
salary
from
employee_contract
sort by salary desc;
--在每个reduce里有序
可以看出,数据被分成2个部分,分别降序
三、distribute by
distribute by 类似于sql中的groupby
一般搭配sort by
--distribute by + sort by
select
name,
dept_num,
employee_id,
salary
from employee_contract
distribute by dept_num sort by salary;
四、cluster by
---CLUSTER BY 只能针对同一个字段 按照salary分区,也按照salary排序
set mapreduce.job.reduces=2;
select
name,
dept_num,
employee_id,
salary
from employee_contract
cluster by salary;
//等同于distribute by dept_num sort by dept_num;
五、group by
---group by
---每个部门所有员工的薪资总和
select
dept_num,
name,
sum(salary)
from employee_contract
group by dept_num,name;
select salary,sum(salary)
from employee_contract
group by salary if(salary>5000,'good','bad');
六、having与where不同点
- where针对表中的列发挥作用,查询数据;having针对查询结果中的列发挥作用,筛选数据
- where后面不能写聚合函数,而having后面可以使用聚合函数。
- having只用于group by分组统计语句
七、聚合函数
max, min, count, sum, avg
max(distinct col1)、avg(col2)等
collect_set, collect_list:返回每个组列中的对象集/列表
八、窗口函数
8.1 窗口函数是什么鬼?
- 窗口函数指定了函数工作的数据窗口大小(当前行的上下多少行),这个数据窗口大小可能会随着行的变化而变化。
8.2 窗口函数和聚合函数区别?
- 窗口函数对于每个组返回多行,组内每一行对应返回一行值。
- 聚合函数对于每个组只返回一行
ROW_NUMBER()
对所有数值输出不同的序号,序号唯一连续
RANK()
对相同数值,输出相同的序号,下一个序号跳过(1,1,3)
DENSE_RANK()
对相同数值,输出相同的序号,下一个序号连续(1,1,2)
NLITE(n)
将有序的数据集合平均分配到n个桶中, 将桶号分配给每一行,根据桶号,选取前或后 n分之几的数据
PERCENT_RANK()
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
8.3 案例
--row_number() (1,2,3)
--给所有员工薪资排序
--从低到高,针对所有员工
// 每个窗口函数后需要命名
select
name,dept_num,salary,
row_number() over(order by salary) as rn
from employee_contract;
--rank() (1,1,3)
select
name,dept_num,salary
rank() over(order by salary) as rn
from employee_contract
--dense_rank() (1,1,2,3,3,4)
select
name,dept_num,salary,
dense_rank() over(order by salary) as rn
from employee_contract
--按照部门,对每个部门的薪资排序
---按部门分组 薪资排序
select
name,dept_num,salary,
row_number() over(partition by dept_num order by salary) as rn
from employee_contract;
--按部门分组,获取每个部门薪资最低的员工(考虑重复)
子查询 查询结果为一张表
分组求top n
没有分组
select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(order by salary) as rn
from employee_contract
)t
where t.rn=1;
分组的情况
select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(partition by dept_num order by salary) as rn
from employee_contract
)t
where t.rn=1;
查看内存的命令
free -lh
设置本地模式
set hive.exec.mode.local.auto
---Ntile(n)
将有序的数据集合平均分配到n个桶中,
将桶号分配给每一行,根据桶号,
选取前或后 n分之几的数据
select
name,dept_num,salary,
ntile(2) over(partition by dept_num order by salary) as nlite
from employee_contract;
--percent_rank
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
select
name,dept_num,salary,
percent_rank() over(order by salary) as pr
from employee_contract;
综合
select
name, dept_num, salary,
ROW_NUMBER() OVER () AS row_num,
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank,
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank,
NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile
FROM employee_contract
ORDER BY dept_num, salary;
九、聚合函数
9.1 count
select name,dept_num,salary,
count(*) over(partition by dept_num) rc
from employee_contract;
9.2 sum
select name,dept_num,salary,
sum(salary) over(partition by dept_num order by salary) sum
from employee_contract;
select name,dept_num,salary,
sum(salary) over(order by salary) sum
from employee_contract;
相同序号的会看成一个整体
重复的当成一个整体都加上去了
9.3 avg
select name,dept_num,salary,
avg(salary) over(partition by dept_num) avgdept,
max(salary) over(partition by dept_num) maxdept,
min(salary) over(partition by dept_num) mindept
from employee_contract;
十、窗口函数-分析
10.1 cume_dist
- cume_dist: 小于等于当前值的行数/分组内总行数
- lead/lag(col列名字,n位置):某一列进行往前/后第n行值(n可选,默认为1)
- first_value:对该列到目前为止的首个值
- last_value:到目前行为止的最后一个值
案例:lead/lag(col,n)
select name,dept_num,salary,
lead(salary,2) over(partition by dept_num order by salary) lead
from employee_contract;
往下调2行
select name,dept_num,salary,
lead(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;
往下调一行
--lag(往前找) 看上一行结果
select name,dept_num,salary,
lag(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;
--first_value
select name,dept_num,salary,
first_value(salary) over(partition by dept_num order by salary) first_value
from employee_contract;
----last_value
--到当前行最后一个值就是本身
select name,dept_num,salary,
last_value(salary) over(partition by dept_num order by salary) last_value
from employee_contract;
--cume_dist
-- 小于等于当前值的行数/分组内总行数
select name,dept_num,salary,
cume_dist() over(partition by dept_num order by salary) cd
from employee_contract;
十一、窗口字句
- rows between
- undounded preceding :分组第一行到当前行
- current row :当前行
- n following :往后n行
- undounded following :分组最后一行
--案例:
SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS 2 PRECEDING) win12
FROM employee_contract ORDER BY dept, name;
运行部分案例:
SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3
FROM employee_contract ORDER BY dept, name;