hive中窗口含税排序

转载

mob64ca141a2a87 2024-07-10 22:17:14

文章标签 hive中窗口含税排序窗口函数聚合函数数据 文章分类 Hive 大数据

文章目录

前言
一、order by（全局排序）

1.1 应用order by
1.2 按照位置编号

二、sort by （对reducer中数据排序）

2.1 查看reduce个数
2.2 案例

三、distribute by
四、cluster by
五、group by
六、having与where不同点
七、聚合函数
八、窗口函数
九、聚合函数

9.1 count
9.2 sum
9.3 avg

十、窗口函数-分析

10.1 cume_dist

十一、窗口字句

前言

扩展：

--设置日志
set hive.server2.logging.operation.level=NONE   ---无日志
set hive.server2.logging.operation.level=EXECUTION   --有日志

一、order by（全局排序）

order by 从英文里理解就是行的排序方式，默认的为升序。 order by 后面必须列出排序的字段名，可以是多个字段名。
group by 从英文里理解就是分组。必须有“聚合函数”来配合才能使用，使用时至少需要一个分组标志字段。
只使用一个Reducer执行全局数据排序
速度慢,应提前做好数据过滤
支持使用CASE WHEN或表达式
支持按位置编号排序 set hive.groupby.orderby.position.alias=true;

1.1 应用order by

select name,dept_num,employee_id,salary
from employee_contract
order by salary desc limit 5;

hive中窗口含税排序_hive中窗口含税排序

1.2 按照位置编号

使用之前我们可以查看按编号排序开没开启

set hive.groupby.orderby.position.alias;

hive中窗口含税排序_聚合函数_02

编号排序案例：

set hive.groupby.orderby.position.alias=true;

set hive.groupby.orderby.position.alias=true;
select 
name,
dept_num,
employee_id,
salary
from employee_contract
order by 4 desc;
--这里的4指的是按照第4列排序

hive中窗口含税排序_hive中窗口含税排序_03

二、sort by （对reducer中数据排序）

SORT BY对每个Reducer中的数据进行排序
当Reducer数量设置为1时，等于ORDER BY
sort by 对每一个reduce 内部数据进行排序，全局结果集来说不是排序
set mapreduce.job.reuces = 3 ;
会生成3个文件，每个文件内都是排好序的。

2.1 查看reduce个数

使用之前，可以查看reduce有几个
默认-1，只会开一个

set mapreduce.job.reduces;

hive中窗口含税排序_聚合函数_04

2.2 案例

select 
name,
dept_num,
employee_id,
salary
from 
employee_contract
sort by salary desc;

hive中窗口含税排序_数据_05

设置reduce=2的时候

set mapreduce.job.reduces=2;

select 
name,
dept_num,
employee_id,
salary
from 
employee_contract
sort by salary desc;
--在每个reduce里有序

可以看出，数据被分成2个部分，分别降序

hive中窗口含税排序_数据_06

三、distribute by

distribute by 类似于sql中的groupby
一般搭配sort by

--distribute by + sort by

select 
name,
dept_num,
employee_id,
salary
from employee_contract
distribute by dept_num sort by salary;

hive中窗口含税排序_窗口函数_07

四、cluster by

---CLUSTER BY   只能针对同一个字段 按照salary分区，也按照salary排序

set mapreduce.job.reduces=2;
select 
name,
dept_num,
employee_id,
salary
from employee_contract
cluster by salary;
//等同于distribute by dept_num sort by dept_num;

hive中窗口含税排序_hive中窗口含税排序_08

五、group by

---group by
---每个部门所有员工的薪资总和 
select 
dept_num,
name,
sum(salary)
from employee_contract
group by dept_num,name;


select salary,sum(salary)
from employee_contract
group by salary if(salary>5000,'good','bad');

六、having与where不同点

where针对表中的列发挥作用，查询数据；having针对查询结果中的列发挥作用，筛选数据
where后面不能写聚合函数，而having后面可以使用聚合函数。
having只用于group by分组统计语句

七、聚合函数

max, min, count, sum, avg
max(distinct col1)、avg(col2)等
collect_set, collect_list：返回每个组列中的对象集/列表

八、窗口函数

8.1 窗口函数是什么鬼？

窗口函数指定了函数工作的数据窗口大小(当前行的上下多少行)，这个数据窗口大小可能会随着行的变化而变化。

8.2 窗口函数和聚合函数区别？

窗口函数对于每个组返回多行，组内每一行对应返回一行值。
聚合函数对于每个组只返回一行

ROW_NUMBER()
对所有数值输出不同的序号，序号唯一连续
RANK()
对相同数值，输出相同的序号，下一个序号跳过（1,1,3）
DENSE_RANK()
对相同数值，输出相同的序号，下一个序号连续（1,1,2）

NLITE(n)
将有序的数据集合平均分配到n个桶中, 将桶号分配给每一行，根据桶号，选取前或后 n分之几的数据
PERCENT_RANK()
(目前排名- 1)/(总行数- 1)，值相对于一组值的百分比排名

8.3 案例

--row_number()  (1,2,3)
--给所有员工薪资排序
--从低到高，针对所有员工
// 每个窗口函数后需要命名

select
name,dept_num,salary,
row_number() over(order by salary) as rn
from employee_contract;

hive中窗口含税排序_数据_09

--rank()    (1,1,3)

select 
name,dept_num,salary
rank() over(order by salary) as rn
from employee_contract

hive中窗口含税排序_hive中窗口含税排序_10

--dense_rank()  (1,1,2,3,3,4)

select 
name,dept_num,salary,
dense_rank() over(order by salary) as rn
from employee_contract

hive中窗口含税排序_hive中窗口含税排序_11

--按照部门，对每个部门的薪资排序
---按部门分组  薪资排序
select
name,dept_num,salary,
row_number() over(partition by dept_num order by salary) as rn
from employee_contract;

hive中窗口含税排序_窗口函数_12

--按部门分组，获取每个部门薪资最低的员工(考虑重复)
子查询  查询结果为一张表    
分组求top n
没有分组
select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(order by salary) as rn 
from employee_contract
)t
where t.rn=1;

hive中窗口含税排序_数据_13

分组的情况

select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(partition by dept_num order by salary) as rn 
from employee_contract
)t
where t.rn=1;

hive中窗口含税排序_hive中窗口含税排序_14

查看内存的命令

free -lh

设置本地模式

set hive.exec.mode.local.auto

hive中窗口含税排序_hive中窗口含税排序_15

---Ntile(n)
将有序的数据集合平均分配到n个桶中,
将桶号分配给每一行，根据桶号，
选取前或后 n分之几的数据

select
name,dept_num,salary,
ntile(2) over(partition by dept_num order by salary) as nlite
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_16

--percent_rank
(目前排名- 1)/(总行数- 1)，值相对于一组值的百分比排名

select
name,dept_num,salary,
percent_rank() over(order by salary) as pr
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_17

综合

select 
name, dept_num, salary, 
ROW_NUMBER() OVER () AS row_num, 
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, 
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank, 
NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile 
FROM employee_contract 
ORDER BY dept_num, salary;

hive中窗口含税排序_窗口函数_18

九、聚合函数

9.1 count

select name,dept_num,salary,
count(*) over(partition by dept_num) rc
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_19

9.2 sum

select name,dept_num,salary,
sum(salary) over(partition by dept_num order by salary) sum
from employee_contract;

hive中窗口含税排序_数据_20

select name,dept_num,salary,
sum(salary) over(order by salary) sum
from employee_contract;
相同序号的会看成一个整体
重复的当成一个整体都加上去了

hive中窗口含税排序_数据_21

9.3 avg

select name,dept_num,salary,
avg(salary) over(partition by dept_num) avgdept,
max(salary) over(partition by dept_num) maxdept,
min(salary) over(partition by dept_num) mindept
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_22

十、窗口函数-分析

10.1 cume_dist

cume_dist: 小于等于当前值的行数/分组内总行数
lead／lag(col列名字,n位置):某一列进行往前/后第n行值(n可选，默认为1)
first_value:对该列到目前为止的首个值
last_value:到目前行为止的最后一个值

案例：lead/lag(col,n)

select name,dept_num,salary,
lead(salary,2) over(partition by dept_num order by salary) lead
from employee_contract;
往下调2行

hive中窗口含税排序_数据_23

select name,dept_num,salary,
lead(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;
往下调一行

hive中窗口含税排序_窗口函数_24

--lag(往前找) 看上一行结果
select name,dept_num,salary,
lag(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;

hive中窗口含税排序_窗口函数_25

--first_value
select name,dept_num,salary,
first_value(salary) over(partition by dept_num order by salary) first_value
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_26

----last_value
--到当前行最后一个值就是本身
select name,dept_num,salary,
last_value(salary) over(partition by dept_num order by salary) last_value
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_27

--cume_dist
-- 小于等于当前值的行数/分组内总行数
select name,dept_num,salary,
cume_dist() over(partition by dept_num order by salary) cd
from employee_contract;

hive中窗口含税排序_数据_28

十一、窗口字句

rows between
undounded preceding :分组第一行到当前行
current row :当前行
n following ：往后n行
undounded following :分组最后一行

--案例：

SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS 2 PRECEDING) win12
FROM employee_contract  ORDER BY dept, name;

运行部分案例：

SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3
FROM employee_contract  ORDER BY dept, name;

hive中窗口含税排序_聚合函数_29

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：子查询返回的值有索引吗

下一篇：本地电脑怎么部署ovf虚拟机虚拟机搭建本地服务器

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯