文章目录

  • 前言
  • 一、order by(全局排序)
  • 1.1 应用order by
  • 1.2 按照位置编号
  • 二、sort by (对reducer中数据排序)
  • 2.1 查看reduce个数
  • 2.2 案例
  • 三、distribute by
  • 四、cluster by
  • 五、group by
  • 六、having与where不同点
  • 七、聚合函数
  • 八、窗口函数
  • 九、聚合函数
  • 9.1 count
  • 9.2 sum
  • 9.3 avg
  • 十、窗口函数-分析
  • 10.1 cume_dist
  • 十一、窗口字句


前言

扩展:

--设置日志
set hive.server2.logging.operation.level=NONE   ---无日志
set hive.server2.logging.operation.level=EXECUTION   --有日志

一、order by(全局排序)

  1. order by 从英文里理解就是行的排序方式,默认的为升序。 order by 后面必须列出排序的字段名,可以是多个字段名。
  2. group by 从英文里理解就是分组。必须有“聚合函数”来配合才能使用,使用时至少需要一个分组标志字段。
  3. 只使用一个Reducer执行全局数据排序
  4. 速度慢,应提前做好数据过滤
  5. 支持使用CASE WHEN或表达式
  6. 支持按位置编号排序 set hive.groupby.orderby.position.alias=true;

1.1 应用order by

select name,dept_num,employee_id,salary
from employee_contract
order by salary desc limit 5;

hive中窗口含税排序_hive中窗口含税排序

1.2 按照位置编号

使用之前我们可以查看按编号排序开没开启

set hive.groupby.orderby.position.alias;

hive中窗口含税排序_聚合函数_02


编号排序案例:

set hive.groupby.orderby.position.alias=true;

set hive.groupby.orderby.position.alias=true;
select 
name,
dept_num,
employee_id,
salary
from employee_contract
order by 4 desc;
--这里的4指的是按照第4列排序

hive中窗口含税排序_hive中窗口含税排序_03

二、sort by (对reducer中数据排序)

SORT BY对每个Reducer中的数据进行排序
当Reducer数量设置为1时,等于ORDER BY
sort by 对每一个reduce 内部数据进行排序,全局结果集来说不是排序
set mapreduce.job.reuces = 3 ;
会生成3个文件,每个文件内都是排好序的。

2.1 查看reduce个数

使用之前,可以查看reduce有几个
默认-1,只会开一个

set mapreduce.job.reduces;

hive中窗口含税排序_聚合函数_04

2.2 案例

select 
name,
dept_num,
employee_id,
salary
from 
employee_contract
sort by salary desc;

hive中窗口含税排序_数据_05


设置reduce=2的时候

set mapreduce.job.reduces=2;

select 
name,
dept_num,
employee_id,
salary
from 
employee_contract
sort by salary desc;
--在每个reduce里有序

可以看出,数据被分成2个部分,分别降序

hive中窗口含税排序_数据_06

三、distribute by

distribute by 类似于sql中的groupby
一般搭配sort by

--distribute by + sort by

select 
name,
dept_num,
employee_id,
salary
from employee_contract
distribute by dept_num sort by salary;

hive中窗口含税排序_窗口函数_07

四、cluster by

---CLUSTER BY   只能针对同一个字段 按照salary分区,也按照salary排序

set mapreduce.job.reduces=2;
select 
name,
dept_num,
employee_id,
salary
from employee_contract
cluster by salary;
//等同于distribute by dept_num sort by dept_num;

hive中窗口含税排序_hive中窗口含税排序_08

五、group by

---group by
---每个部门所有员工的薪资总和 
select 
dept_num,
name,
sum(salary)
from employee_contract
group by dept_num,name;


select salary,sum(salary)
from employee_contract
group by salary if(salary>5000,'good','bad');

六、having与where不同点

  1. where针对表中的列发挥作用,查询数据;having针对查询结果中的列发挥作用,筛选数据
  2. where后面不能写聚合函数,而having后面可以使用聚合函数。
  3. having只用于group by分组统计语句

七、聚合函数

max, min, count, sum, avg
max(distinct col1)、avg(col2)等
collect_set, collect_list:返回每个组列中的对象集/列表

八、窗口函数

8.1 窗口函数是什么鬼?

  1. 窗口函数指定了函数工作的数据窗口大小(当前行的上下多少行),这个数据窗口大小可能会随着行的变化而变化。

8.2 窗口函数和聚合函数区别?

  1. 窗口函数对于每个组返回多行,组内每一行对应返回一行值。
  2. 聚合函数对于每个组只返回一行
ROW_NUMBER()
对所有数值输出不同的序号,序号唯一连续
RANK()
对相同数值,输出相同的序号,下一个序号跳过(1,1,3)
DENSE_RANK()
对相同数值,输出相同的序号,下一个序号连续(1,1,2)

NLITE(n)
将有序的数据集合平均分配到n个桶中, 将桶号分配给每一行,根据桶号,选取前或后 n分之几的数据
PERCENT_RANK()
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名

8.3 案例

--row_number()  (1,2,3)
--给所有员工薪资排序
--从低到高,针对所有员工
// 每个窗口函数后需要命名

select
name,dept_num,salary,
row_number() over(order by salary) as rn
from employee_contract;

hive中窗口含税排序_数据_09

--rank()    (1,1,3)

select 
name,dept_num,salary
rank() over(order by salary) as rn
from employee_contract

hive中窗口含税排序_hive中窗口含税排序_10

--dense_rank()  (1,1,2,3,3,4)

select 
name,dept_num,salary,
dense_rank() over(order by salary) as rn
from employee_contract

hive中窗口含税排序_hive中窗口含税排序_11

--按照部门,对每个部门的薪资排序
---按部门分组  薪资排序
select
name,dept_num,salary,
row_number() over(partition by dept_num order by salary) as rn
from employee_contract;

hive中窗口含税排序_窗口函数_12

--按部门分组,获取每个部门薪资最低的员工(考虑重复)
子查询  查询结果为一张表    
分组求top n
没有分组
select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(order by salary) as rn 
from employee_contract
)t
where t.rn=1;

hive中窗口含税排序_数据_13

分组的情况

select name,dept_num,salary
from(
select
name,dept_num,salary,
dense_rank() over(partition by dept_num order by salary) as rn 
from employee_contract
)t
where t.rn=1;

hive中窗口含税排序_hive中窗口含税排序_14

查看内存的命令

free -lh

设置本地模式

set hive.exec.mode.local.auto

hive中窗口含税排序_hive中窗口含税排序_15

---Ntile(n)
将有序的数据集合平均分配到n个桶中,
将桶号分配给每一行,根据桶号,
选取前或后 n分之几的数据
select
name,dept_num,salary,
ntile(2) over(partition by dept_num order by salary) as nlite
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_16

--percent_rank
(目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名

select
name,dept_num,salary,
percent_rank() over(order by salary) as pr
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_17

综合

select 
name, dept_num, salary, 
ROW_NUMBER() OVER () AS row_num, 
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, 
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank, 
NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile 
FROM employee_contract 
ORDER BY dept_num, salary;

hive中窗口含税排序_窗口函数_18

九、聚合函数

9.1 count

select name,dept_num,salary,
count(*) over(partition by dept_num) rc
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_19

9.2 sum

select name,dept_num,salary,
sum(salary) over(partition by dept_num order by salary) sum
from employee_contract;

hive中窗口含税排序_数据_20

select name,dept_num,salary,
sum(salary) over(order by salary) sum
from employee_contract;
相同序号的会看成一个整体
重复的当成一个整体都加上去了

hive中窗口含税排序_数据_21

9.3 avg

select name,dept_num,salary,
avg(salary) over(partition by dept_num) avgdept,
max(salary) over(partition by dept_num) maxdept,
min(salary) over(partition by dept_num) mindept
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_22

十、窗口函数-分析

10.1 cume_dist

  1. cume_dist: 小于等于当前值的行数/分组内总行数
  2. lead/lag(col列名字,n位置):某一列进行往前/后第n行值(n可选,默认为1)
  3. first_value:对该列到目前为止的首个值
  4. last_value:到目前行为止的最后一个值

案例:lead/lag(col,n)

select name,dept_num,salary,
lead(salary,2) over(partition by dept_num order by salary) lead
from employee_contract;
往下调2行

hive中窗口含税排序_数据_23

select name,dept_num,salary,
lead(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;
往下调一行

hive中窗口含税排序_窗口函数_24

--lag(往前找) 看上一行结果
select name,dept_num,salary,
lag(salary,1) over(partition by dept_num order by salary) lead
from employee_contract;

hive中窗口含税排序_窗口函数_25

--first_value
select name,dept_num,salary,
first_value(salary) over(partition by dept_num order by salary) first_value
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_26

----last_value
--到当前行最后一个值就是本身
select name,dept_num,salary,
last_value(salary) over(partition by dept_num order by salary) last_value
from employee_contract;

hive中窗口含税排序_hive中窗口含税排序_27

--cume_dist
-- 小于等于当前值的行数/分组内总行数
select name,dept_num,salary,
cume_dist() over(partition by dept_num order by salary) cd
from employee_contract;

hive中窗口含税排序_数据_28

十一、窗口字句

  1. rows between
  2. undounded preceding :分组第一行到当前行
  3. current row :当前行
  4. n following :往后n行
  5. undounded following :分组最后一行
--案例:

SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS 2 PRECEDING) win12
FROM employee_contract  ORDER BY dept, name;

运行部分案例:

SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3
FROM employee_contract  ORDER BY dept, name;

hive中窗口含税排序_聚合函数_29