Advanced Hive Queries

  • Hive Queries
  • SELECT Basics
  • CTEs and Nested Queries
  • Advanced Statements
  • Hive JOIN
  • Join Queries
  • MAPJOIN
  • Hive Set Operations (UNION)
  • Loading Data
  • Moving Data with LOAD
  • Inserting Data with INSERT (1)
  • Inserting Data with INSERT (2)
  • Writing Hive Data to Files
  • Hive Data Exchange - IMPORT/EXPORT
  • Sorting Data in Hive
  • ORDER BY
  • SORT BY
  • DISTRIBUTE BY
  • CLUSTER BY
  • Hive Aggregation
  • Overview
  • GROUP BY
  • HAVING
  • Basic Aggregation
  • Advanced Aggregation (1)
  • Advanced Aggregation (2)
  • Window Functions
  • Overview
  • Ranking
  • Aggregation
  • Analytics
  • Window Definitions (1)
  • Window Definitions (2)
  • Window Definitions (3)


Hive Queries

SELECT Basics

SELECT projects the rows that match the specified query conditions.
Hive SELECT is a subset of standard database SQL.

  • Usage is similar to MySQL
SELECT 1;
SELECT [DISTINCT] column_name_list FROM table_name;
SELECT * FROM table_name;
SELECT * FROM employee WHERE name != 'Lucy' LIMIT 5;

CTEs and Nested Queries

CTE (Common Table Expression)

-- CTE syntax
WITH t1 AS (SELECT ...) SELECT * FROM t1;
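A slightly fuller sketch, assuming the employee table used in the later examples (with a name column); a CTE defined later in the WITH clause may reference an earlier one:

-- Two CTEs in one WITH clause; r2 reads from r1
WITH r1 AS (SELECT name FROM employee WHERE name != 'Lucy'),
     r2 AS (SELECT name FROM r1)
SELECT * FROM r2;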

Nested queries

-- Nested query example
SELECT * FROM (SELECT * FROM employee) a;

Advanced Statements

Selecting columns with a regular expression

SET hive.support.quoted.identifiers = none;
SELECT `^o.*` FROM offers;

Virtual Columns

  • Names contain two consecutive underscores; used for data validation

INPUT__FILE__NAME: the input file name of the mapper task
BLOCK__OFFSET__INSIDE__FILE: the current global byte offset within the file
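A minimal sketch of querying the virtual columns for validation, assuming the employee table from the earlier examples:

-- Show which file and byte offset each row came from
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, name
FROM employee;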

Hive JOIN

Join Queries

A join queries multiple tables together.
JOIN combines rows from two or more tables in one query.
Similar to SQL JOIN, but Hive only supports equi-joins.

  • Inner join: INNER JOIN
  • Outer join: OUTER JOIN
RIGHT JOIN, LEFT JOIN, FULL OUTER JOIN
  • Cross join: CROSS JOIN
  • Implicit join
    The JOIN happens before the WHERE clause
    Example (think of two overlapping circles; query sketches follow this list)
Area C = Circle1 JOIN Circle2
Area A = Circle1 LEFT OUTER JOIN Circle2
Area B = Circle1 RIGHT OUTER JOIN Circle2
A∪B∪C = Circle1 FULL OUTER JOIN Circle2
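A hedged sketch of the common join forms, assuming the employee and employee_hr tables used later in this section, joined on name:

-- Inner join: only names present in both tables
SELECT a.name, b.name FROM employee a JOIN employee_hr b ON a.name = b.name;
-- Left outer join: all rows from employee, NULL where there is no match
SELECT a.name, b.name FROM employee a LEFT OUTER JOIN employee_hr b ON a.name = b.name;
-- Full outer join: rows from either side
SELECT a.name, b.name FROM employee a FULL OUTER JOIN employee_hr b ON a.name = b.name;
-- Implicit (comma) join; the join condition lives in the WHERE clause
SELECT a.name FROM employee a, employee_hr b WHERE a.name = b.name;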

MAPJOIN

A MAPJOIN is performed entirely on the map side.

  • Joins a small table to a large table
  • Non-equi joins are possible
    Enabling automatic map join conversion (a hint-based sketch follows this subsection):
set hive.auto.convert.join=true (the default)
Joins are then converted to MAPJOINs automatically at runtime

MAPJOIN is not supported

After UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
Before UNION, JOIN, or another MAPJOIN
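A map join can also be requested explicitly with the MAPJOIN hint. A minimal sketch, reusing the employee/employee_hr tables; note the hint form predates automatic conversion and may be ignored when auto conversion is enabled:

-- Ask Hive to load employee_hr (the small table) into memory on the map side
SELECT /*+ MAPJOIN(b) */ a.name, b.name
FROM employee a JOIN employee_hr b ON a.name = b.name;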

Hive Set Operations (UNION)

All subqueries being combined must have the same column names and types.

  • UNION ALL: keeps duplicates after combining
  • UNION: removes duplicates after combining (since v1.2); see the example after this list
    Can be used in a top-level query (since 0.13.0)
    ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY and LIMIT apply to the whole combined result
    Other set operations can be implemented with JOIN/OUTER JOIN
  • Difference and intersection
-- MINUS (difference)
SELECT a.name 
FROM employee a
LEFT JOIN employee_hr b
ON a.name = b.name
WHERE b.name IS NULL;

-- INTERSECT (intersection)
SELECT a.name 
FROM employee a
JOIN employee_hr b
ON a.name = b.name;
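For completeness, a minimal UNION ALL / UNION sketch over the same two tables:

-- UNION ALL keeps duplicate names
SELECT name FROM employee
UNION ALL
SELECT name FROM employee_hr;
-- UNION removes duplicates (Hive 1.2+)
SELECT name FROM employee
UNION
SELECT name FROM employee_hr;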

Loading Data

Moving Data with LOAD

LOAD is used to move data into Hive.

  • The source data is moved into the target table/partition and no longer exists at its original location (with LOCAL, the local file is copied rather than moved)
LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt' 
OVERWRITE INTO TABLE employee;
-- LOCAL means the file is on the local file system; OVERWRITE replaces existing data
LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt' 
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2014, month=12);
-- Without LOCAL, the file is expected to be on HDFS
LOAD DATA INPATH '/tmp/employee.txt'  
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2017, month=12);
  • LOCAL: the file is on the local file system
  • OVERWRITE: replace existing data

Inserting Data with INSERT (1)

Use the INSERT statement to insert data into a table/partition.

-- INSERT supports OVERWRITE (replace) and INTO (append)
INSERT OVERWRITE/INTO TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...)] 
SELECT fields, ... FROM tb_other;
  • Hive supports multiple inserts from the same source table
  • The TABLE keyword is optional with INSERT INTO
  • INSERT INTO can list the target columns to insert into
e.g. INSERT INTO t(x,y,z)
  • INSERT INTO table_name VALUES supports inserting a list of literal values
  • The inserted data must match the number of specified columns

Inserting Data with INSERT (2)

INSERT OVERWRITE TABLE test SELECT 'hello'; -- a form not supported by INSERT
INSERT INTO employee SELECT * FROM ctas_employee; -- insert from a query
-- Multi-insert
FROM ctas_employee
INSERT OVERWRITE TABLE employee SELECT *
INSERT OVERWRITE TABLE employee_internal SELECT *;
-- Insert into a partition
FROM ctas_patitioned 
INSERT OVERWRITE TABLE employee PARTITION (year, month)
SELECT *, '2018', '09';
-- Insert into specific columns (the TABLE keyword can be omitted with INSERT INTO)
INSERT INTO employee(name) SELECT 'John' FROM test LIMIT 1;
-- Insert literal values
INSERT INTO employee(name) VALUES ('Judy'), ('John');

Writing Hive Data to Files

Use the INSERT statement to write/export data to files.

  • Writing to files only supports OVERWRITE
  • Multiple inserts from the same source/table are supported
  • LOCAL: write to the local file system
  • By default data is written as TEXT, with columns separated by ^A
  • Custom delimiters are supported, so files can be exported in other formats: CSV, JSON, etc.
-- Write to a local file, an HDFS directory and a table from the same source
FROM ctas_employee
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out1' SELECT *
INSERT OVERWRITE DIRECTORY '/tmp/out1' SELECT *
INSERT OVERWRITE TABLE employee_internal SELECT *;
-- Write data with a specific format
INSERT OVERWRITE DIRECTORY '/tmp/out3'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM ctas_employee;
-- Another way to pull a table's files out of HDFS
hdfs dfs -getmerge <table_file_path> <local_file>

Hive Data Exchange - IMPORT/EXPORT

IMPORT and EXPORT are used to import and export data.

  • Commonly used for data migration
  • Everything except databases can be exported and imported: all data plus metadata
    Exporting data with EXPORT
EXPORT TABLE employee TO '/tmp/output3';
EXPORT TABLE employee_partitioned PARTITION (year=2014, month=11) TO '/tmp/output5';

Importing data with IMPORT

IMPORT TABLE employee FROM '/tmp/output3';
IMPORT TABLE employee_partitioned PARTITION (year=2014, month=11) FROM '/tmp/output5';

Sorting Data in Hive

ORDER BY

ORDER BY (ASC|DESC) works like standard SQL.

  • Uses a single reducer to perform a global sort
  • Slow; filter the data beforehand
  • CASE WHEN and other expressions can be used
  • Ordering by column position is supported
set hive.groupby.orderby.position.alias=true;
select * from offers order by case when offerid = 1 then 1 else 0 end;
select * from offers order by 1;

SORT BY

SORT BY sorts the data within each reducer (see the sketch below).

  • When the number of reducers is set to 1, it is equivalent to ORDER BY
  • The sort columns must appear in the SELECT column list
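A minimal sketch, reusing the employee_hr table from the DISTRIBUTE BY example below; the reducer setting is just one common way to force more than one reducer so the per-reducer behaviour is visible:

-- Sort within each reducer only; the overall output is not globally ordered
SET mapreduce.job.reduces=2;
SELECT name, evaluation_score FROM employee_hr SORT BY evaluation_score DESC;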

DISTRIBUTE BY

DISTRIBUTE BY is similar to GROUP BY in standard SQL.

  • Ensures that rows with matching column values are sent to the same reducer
  • Does not sort the output of each reducer
  • Usually used before a SORT BY clause
SELECT department_id, name, employee_id, evaluation_score
FROM employee_hr 
DISTRIBUTE BY department_id SORT BY evaluation_score DESC;

CLUSTER BY

CLUSTER BY = DISTRIBUTE BY + SORT BY

  • Does not support ASC|DESC
  • The sort columns must appear in the SELECT column list
  • To make full use of all reducers for a global sort, use CLUSTER BY first and then ORDER BY
SELECT name, employee_id FROM employee_hr CLUSTER BY name;


Hive Aggregation

Overview

  • GROUP BY (grouping)
  • HAVING (conditions on aggregates)
  • Basic aggregation
  • Advanced aggregation

GROUP BY

GROUP BY is used for grouping.

  • Hive's built-in aggregate functions are used together with GROUP BY
  • If no GROUP BY clause is specified, the whole table is aggregated by default
  • Apart from aggregate functions, every other selected column must appear in the GROUP BY
  • GROUP BY supports CASE WHEN and other expressions
  • Grouping by column position is supported
set hive.groupby.orderby.position.alias=true;
select category, max(offervalue) from offers group by category;
-- GROUP BY with an expression
select if(category > 4000, 'GOOD', 'BAD') as newcat, max(offervalue) from offers group by if(category > 4000, 'GOOD', 'BAD');

HAVING

HAVING: filters on the result of a GROUP BY aggregation.

  • Avoids wrapping the GROUP BY in a subquery
  • Expressions can follow HAVING, but this is not recommended
-- Using HAVING
select sex_age.age from employee group by sex_age.age having count(*) <= 1;
-- The same filter with a subquery instead of HAVING
select a.age from ( select count(*) as cnt, sex_age.age 
from employee group by sex_age.age ) a where a.cnt <= 1;

Basic Aggregation

Aggregate data with the built-in aggregate functions (a combined sketch follows this list).

  • max, min, count, sum, avg
max(distinct col), avg(distinct col), etc.
  • collect_set, collect_list: return a set/list of the column's values for each group
    Used together with GROUP BY; can be applied to columns or expressions
    Without GROUP BY, the whole table is aggregated as one group
    Selected columns must appear in the GROUP BY
    Aggregating NULL yields 0
  • e.g. select count(null) returns 0
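A minimal sketch combining a few of these functions, assuming the employee table with the sex_age struct used in the HAVING example:

-- Count, average-free basic aggregation plus the distinct set of names per age
SELECT sex_age.age,
       count(*)          AS cnt,
       collect_set(name) AS names
FROM employee
GROUP BY sex_age.age;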

Advanced Aggregation (1)

GROUPING SETS

  • Performs multiple GROUP BY operations on the same data set in one statement
  • Essentially a UNION ALL of several GROUP BYs

Each GROUPING SETS statement below is followed by its equivalent GROUP BY statement:

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a, b, ())
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
UNION SELECT a, null, SUM(c) FROM tab1 GROUP BY a, null
UNION SELECT null, b, SUM(c) FROM tab1 GROUP BY null, b
UNION SELECT null, null, SUM(c) FROM tab1

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a)
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
UNION SELECT a, null, SUM(c) FROM tab1 GROUP BY a

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b))
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS (a, b)
-- is equivalent to
SELECT a, null, SUM(c) FROM tab1 GROUP BY a
UNION SELECT null, b, SUM(c) FROM tab1 GROUP BY b
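As a concrete sketch (using a hypothetical sales table with region, product and amount columns, not part of the original examples), GROUPING SETS can compute per-pair, per-region and grand totals in one pass:

-- Hypothetical table: sales(region STRING, product STRING, amount DOUBLE)
SELECT region, product, SUM(amount)
FROM sales
GROUP BY region, product
GROUPING SETS ((region, product), region, ());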

Advanced Aggregation (2)

GROUP BY WITH CUBE|ROLLUP

  • CUBE: aggregates over every possible combination of the grouping columns
  • ROLLUP: aggregates at each level of a dimensional hierarchy

Each ROLLUP/CUBE statement below is followed by its equivalent GROUP BY statement:

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b, c WITH CUBE
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b, c
GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b, c WITH ROLLUP
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b, c
GROUPING SETS ((a, b, c), (a, b), (a), ())
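A minimal ROLLUP sketch, again on the hypothetical sales table used above, showing the hierarchy semantics:

-- Subtotals per (region, product), per region, and a grand total
SELECT region, product, SUM(amount)
FROM sales
GROUP BY region, product WITH ROLLUP;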

Window Functions

Overview

Window functions are a special group of functions.

  • They scan multiple input rows to compute each output value, producing one result row per input row
  • Window functions make complex calculations and aggregations possible

Syntax

Function (arg1,..., arg n) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_clause>])
  • PARTITION BY is similar to GROUP BY; if not specified, the whole result set is one partition
  • A window definition can only be given after an ORDER BY clause is specified
  • Several window functions can be used in the same query
  • Filtering on a window function's result must be done in an outer query (see the sketch after this list)
    By purpose they fall into three groups: ranking, aggregation, analytics
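A minimal sketch of that last rule, assuming the employee_contract table used in the following examples: the window result is aliased in an inner query and filtered in the outer one.

-- Keep only the two lowest-paid employees per department
SELECT name, dept_num, salary
FROM (
  SELECT name, dept_num, salary,
         ROW_NUMBER() OVER (PARTITION BY dept_num ORDER BY salary) AS rn
  FROM employee_contract
) t
WHERE rn <= 2;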

Ranking

ROW_NUMBER()

  • Assigns a different number to every row; the numbers are unique and consecutive

RANK()

  • Equal values get the same rank, and the next rank is skipped (1, 1, 3)

DENSE_RANK()

  • Equal values get the same rank, and the next rank is consecutive (1, 1, 2)

NTILE(n)

  • Distributes the ordered rows evenly into n buckets and assigns the bucket number to each row; the bucket number can then be used to select the first or last 1/n of the data

PERCENT_RANK()

  • (current rank - 1) / (total rows - 1); the percentage rank of a value relative to the group of values
-- Window functions: ranking
SELECT 
name, dept_num, salary, 
ROW_NUMBER() OVER () AS row_num, 
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, 
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank, 
NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile 
FROM employee_contract 
ORDER BY dept_num, salary;

Aggregation

COUNT()

  • Counts rows; can be combined with DISTINCT
SELECT 
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM T;

SUM(): sum
AVG(): average
MAX()/MIN(): maximum/minimum
Since Hive 2.1.0, aggregate functions are also supported inside the OVER clause, e.g.:

SELECT rank() OVER (ORDER BY sum(b))  FROM T GROUP BY a;
-- Window functions: aggregation
SELECT 
name, dept_num, salary,
COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
--COUNT(DISTINCT *) OVER (PARTITION BY dept_num) AS row_cnt_dis,
SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,
SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1, 
SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2,
AVG(salary) OVER(PARTITION BY dept_num) AS avgDept,
MIN(salary) OVER(PARTITION BY dept_num) AS minDept,
MAX(salary) OVER(PARTITION BY dept_num) AS maxDept
FROM employee_contract
ORDER BY dept_num, name;

Analytics

CUME_DIST()

  • (number of rows with a value less than or equal to the current value) / (total rows in the group)

LEAD/LAG(col, n)

  • The value of the column from the row n positions ahead (LEAD) or behind (LAG) the current row (n is optional, default 1)

FIRST_VALUE()

  • The first value of the column so far in the window

LAST_VALUE()

  • The last value of the column up to the current row
-- Window functions: analytics
SELECT 
name, dept_num, salary,
LEAD(salary, 2) OVER(PARTITION BY dept_num ORDER BY salary) AS lead,
LAG(salary, 2, 0) OVER(PARTITION BY dept_num ORDER BY salary) AS lag,
FIRST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS first_value,
LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS last_value_default,
LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_value
FROM employee_contract
ORDER BY dept_num, salary;
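CUME_DIST is listed above but not shown in the query; a minimal sketch over the same table:

-- Fraction of rows in the department whose salary is <= the current salary
SELECT name, dept_num, salary,
       CUME_DIST() OVER (PARTITION BY dept_num ORDER BY salary) AS cume_dist
FROM employee_contract;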

Window Definitions (1)

A window definition is described by the [<window_clause>] part of the syntax.

  • Used to further subdivide the partition before applying the analytic function
    Two kinds of window definitions are supported
  • Row type windows
  • Range type windows
    The RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, LEAD, LAG and ROW_NUMBER functions do not support a window clause

Window Definitions (2)

Row windows: the window is determined by row positions before or after the current row.


SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS 2 PRECEDING) win12
FROM employee_contract  ORDER BY dept, name;

Window Definitions (3)

Range windows select the rows in the group whose values fall within a specified range interval.

  • The range values/interval must be numeric or date types
  • Currently only a single ORDER BY column is supported
SUM(close) RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING
-- Assuming the current close value is 3000, this clause includes the rows in the partition whose value ranges from 2500 to 4000
-- Example
SELECT name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
salary - 1000 as sal_r_start,salary as sal_r_end,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) win13
FROM employee_contract ORDER BY dept, name;