hive sql 强转 hive-sql

转载

笑傲江湖求败 2023-12-28 13:27:20

文章标签 hive sql 强转字段 hive HDFS 文章分类 Hive 大数据

内容大纲:
1. HQL DML语句 //操作表数据的
load加载数据
insert插入数据
单次插入
多重插入
数据导出
2. HQL DQL数据查询语言
cluster by
distribute by + sort by //它们加起来等价于 cluster by(分桶查询)
order by
union
CTE
Hive的join语法
3. Hive的Shell命令行和参数配置方式
4. 掌握Hive的函数
//函数分类：内置函数、用户定义函数
//UDF、UDTF、UDAF

********************************* 具体内容 *********************************

HQL DML语句_load方式加载数据
案例:
演示DML相关操作: 数据操作语言, 主要是对表数据进行更新操作(增, 删, 改).
具体的HQL语句:
show functions; //查看hive支持的所有函数.
准备动作:
//创建数据表, 3张表. 源数据格式: 95001,李勇,男,20,CS
//建表student_local 用于演示从本地加载数据
create table student_local(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';

//建表student_HDFS  用于演示从HDFS加载数据到非分区表
 create table student_HDFS(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
 
 //建表student_HDFS_p 用于演示从HDFS加载数据到分区表
 create table student_HDFS_p(num int,name string,sex string,age int,dept string) partitioned by(country string) 
 row format delimited fields terminated by ',';

案例: 演示添加表数据的方式.
思路1: //通过 load data... 方式添加, 分别通过不同的方式往上述的三张表中上传数据.
//方式1: 通过本地(Linux系统)加载数据.
load data local inpath '/export/data/hivedata/students.txt' into table student_local;

//方式2: HDFS文件系统.
 	hadoop fs -put /export/data/hivedata/students.txt /user/hive/warehouse/itheima.db/student_hdfs

 	//方式3: 往分区表中添加数据.
 	load data local inpath '/export/data/hivedata/students.txt' into table student_HDFS_p partition(country="China");

 	//方式4: 从HDFS文件系统的文件中加载数据到 hive表中.
 	create table student_HDFS_2(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
 	
 	//如果带local就是本地文件系统(例如: Linux系统), 如果不带local, 就是HDFS文件系统.
 	load data inpath '/data/students.txt' into table student_HDFS_2;

HQL DML语句_插入表数据之单次插入(重点)

创建一张源表student
drop table if exists student; //表如果存在就删除
//创建表的动作
create table student(
num int,name string,sex string,age int,dept string
) row format delimited fields terminated by ',';
加载数据
load data local inpath '/export/data/hivedata/students.txt' into table student;
查询表数据.
select * from student;
创建一张目标表只有两个字段
create table student_insert1(sno int,sname string);
使用insert+select插入数据到新表中(重点!!!)
insert into table student_insert1
select num,name from student;
//格式:insert into table 表名 select 字段... from 数据源的表名
//相当于吧select列出的字段数据一个个往上面给,所以字段顺序一定要一致!
查询结果
select * from student_insert1;

HQL DML语句_插入表数据之多重插入(重点)
背景:
刚才我们是把student表中的数据插入到 student_insert1这一张表中, 如果是多张表该如何实现呢?
例如:
把student表中的数据, 根据字段分别插入到 student_insert2, student_insert3 表中.
步骤:
1. 创建 student_insert2, student_insert3 表.
create table student_insert2(sname string) row format delimited fields terminated by ',';
create table student_insert3(sno int, sex string) row format delimited fields terminated by ',';

2.重点重点 从student表中, 插入数据到 student_insert2, student_insert3 两张表中.

 	方式: //多重插入, 即: 一次扫描, 多次插入
 		from student
 		insert into student_insert2
 		select name     //多个insert之间没有逗号,加逗号相当于逗号后面有字段名没写,肯定报格式错
 		insert into student_insert3
 		select num, sex;
 		
 		//格式相当于单次插入中的from 数据源表提到前面

HQL DML语句_插入表数据之导出表数据
功能：
把select查询的结果导出成为一个文件。
注意：
导出操作是一个overwrite(复写,重写)操作，它会覆盖掉指定文件夹下所有的内容,不亚于删除操作,可能会让你凉凉。慎重！！！
步骤:
1. 当前库下已有一张表student
select * from student;

2. 导出查询结果到HDFS指定目录下
 	insert overwrite directory '/tmp/hive_export/e1' select num,name,age from student limit 2;   //默认导出数据字段之间的分隔符是\001

 3. 导出时指定分隔符和文件存储格式
 	-- 格式: insert overwrite directory 'hdfs路径' select * from 表名;
 
 	insert overwrite directory '/tmp/hive_export/e2' row format delimited fields terminated by ','
 	stored as orc			//stored:存储     这里表示设置存储方式为列式存储(二进制),    默认: textfile(行式存储)
 	select * from student;

 4. 导出数据到本地文件系统指定目录下(加个local关键字)
 	
 	-- 格式: insert overwrite local directory 'linux本地路径' select * from 表名;
 	
 	insert overwrite local directory '/export/data/hivedata/e1' select * from student;
 	
 5. 细节: 
 
 	1. 目的地目录(aa)可以不存在, 会自动创建.    
 	2. 导出后, HDFS文件的分隔符是 默认分隔符'\001'
 	3. 我们可以在导出的时候, 指定: 导出文件的 字段分隔符.      
 	4. 若文件存在,则操作是:覆盖.
 	5.若文件已经分区,则只覆盖对应的分区.

Hive-DQL-基本查询详解(准备动作)
准备动作:
1. 创建数据库
create database day09;
2. 切换数据库
use day09;
传统(MySQL)的SQL语句和 HQL语句关于单表查询语句格式对比(重点!!!):
/*
背下来!!!!!!!!!!!!
传统(MySQL)的SQL语句和 HQL语句关于单表查询语句格式对比如下:
传统(MySQL)的SQL语句:
select distinct 列1, 列2... from 表名
where 组前筛选
group by 分组字段1, 分组字段2...
having 组后筛选
order by 排序的字段1 asc/desc, 排序的字段2 asc/desc -- 全局排序.
limit 起始索引, 每页的数据条数;

HQL语句:
 	[with CTE公共表达式]   Common Table Expression公共表达式的意思.
 	select distinct/all 列1, 列2... from 表名
 	where 组前筛选
 	group by 分组字段1, 分组字段2...
 	having 组后筛选
 	order by 排序的字段1 asc/desc, 排序的字段2 asc/desc           -- 全局排序.
 	cluster by 字段 / distribute by 字段 sort by 字段           -- 分桶查询  cluster by =  distribute by +  sort by   
 																								  //cluster by表示根据A字段分根据A排序,distribute by可以做到根据A字段分根据B字段排序
 	limit 起始索引, 每页的数据条数;
 	
 	总结:HQL比传统SQL多了:公共表达式(with CTE公共表达式)和分桶(cluster by )

准备数据的步骤(不用看):		//创建普通表 t_usa_covid19, 分区表: t_usa_covid19_p(按照统计时间还有洲进行分区)
	1. 创建普通表t_usa_covid19
		drop table if exists t_usa_covid19;		//如果表存在, 就删除. 
		
		CREATE TABLE t_usa_covid19 (
			count_date string,
			county     string,
			state      string,
			fips       int,
			cases      int,
			deaths     int
		) row format delimited fields terminated by ",";
	
	2. 将源数据load加载到t_usa_covid19表对应的路径下
		load data local inpath '/export/data/hivedata/us-covid19-counties.dat' into table t_usa_covid19;
	
	3. 查询表数据.
		select * from t_usa_covid19;

	4. 创建一张分区表 基于count_date日期,state州进行分区
		CREATE TABLE if not exists t_usa_covid19_p(
			county string,
			fips   int,
			cases  int,
			deaths int
		) partitioned by (count_date string, state string)
		row format delimited fields terminated by ",";

	5. 使用动态分区插入将数据导入t_usa_covid19_p中
		set hive.exec.dynamic.partition.mode = nonstrict;

		insert into table t_usa_covid19_p partition(count_date, state)
		select county, fips, cases, deaths, count_date, state from t_usa_covid19;

	6. 小细节, 建议大家开启本地优化模式(这里我就不开启了), 可以提高部分hiveSQL的执行效率.
		set hive.exec.mode.local.auto; 			//查看该配置项的值
		set hive.exec.mode.local.auto=true;		//设置该配置项的值, 临时设置, 当本次会话结束,改配置就不可用了.
												//如果开启本地模式, 则查询分桶表的时候, 必须手动设置分桶的个数, 否则结果可能跟你想的不一样.

Hive-DQL-基本查询详解(不用管,上面格式背好即可,要练习的话,打开课堂的SQL语句敲即可)

全查询
select * from t_usa_covid19;
列筛选, 即: 只查询指定的列.
select state from t_usa_covid19; //获取所有的洲
select all state from t_usa_covid19; //上述代码的完整格式是
select * from t_usa_covid19 where state = 'New York'; //查询New York洲各县的所有的信息
//查询美国各州的确诊人数和死亡人数.
select state, sum(cases) total_cases, sum(deaths) total_deaths
from t_usa_covid19 group by state;
//演示分组后的筛选, 只统计死亡人数在100及以上的信息.
select state, sum(cases) total_cases, sum(deaths) total_deaths
from t_usa_covid19 group by state having total_deaths >= 100;
//演示分组前的筛选, 即: 只统计 Georgia,Illinois,Kentucky,Texas,Virginia洲的信息.
select state, sum(cases) total_cases, sum(deaths) total_deaths
from t_usa_covid19
where state in ('Georgia', 'Illinois', 'Kentucky', 'Texas', 'Virginia')
group by state having total_deaths >= 100;
//演示排序, 按照确诊总人数降序排序.
select state, sum(cases) total_cases, sum(deaths) total_deaths
from t_usa_covid19
where state in ('Georgia', 'Illinois', 'Kentucky', 'Texas', 'Virginia')
group by state having total_deaths >= 100 order by total_cases desc;
//最终需求: 只要前两条信息.
select state, sum(cases) total_cases, sum(deaths) total_deaths
from t_usa_covid19
where state in ('Georgia', 'Illinois', 'Kentucky', 'Texas', 'Virginia')
group by state
having total_deaths >= 100
order by total_cases desc
limit 0, 2;

Hive-DQL-分桶查询(cluster,distribute,sort)详解(重点)

默认的分桶查询方式: cluster by 分桶字段, 即会根据这个字段分桶, 也会根据这个字段排序(升序)
use itheima;
根据学生的性别分桶
select * from student cluster by sex;
根据学生的编号分桶.
//设置MapReduce的个数, 有几个, 就能分成几个桶, 默认是-1, 即: 按照你的计算量, hive自动划分.
set mapreduce.job.reduces=3;
select * from student cluster by sno;

需求:
根据sno分桶, 根据age降序排列.

//为了方便用户的需求, hive对cluster的功能做了拆分:
 //distribute by 负责分,  sort by 负责 排序, 即: cluster = distribute + sort
 //select * from student cluster by sno order by age desc;
 //select * from student cluster by sno sort by age desc;

 select * from student cluster by sno;
 //上述的语句相当于
 select * from student distribute by sno sort by sno;

实现需求:
需求: 根据sno分桶, 根据age降序排列.

select * from student distribute by sno sort by sage desc;

小细节, 关于order by 和 sort by的区别:
//order by: 是做全局排序的, 不管你设置了几个MapReduce, 结果只会执行1个.
//sort by: 是用于做分桶后的, 每个桶内部数据的排序的, 有几个MapReduce,就有几个桶.
set mapreduce.job.reduces=3;
select * from student order by sage desc;

Hive-DQL-联合查询详解(重点!)
/*
细节:
1. Hive1.2.0之前只支持 union all, 即: 全部联合查询, 不会去除重复数据.
2. hive1.2.0之后加入了 union distinct, 默认会对重复数据进行去重.
3. 要进行union操作的表的字段和列名必须保持一致.
4. 如果只写union,则默认是: union distinct 也就是默认去重
*/

我们这里用 student_local 和 student_hdfs做演示.
select * from student_local;
select * from student_hdfs;
演示直接写 union
select * from student_local
union
select * from student_hdfs;
上述的SQL, 等价于
select * from student_local
union distinct
select * from student_hdfs;
union all, //不会去除重复数据.
select * from student_local
union all
select * from student_hdfs;

5.重点如果要将order by，sort by，cluster by，distribute by或limit应用于整个select, 请将子句放在整个select查询的最后
select * from student_local
union all
select * from student_hdfs limit 2; //对整个数据取前两条

6.重点如果要将order by，sort by，cluster by，distribute by或limit应用于单个select, 请将子句放在括住select的括号内
select * from (select * from student_local limit 2) t1
union all
select * from (select * from student_hdfs limit 3) t2;
//第一个数据取2条,第二个取3条,一共5条数据

Hive-DQL-CTE表达式详解(重点)
1.重点 select 语句中的CTE功能.
with t1 as (select * from student where sno in (95001, 95002, 95003, 95004, 95005, 95006))
select * from t1;

//重点:with t1 as(查询到的数据)  字面意思为用t1 作为 ()里的东西,也就是()里的数据被t1这个临时表存储下来了
 //即: 相当于创建一张临时表, 该表的数据可以被整个SQL语句中使用, 且可以重复使用.

2.注意哦 from格式(相当于前面的多重插入的那种格式,把from调到前面去了)
with t1 as (select * from student where sno in (95001, 95002, 95003, 95004, 95005))
from t1
select *;

3.重点 CTE chars, 链式 (相当于在t1的基础上再做过滤时,还要用临时表存储,此时with可省略)
with t1 as ( select * from student), -- 使用CTE链式编程的时候, 记得要写逗号.
t2 as ( select sno, sname, sage from t1)
select * from t2 limit 0, 5;

-- 上述写法等价于, 子查询写法之:
 select * from (select sno, sname, sage from (select * from student limit 0, 5) t2) t1;

union 联合查询.
with t1 as (select * from student where sno = 95001),
t2 as (select * from student where sno = 95002)
select * from t1
union
select * from t2;
ctas(creat table as select) 创建一张表来自于后面的查询语句
//表的字段个数名字顺序和数据行数都取决于查询
create table tmp1 as
with t1 as (select * from student where sno in (95001, 95002, 95003, 95004, 95005, 95006))
select * from t1;
//查询表数据
select * from tmp1;
创建视图
create view v1 as
with t1 as (select * from student where sno in (95001, 95002, 95003, 95004, 95005, 95006))
select * from t1;
select * from v1; //查询表数据

Hive-DQL-join查询详解(6种重点)
//切换库
use day09;
show tables;

准备源数据表.
//table1: 员工表
CREATE TABLE employee (
id int,
name string,
deg string,
salary int,
dept string
) row format delimited fields terminated by ',';
//table2: 员工家庭住址信息表
CREATE TABLE employee_address (
id int,
hno string,
street string,
city string
) row format delimited fields terminated by ',';
//table3:员工联系方式信息表
CREATE TABLE employee_connection (
id int,
phno string,
email string
) row format delimited fields terminated by ',';
加载数据到表中
load data local inpath '/export/data/hivedata/employee.txt' into table employee;
load data local inpath '/export/data/hivedata/employee_address.txt' into table employee_address;
load data local inpath '/export/data/hivedata/employee_connection.txt' into table employee_connection;
查询表数据
select * from employee;
select * from employee_address;
select * from employee_connection;

4.重点重点各种多表查询, 掌握: 内连接, 左外连接.

4.1 第1种: (内)连接查询, 查询的是: 两张表的交集. inner join
	select e1.*, e2.* from employee e1
	inner join
	employee_address e2 on e1.id = e2.id;

	//显式内连接,上述语句可以省略inner,如下: 
	select e1.*, e2.* from employee e1 join employee_address e2 on e1.id = e2.id;

	//隐式内连接,上述语句可以改为如下:   (直接逗号隔开两个表,然后用where 感觉这个格式就是懒得写inner哈哈哈)
	select e1.*, e2.* from employee e1, employee_address e2 where e1.id = e2.id;

	
	
4.2 第2种: 左外连接:  左表的全集 + 表的交集.
	select e1.*, e2.* from employee e1
	left outer join 	//outer可以省略
	employee_address e2 on e1.id = e2.id;

	
	
4.3 第3种: 右外连接:  右表的全集 + 表的交集.
	select e1.*, e2.* from employee e1
	right outer join 	// outer可以省略
	employee_address e2 on e1.id = e2.id;

	
	
4.4 第4种: 满外连接查询, 外连接 全外连接 full join == full outer join
	//full outer join 关键字只要左表（table1）和右表（table2）其中一个表中存在匹配，则返回行.
	//full outer join 关键字结合了 left join 和 right join 的结果。
	
	select e1.*, e2.* from employee e1
	full outer join 	//outer可以省略
	employee_address e2 on e1.id = e2.id;

	
	
	
4.5 第5种: 左半连接, left semi join
	//相当于 inner join, 但是只返回左表全部数据，右表数据全部抛弃, 只不过效率高一些
	应用场景: 
		统计每天的新用户或者老用户.
	select * from employee e left semi join employee_address e_addr on e.id = e_addr.id;

	
	
4.6 第6种: cross join 交叉查询, 相当于查询表的笛卡尔积, 无意义, 一般不用.
	select * from employee e cross join employee_address e_addr;

Hive Shell命令行(了解)
分类:
1. 批处理：一次连接，一次交互，执行结束断开连接
2. 交互式处理：保持持续连接，一直交互
注意：
如果说hive的shell客户端, 指的是第一代客户端bin/hive
而第二代客户端bin/beeline属于JDBC客户端不是shell。
关于hive的第一代客户端 ./hive 的作用:
/*
概述:
cd /export/server/apache-hive-3.1.2-bin/bin/hive(第1代客户端) 它的底层是用Shell脚本写的, 主要的作用有 4种.
作用:
1. 可以充当hive的客户端使用, 即: 第1代客户端, 直接连接 metastore 元数据服务即可连接Hive.
2. 可以用来开启某些服务, 例如: metastore, hiveserver2
nohup ./hive --service metastore &
nohup ./hive --service hiveserver2 &
3. 它可以执行hiveSQL语句.
./hive -e "HQL语句" 注意: 只执行一次.
4. 它可以执行HiveSQL脚本.
./hive -f HQL脚本.sql 这个才是生产环境中, 真正的用法, 后续的这个 HQL脚本文件的后缀名可以是任意后缀名, 但是里边的HQL语句格式要正确.
*/
Hive的参数配置方式详解

Hive有哪些参数可以配置？
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
配置方式有哪些？
见图片:三种方式
总结

从方式1到方式3, 影响的范围是越来越小的。
从方式1到方式3 优先级越来越高。set命令设置的会覆盖其他的。
Hive作为的基于Hadoop的数仓，也会把 Hadoop 的相关配置解析加载进来。

Apache Hive-内置运算符(多看看)
官方链接：
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF一些Hive命令:
show functions; //显示所有的函数和运算符
describe function +; //查看运算符或者函数的使用说明
describe function extended +; //使用extended 可以查看更加详细的使用说明
具体分类:
1. 关系运算符
2. 算术运算符
3. 逻辑运算符
参考代码:
-- 1、创建表dual
create table dual(id string);

-- 2、加载一个文件dual.txt到dual表中
-- dual.txt只有一行内容：内容为一个空格

load data local inpath '/root/hivedata/dual.txt' into table dual;

-- 3、在select查询语句中使用dual表完成运算符、函数功能测试
select 1+1 from dual;
select 1+1;

//----------------Hive中关系运算符--------------------------
// is null空值判断
select 1 from dual where 'itcast' is null;

// is not null 非空值判断
select 1 from dual where 'itcast' is not null;

// like比较： _表示任意单个字符 %表示任意数量字符
// 否定比较： NOT A like B
select 1 from dual where 'itcast' like 'it_';
select 1 from dual where 'itcast' like 'it%';
select 1 from dual where  'itcast' not like 'hadoo_';
select 1 from dual where  not 'itcast' like 'hadoo_';

// rlike：确定字符串是否匹配正则表达式，是regexp_like()的同义词。
select 1 from dual where 'itcast' rlike '^i.*t$';
select 1 from dual where '123456' rlike '^\\d+$';  // 判断是否全为数字
select 1 from dual where '123456aa' rlike '^\\d+$';

// regexp：功能与rlike相同 用于判断字符串是否匹配正则表达式
//手机号是否合法.  规则: 1.必须是11位.  2.必须是纯数字.  3.第1个数字必须是1.  4.第二位数字可以是 3 ~ 9
select '13112345678' regexp '^1[3-9]\\d{9}$';

// Hive中算术运算符----------------
// 取整操作: div  给出将A除以B所得的整数部分。例如17 div 3得出5。
select 17 div 3;

// 取余操作: %  也叫做取模mod  A除以B所得的余数部分
select 17 % 3;

// 位与操作: &  A和B按位进行与操作的结果。 与表示两个都为1则结果为1
select 4 & 8 from dual;  // 4转换二进制：0100 8转换二进制：1000
select 6 & 4 from dual;  // 4转换二进制：0100 6转换二进制：0110

// 位或操作: |  A和B按位进行或操作的结果  或表示有一个为1则结果为1
select 4 | 8 from dual;
select 6 | 4 from dual;

// 位异或操作: ^ A和B按位进行异或操作的结果 异或表示两者的值不同,则结果为1
select 4 ^ 8 from dual;
select 6 ^ 4 from dual;


// 3、Hive逻辑运算符
// 与操作: A AND B   如果A和B均为TRUE，则为TRUE，否则为FALSE。如果A或B为NULL，则为NULL。
select 1 from dual where 3>1 and 2>1;

// 或操作: A OR B   如果A或B或两者均为TRUE，则为TRUE，否则为FALSE。
select 1 from dual where 3>1 or 2!=2;

// 非操作: NOT A 、!A   如果A为FALSE，则为TRUE；如果A为NULL，则为NULL。否则为FALSE。
select 1 from dual where not 2>1;
select 1 from dual where !2=1;

// 在:A IN (val1, val2, ...)  如果A等于任何值，则为TRUE。
select 1 from dual where 11  in(11,22,33);

// 不在:A NOT IN (val1, val2, ...) 如果A不等于任何值，则为TRUE
select 1 from dual where 11 not in(22,33,44);

扩展:
/*
二进制和十进制快速转换法, 8421码表:
二进制数据: 0 0 0 0 0 0 0 0
对应的十进制数据: 128 64 32 16 8 4 2 1

二进制转十进制案例, 求: 101011的十进制, 其实相当于求 0010 1011的十进制 = 32 + 8 + 2 + 1 = 43
十进制转二进制案例, 求: 56的二进制, 56 = 32 + 16 + 8 = 0011 1000

细节: 第1位是符号位, 0:正数, 1:负数,  其它的是数值位.

0100        4的二进制
0110        6的二进制

& -----------------
0100 4
*/

Apache Hive-函数的分类(UDF, UDAF, UDTF)(理解)
关于Hive的函数的分类.
/*
Hive中的函数主要分为: 内置函数和自定义函数, 但是整个是早期的分发, 为了更好的划分用户自定义函数, Hive又把函数分为三大类, 分别如下:
内置函数:
用户自定义函数:
UDF: 普通函数, 一进一出. 即: 输入1行, 输出1行.
select split('aa,bb,cc,dd', ','); -- 按照,切割, ["aa","bb","cc","dd"]

UDAF:   聚合函数, 多进一出.     即: 输入多行, 输出1行.
			select count(1) from student;

		UDTF:   表生成函数, 一进多出.   即: 输入1行, 输出多行.
			select explode(array('aa', 'bb', 'cc'));    -- explode()炸裂函数.

后来Hive发现用 UDF, UDAF, UDTF来划分函数非常方便, 于是有了 函数扩大化的概念, 即: 本来UDF, UDAF, UDTF是用来划分 用户自定义函数的,
现在 UDF, UDAF, UDTF 是用来划分Hive中所有函数的, 包括内置函数 和 用户自定义函数.

Apache Hive-常用的内置函数详解
字符串函数:
// 字符串截取函数：substr(str, pos[, len]) 或者 substring(str, pos[, len])
select substr("angelababy",-2); --pos是从1开始的索引，如果为负数则倒着数
select substr("angelababy",2,2);

//正则表达式替换函数：regexp_replace(str, regexp, rep)
select regexp_replace('100-200', '(\\d+)', 'num'); --正则分组

//URL解析函数：parse_url 注意要想一次解析出多个 可以使用parse_url_tuple这个UDTF函数
select parse_url('http://www.itcast.cn/path/p1.php?query=1', 'HOST');

//分割字符串函数: split(str, regex)
select split('apache hive', '\\s+');	//匹配一个或者多个空白符

//json解析函数：get_json_object(json_txt, path)
//$表示json对象
select get_json_object('[{"website":"www.itcast.cn","name":"allenwoon"}, {"website":"cloud.itcast.com","name":"carbondata 中文文档"}]', '$.[1].website');

日期函数
// 获取当前日期: current_date
select current_date();
// 获取当前时间戳: current_timestamp
// 同一查询中对current_timestamp的所有调用均返回相同的值。
select current_timestamp();
// 获取当前UNIX时间戳函数: unix_timestamp
select unix_timestamp();
// 日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
// UNIX时间戳转日期函数: from_unixtime
select from_unixtime(1620723323);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
// 日期比较函数: datediff 日期格式要求'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2012-12-08','2012-05-09');
// 日期增加函数: date_add
select date_add('2012-02-28',10);
// 日期减少函数: date_sub
select date_sub('2012-01-1',10);

数字函数:
// 取整函数: round 返回double类型的整数值部分（遵循四舍五入）
select round(3.1415926);
// 指定精度取整函数: round(double a, int d) 返回指定精度d的double类型
select round(3.1415926,4);
// 向下取整函数: floor
select floor(3.1415926);
select floor(-3.1415926);
// 向上取整函数: ceil
select ceil(3.1415926);
select ceil(-3.1415926);
// 取随机数函数: rand 每次执行都不一样返回一个0到1范围内的随机数
select rand();
// 指定种子取随机数函数: rand(int seed) 得到一个稳定的随机数序列
select rand(5);

条件函数:
// if条件判断: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
select if(1=2,100,200);
select if(sex ='男','M','W') from student limit 3;

// 空判断函数: isnull( a )
select isnull("allen");
select isnull(null);

// 非空判断函数: isnotnull ( a )
select isnotnull("allen");
select isnotnull(null);

// 空值转换函数: nvl(T value, T default_value)
select nvl("allen","itcast");
select nvl(null,"itcast");

// 非空查找函数: COALESCE(T v1, T v2, ...)
// 返回参数中的第一个非空值；如果所有值都为NULL，那么返回NULL
select COALESCE(null,11,22,33);
select COALESCE(null,null,null,33);
select COALESCE(null,null,null);

// 条件转换函数: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;

Type Conversion Functions 类型转换函数:
前置知识: Hive中支持类型的隐式转换有限制自动转换不保证成功就显示null
//cast显示类型转换函数

--任意数据类型之间转换:cast
select cast(12.14 as bigint);
select cast(12.14 as string);
select cast("hello" as int);
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+

Data Masking Functions 数据脱敏函数
//mask脱敏掩码处理, 数据脱敏：让敏感数据不敏感, 13455667788 --->134****7788
// mask, 将查询回的数据，大写字母转换为X，小写字母转换为x，数字转换为n。
select mask("abc123DEF");
select mask("abc123DEF",'-','.','^'); // 自定义替换的字母

// mask_first_n(string str[, int n]
// 对前n个进行脱敏替换
select mask_first_n("abc123DEF",4);

// mask_last_n(string str[, int n])
select mask_last_n("abc123DEF",4);

// mask_show_first_n(string str[, int n])
// 除了前n个字符，其余进行掩码处理
select mask_show_first_n("abc123DEF",4);

// mask_show_last_n(string str[, int n])
select mask_show_last_n("abc123DEF",4);

// mask_hash(string|char|varchar str)
// 返回字符串的hash编码。
select mask_hash("abc123DEF");

Misc. Functions 其他杂项函数、加密函数
// 如果你要调用的java方法所在的jar包不是hive自带的可以使用add jar添加进来
// hive调用java方法: java_method(class, method[, arg1[, arg2..]])
select java_method("java.lang.Math","max",11,22);

// 反射函数: reflect(class, method[, arg1[, arg2..]])
select reflect("java.lang.Math","max",11,22);

// 取哈希值函数:hash
select hash("allen");

// current_user()、logged_in_user()、current_database()、version()

// SHA-1加密: sha1(string/binary)
select sha1("allen");

// SHA-2家族算法加密：sha2(string/binary, int)  (SHA-224, SHA-256, SHA-384, SHA-512)
select sha2("allen",224);
select sha2("allen",512);

// crc32加密:
select crc32("allen");

// MD5加密: md5(string/binary)
select md5("allen");

扩展(重重点!!!!):
24. 说一下你对分页的四个参数的理解. //直白一点问, 分页的时候, 如何计算总页数?
总数据条数: //select count(1) from 表名; 假设: 23条
每页的数据条数: //产品经理, 项目经理, 老大定的. 假设: 5条
每页的起始索引: //公式: (当前页数 - 1) * 每页的数据条数第3页: 起始索引 10
总页数: //公式: (总页数 + 每页的数据条数 - 1) /(取整除) 每页的数据条数总页数: 5页 (25 + 5 - 1) / 5 = 5 10 / 3 = 3 即: 整除

count(字段), count(1), count(*)的区别是什么?

效率对比, 从高到低分别是:
count(索引列) > count(1) > count(字段) > count(*)
关于是否统计null数据
count(1), count(*), 会统计null数据, 哪怕整行数据都是null, 也统计.
count(字段), 只统计该列(字段)的非null值.

order by, cluster by, distribute by, sort by之间的区别是什么?
/*
order by, cluster by, distribute by, sort by之间的区别如下:
order by: 全局排序.
cluster by: 分桶查询, 且按照分桶字段排序. 即: 分桶字段和排序字段必须是同一个, cluster by = distribute by + sort by
distribute by: 分桶查询中(分桶 = 分文件), 负责分的部分, 类似于: 传统MySQL的 group by语句.
sort by: 排序的, 局部排序, 即: 对桶内的数据进行排序.
细节:
1. 如果要进行分桶查询, 则: 有几个分桶, 就要设置几个 ReduceTask. 回想MR核心8步: 有几个ReduceTask任务, 就有几个文件.
2. 通过如下的命令, 可以设置MR程序的 ReduceTask任务的个数, 默认是-1, 表示有Hive根据我们的数据量(默认维度)来决定给几个桶.
set mapreduce.job.reduces = 数字; 设置ReduceTask任务的数量, 默认是-1
*/