hive的hex函数 hive 函数大全

转载

西门吹雪 2023-05-24 10:26:17

文章标签 hive 大数据字符串 Hive unix 文章分类 Hive 大数据

Hive 中的函数概览

Hive 内置函数

Hive 内置函数说明
Hive 数学函数
Hive 字符串函数
Hive 日期函数
Hive 条件函数
Hive 转换函数
Hive 行转列
Presto 分组函数（GROUPING SETS、CUBE、ROLLUP）

Hive 数列操作函数

array_repeat 函数
element_at 函数

Hive 表生成函数

stack 函数
LATERAL VIEW侧视图
Reflect函数

Hive 开窗函数

窗口函数(1) NTILE,ROW_NUMBER,RANK,DENSE_RANK
Hive分析窗口函数(2) SUM,AVG,MIN,MAX
Hive分析窗口函数(3) LAG,LEAD,FIRST_VALUE,LAST_VALUE
窗口累加：order by 的不同效果
多个分组的数据聚合：窗口函数 + 分组
聚合函数中的判断条件写法
rows between 设置窗口范围

Hive 自定义函数

Hive 内置函数

Hive 内置函数说明

函数较多，可以参考官方文档

LanguageManual UDF - Apache Hive - Apache Software Foundation

查看系统自带的函数

show functions;

显示自带的函数的用法

desc function round;

hive的hex函数 hive 函数大全_大数据

详细显示自带的函数的用法

desc function extended round;

hive的hex函数 hive 函数大全_unix_02

Hive 数学函数

取整函数: round

语法: round(double a)
返回值: BIGINT
说明:返回double类型的整数值部分（遵循四舍五入）

-- 取整，结果：3
select round(3.1415926);

hive的hex函数 hive 函数大全_Hive_03

指定精度取整函数: round

语法: round(double a, int d)
返回值: DOUBLE
说明:返回指定精度d的double类型

-- 取整，精度为4，结果：3.1416
select round(3.1415926,4);

向下取整函数: floor

语法: floor(double a)
返回值: BIGINT
说明:返回等于或者小于该double变量的最大的整数

-- 向下取整，结果：3
select floor(3.1415926);

# 四舍五入为指定的小数位数，带精度
> SELECT round(2.5, 0);
 3.0
 
> SELECT round(2.5, 1);
2.5

# 四舍五入为整数
> SELECT round(2.5);
 3

向上取整函数: ceil

语法: ceil(double a)
返回值: BIGINT
说明:返回等于或者大于该double变量的最小的整数

-- 向上取整，结果：4
select ceil(3.1415926);

获取随机数函数: rand

语法: rand(),rand(int seed)
返回值: double
说明:返回一个0到1范围内的随机数。如果指定种子seed，则会返回固定的随机数

-- 获取随机数，结果：0.5577432776034763
select rand();

-- 获取随机数，结果：0.6638336467363424
select rand();

-- 获取随机数，指定种子，结果：0.7220096548596434
select rand(100);

-- 获取随机数，指定种子，结果：0.7220096548596434
select rand(100);

幂运算函数: pow

语法: pow(double a, double p)
返回值: double
说明:返回a的p次幂

-- 获取 2的4次方，结果：16.0
select pow(2,4) ;

绝对值函数: abs

语法: abs(double a)  abs(int a)
返回值: double       int
说明:返回数值a的绝对值

-- 获取绝对值，结果：3.9
select abs(-3.9);

-- 获取绝对值，结果：10.9
select abs(10.9);

取最小值函数：least

取几个数中的最小值，不是 min 函数

-- 取 4 个数中的最小值
select least(1,2,3,4);

取最大值函数：greatest

取几个数中的最大值，除去 null 值

注意：max 函数不能比较 null 值

> select greatest(null,1,9,7,4);
> 9

Hive 字符串函数

字符串长度函数：length

语法: length(string A)
返回值: int
说明：返回字符串A的长度

-- 返回字符串长度，结果：7
select length('abcedfg');

hive的hex函数 hive 函数大全_hive_04

字符串反转函数：reverse

语法: reverse(string A)
返回值: string
说明：返回字符串A的反转结果

-- 字符串反转，结果：gfdecba
select reverse('abcedfg');

字符串连接函数：concat

语法: concat(string A, string B…)
返回值: string
说明：返回输入字符串连接后的结果，支持任意个输入字符串
abcdefgh
-- 字符串连接，结果：abcdefgh
select concat('abc','def','gh');

字符串连接函数-带分隔符：concat_ws

语法: concat_ws(string SEP, string A, string B…)
返回值: string
说明：返回输入字符串连接后的结果，SEP表示各个字符串间的分隔符

-- 以 ',' 为分隔符，连接多个字符串，结果：abc,def,gh
select concat_ws(',','abc','def','gh');

字符串截取函数：substr,substring

语法: substr(string A, int start),substring(string A, int start)
返回值: string
说明：返回字符串A从start位置到结尾的字符串

-- 起始位置为3截取，从1开始，包含起始位置，结果：cde
select substr('abcde',3);

-- 起始位置为3截取，从1开始，包含起始位置，结果：cde
select substring('abcde',3);

-- 从右往左，第一个位置为起始位置，结果：e
select substr('abcde',-1);

字符串截取函数：substr,substring

语法: substr(string A, int start, int len),substring(string A, intstart, int len)
返回值: string
说明：返回字符串A从start位置开始，长度为len的字符串

-- 从左往右，起始位置为3，长度为2，结果：cd
select substr('abcde',3,2);

-- 从左往右，起始位置为3，长度为2，结果：cd
hive> select substring('abcde',3,2);

-- 从右往左，起始位置为2，长度为2，结果：de
hive>select substring('abcde',-2,2);

-- 从左往右，获取年信息
hive> select substr('2019-07-02 00:00:00.0',0,4);

hive的hex函数 hive 函数大全_hive_05

字符串转大写函数：upper,ucase

语法: upper(string A) ucase(string A)
返回值: string
说明：返回字符串A的大写格式

-- 变大写，结果：ABSED
select upper('abSEd');

-- 变大写，结果：ABSED
select ucase('abSEd');

字符串转小写函数：lower,lcase

语法: lower(string A) lcase(string A)
返回值: string
说明：返回字符串A的小写格式

-- 变小写，结果：absed
select lower('abSEd');

-- 变小写，结果：absed
select lcase('abSEd');

去空格函数：trim

语法: trim(string A)
返回值: string
说明:去除字符串两边的空格

-- 去除空格，结果：abc
select trim(' abc ');

左边去空格函数：ltrim

语法: ltrim(string A)
返回值: string
说明：去除字符串左边的空格

-- 去除左边空格，结果：'abc  '
select ltrim(' abc ');

右边去空格函数：rtrim

语法: rtrim(string A)
返回值: string
说明：去除字符串右边的空格

-- 去除右边空格，结果：'  abc'
select rtrim(' abc ');

正则表达式替换函数：regexp_replace

语法: regexp_replace(string A, string B, string C)
返回值: string
说明：将字符串A中的符合java正则表达式B的部分替换为C。注意，在有些情况下要使用转义字符,类似oracle中的regexp_replace函数

-- 正则替换，结果：fb
select regexp_replace('foobar', 'oo|ar', '');

hive的hex函数 hive 函数大全_大数据_06

-- 把日期字符串中的分隔符进行替换
select regexp_replace('2017/1/21','/','-');

hive的hex函数 hive 函数大全_Hive_07

URL解析函数：parse_url

语法: parse_url(string urlString, string partToExtract [, stringkeyToExtract])
返回值: string
说明：返回URL中指定的部分。partToExtract的有效值为：HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

-- 获取 url 中的 host，结果：facebook.com
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST');

-- 获取 url 中的 path，结果：/path1/p.php
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'PATH');

-- 获取 url 中的 参数，结果：v1
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY','k1');

分割字符串函数: split

语法: split(string str, stringpat)
返回值: array
说明:按照pat字符串分割str，会返回分割后的字符串数组

-- 字符串按照 't' 分割，结果：["ab","cd","ef"]
select split('abtcdtef','t');

字符串是否包含指定字符串： locate 函数

语法：locate(substr, str[, pos])
返回值：int
说明：如果返回值大于 0 ，说明字符串包含

-- 'foobarbar' 是否包含 'bar'
SELECT if(locate('bar', 'foobarbar')>0,'Y','N');

hive的hex函数 hive 函数大全_字符串_08

字符串是否包含指定字符串： instr 函数

-- 'select@@'' 是否包含 '@'
select instr('select@@','@');

-- '@select@'' 是否包含 '@'，如果包含，值大于1，值为1表示包含，值为0表示不包含
select if(instr('select@@','@')>0,1,0);

hive的hex函数 hive 函数大全_Hive_09

字符串填充

语法：locate(substr, str[, pos])
返回值：int
说明：如果返回值大于 0 ，说明字符串包含

-- 字符串一共有 4 位，在 'b' 的左侧填充 'a'，直至到达 4 位
select lpad('b',4,'a');

-- 字符串一共有 4 位，在 'b' 的右侧填充 'a'，直至到达 4 位
select rpad('b',4,'a');

hive的hex函数 hive 函数大全_字符串_10

Hive 日期函数

获取当前日期

> select current_date;
> 2021-09-04

日期格式化

注意：该函数返回的是字符串，想要截取字符串，必须要符合这种 yyyy-MM-dd HH:mm:ss 这个格式

语法：date_format()
返回值：格式化后的日期

-- 获取格式化后的日期，只需要到具体的月
select date_format('2017-1-21','yyyy-MM');

-- 获取格式化后的日期，具体到分钟
select date_format('2020-10-20 12:20:11','yyyy-MM-dd HH:mm')

hive的hex函数 hive 函数大全_字符串_11

hive的hex函数 hive 函数大全_unix_12

获取当前UNIX时间戳函数:unix_timestamp

语法: unix_timestamp()
返回值: bigint
说明:获得当前时区的UNIX时间戳

-- 获取当前的时间戳，结果：1323309615
select unix_timestamp();

-- 获取指定日期的时间戳，日期的格式必须为'yyyy-MM-dd HH:mm:ss'，不符合该格式则返回 null
select unix_timestamp('2019-08-15 16:40:00');

-- 获取指定日期的时间戳，指定日期的格式,不符合格式则返回 null
select unix_timestamp('2019-08-15','yyyy-MM-dd');
select unix_timestamp('2019-08-15 16:40:00','yyyy-MM-dd HH:mm:ss');

hive的hex函数 hive 函数大全_unix_13

hive的hex函数 hive 函数大全_字符串_14

UNIX时间戳转日期函数:from_unixtime

语法: from_unixtime(bigint unixtime[, string format])
返回值: string
说明:转化UNIX时间戳（从1970-01-01 00:00:00 UTC到指定时间的秒数）到当前时区的时间格式

-- 按照指定格式格式化时间戳，结果：20111208
select from_unixtime(1323308943,'yyyyMMdd');

日期转UNIX时间戳函数:unix_timestamp

语法: unix_timestamp(string date)
返回值: bigint
说明:转换格式为"yyyy-MM-ddHH:mm:ss"的日期到UNIX时间戳。如果转化失败，则返回0。

-- 日期转换为时间戳，结果：1323234063
select unix_timestamp('2011-12-07 13:01:03');

指定格式日期转UNIX时间戳函数:unix_timestamp

语法: unix_timestamp(string date, string pattern)
返回值: bigint
说明:转换pattern格式的日期到UNIX时间戳。如果转化失败，则返回0。

-- 获取时间戳，结果：1323234063
select unix_timestamp('20111207 13:01:03','yyyyMMddHH:mm:ss');

日期格式转换：unix_timestamp + from_unixtime

-- 将 yyyyMMdd 格式 转换为 yyyy-MM-dd
select from_unixtime( unix_timestamp('20211231','yyyyMMdd'),'yyyy-MM-dd');

hive的hex函数 hive 函数大全_unix_15

日期时间转日期函数:to_date

语法: to_date(string timestamp)
返回值: string
说明:返回日期时间字段中的日期部分

-- 获取date，结果：2011-12-08
select to_date('2011-12-08 10:03:01');

日期转年函数: year

语法: year(string date)
返回值: int
说明:返回日期中的年

-- 获取年，结果：2011
select year('2011-12-08 10:03:01');

-- 获取年，结果：2012
hive> select year('2012-12-08');

日期转季度函数

-- 获取季度，结果：3
select quarter('2021-07-11');  

-- 获取季度
select quarter('2019-07-02 00:00:00.0');

hive的hex函数 hive 函数大全_unix_16

日期转月函数: month

语法: month (string date)
返回值: int
说明:返回日期中的月份

-- 获取月，结果：12
select month('2011-12-08 10:03:01');

-- 获取月，结果：8
select month('2011-08-08');

日期转天函数: day

同样的，还有 hour,minute,second函数，分别是获取小时，分钟和秒，使用方式和以上类似，这里就不再讲述

语法: day (string date)
返回值: int
说明:返回日期中的天。

-- 获取天，结果：8
select day('2011-12-08 10:03:01');

-- 获取天，结果：24
select day('2011-12-24');

日期转周函数:weekofyear

语法: weekofyear (string date)
返回值: int
说明:返回日期在当前的周数

-- 获取周数，结果：49
select weekofyear('2011-12-08 10:03:01');

日期比较函数: datediff

语法: datediff(string enddate, string startdate)
返回值: int
说明:返回结束日期减去开始日期的天数。

-- 获取日期间的差值，结果：213
select datediff('2012-12-08','2012-05-09');

日期增加函数: date_add

语法: date_add(string startdate, int days)
返回值: string
说明:返回开始日期startdate增加days天后的日期

-- 获取增加10天后的日期，结果：2012-12-18
select date_add('2012-12-08',10);

日期减少函数: date_sub

注意：使用率比较高，返回的是 String 类型

语法: date_sub (string startdate, int days)
返回值: string
说明:返回开始日期startdate减少days天后的日期。

-- 获取减少10天后的日期，结果：2012-11-28
select date_sub('2012-12-08',10);

months_between 函数

案例一：求两个日期间隔了多少个月

> SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
 3.94959677
 
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false);
 3.9495967741935485

案例二：两个日期间隔了几年

-- 注意，大的日期在前，小的日期在后
> SELECT floor(months_between('2018-09-30' , '2008-08-08') / 12);
> 10

add_months 函数

案例一：根据当前日期获取下一个月

> SELECT add_months('2016-08-31', 1);
 2016-09-30

案例二：根据当前日期获取上一个月

> SELECT add_months('2016-08-31', -1);
 2016-07-31

Hive 条件函数

if函数: if

语法: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
返回值: T
说明: 当条件testCondition为TRUE时，返回valueTrue；否则返回valueFalseOrNull

-- 结果：200
select if(1=2,100,200) ;

-- 结果：100
select if(1=1,100,200) ;

条件判断函数：CASE

语法: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
返回值: T
说明：如果a等于b，那么返回c；如果a等于d，那么返回e；否则返回f

-- 结果：mary
select case 100 when 50 then 'tom' when 100 then 'mary'else 'tim' end ;

-- 结果：tim
select case 200 when 50 then 'tom' when 100 then 'mary'else 'tim' end ;

hive的hex函数 hive 函数大全_unix_17

条件判断函数：CASE

语法: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
返回值: T
说明：如果a为TRUE,则返回b；如果c为TRUE，则返回d；否则返回e

-- 结果：mary
select case when 1=2 then 'tom' when 2=2 then 'mary' else'tim' end ;

-- 结果：tom
select case when 1=1 then 'tom' when 2=2 then 'mary' else'tim' end ;

返回一个非 NULL 的函数

-- 格式
coalesce(expr1,expr2...)

-- 示例，结果:0
select coalesce(NULL,1);

hive的hex函数 hive 函数大全_hive_18

Hive 转换函数

cast()函数

公式：cast(表达式 as 数据类型)
返回值：转换成指定的数据类型
说明：cast函数，可以将"20190607"这样类型的时间数据转化成int类型数据。

-- 转换成日期类型，结果
select cast('2017-06-12' as date) filed;

hive的hex函数 hive 函数大全_unix_19

Hive 行转列

行转列说明

① 行转列是指多行数据转换为一个列的字段。

② Hive 行转列用到的函数

--字段或字符串拼接
concat(str1,str2,...)  

--以分隔符拼接每个字符串
concat_ws(sep, str1,str2)

--将某字段的值进行去重汇总，产生array类型字段
collect_set(col)

行转列测试数据

-- deptno(部门号)，ename(员工姓名)
20,SMITH   
30,ALLEN   
30,WARD    
20,JONES   
30,MARTIN  
30,BLAKE   
10,CLARK   
20,SCOTT   
10,KING    
30,TURNER  
20,ADAMS   
30,JAMES   
20,FORD    
10,MILLER

建表语句

create table emp(
deptno int,
ename string
) row format delimited fields terminated by ',';

插入数据

load data local inpath "/opt/software/emp.txt" into table emp;

开始转换

-- 函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生array类型字段
select deptno,concat_ws("|",collect_set(ename)) as ems from emp group by deptno;

hive的hex函数 hive 函数大全_字符串_20

Presto 分组函数（GROUPING SETS、CUBE、ROLLUP）

获取多个分组的统计合并数据

弊端：该查询为多个查询的合集，表的数据被查询 3 次，且合并后的数据显示的有些奇怪

-- 分别按照店铺分组、订单组分组、店铺和订单组分组统计订单销售额，并获取三者的结果集
-- 可以通过 union all 多个 group by 来实现
select store_id,
       null,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by store_id

union all

select null,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by group_id

union all

select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by store_id, group_id;

hive的hex函数 hive 函数大全_大数据_21

GROUPING SETS

可以实现多个分组结果的合并，且使用 grouping sets、cube、rollup 这些复杂 group 语法的 SQL 只会读取基础数据源一次

select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by
    grouping sets (store_id, group_id, (store_id, group_id));

hive的hex函数 hive 函数大全_大数据_22

CUBE

CUBE 操作会生成所提供 column 所有可能的 grouping sets 结果

select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by cube(store_id, group_id);

-- 等价
select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by
    grouping sets ((store_id, group_id), store_id, group_id, ());

ROLLUP

rollup 操作比较特殊，示例如下：

select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by rollup(store_id, group_id);

-- 等价
select store_id,
       group_id,
       sum(order_amount)
from yp_dwb.dwb_order_detail
group by
    grouping sets ((store_id, group_id), (store_id), ());

GROUPING

grouping 返回一个转换为十进制数字的二进制数值集合，指示分组中存在哪些列

它必须与GROUPING SETS，ROLLUP，CUBE或GROUP BY结合使用，并且其参数必须与相应的GROUPING SETS，ROLLUP，CUBE或GROUP BY子句中引用的列完全匹配。

对于给定的分组，如果分组中包含相应的列，则将位设置为0，否则将其设置为1

select group_id,
       store_id,
       sum(goods_amount),
       grouping (group_id)          as ging,
       grouping(store_id)           as sing,
       grouping(group_id, store_id) as g_s_ing
from yp_dwb.dwb_order_detail
group by
    grouping sets (group_id, store_id, (group_id, store_id));

hive的hex函数 hive 函数大全_字符串_23

可以使用grouping操作来判断当前数据是按照哪个字段来分组的

# grouping(group_id)
grouping(group_id)列为0时，可以看到group_id列都是有值的，为1时则相反，证明当前行是按照group_id来进行分组统计的

# grouping(store_id)
grouping(store_id)同理，为0时store_id列有值，为1时store_id为空，证明当前行时按照store_id来进行分组统计的

# grouping(group_id, store_id)
grouping(group_id, store_id)是grouping(group_id)、grouping(store_id)二进制数值组合后转换得到的数字：
a. 按照group_id分组，则group_id=0，store_id=1，组合后为01，二进制转换为十进制得到数字1；
b. 按照store_id分组，则group_id=1，store_id=0，组合后为10，二进制转换为十进制得到数字2；
c. 同时按照group_id和store_id分组，则group_id=0，store_id=0，组合后为00，二进制转换为十进制得到数字0

Hive 数列操作函数

用于操作 Hive 中的复杂类型，例如 Array 和 Map

array_repeat 函数

以数组的形式，返回生成指定个数元素的数组

示例一：官方示例

-- 生成一个数组，数组的元素都为1，长度为10
> select array_repeat(1, 10);
> [1,1,1,1,1,1,1,1,1,1]

-- 生成一个数组，数组的元素都为 'abc' ，长度为 3 
spark-sql> select array_repeat('abc',3);
["abc","abc","abc"]

示例二：explode 和 array_repeat 函数组合，实现行转列

-- 生成一个列，共有 3 行 
spark-sql> select explode(array_repeat('abc',3));
abc
abc
abc
Time taken: 0.092 seconds, Fetched 3 row(s)

示例三：生成一个 id 递增的表

-- 需求：列名为 id ，共有 10 行，每行的值从 0 开始递增，直到第 10 行的 10
spark-sql> select row_number() over (order by id) id from (select explode(array_repeat(1,10)) id );
1
2
3
4
5
6
7
8
9
10
Time taken: 0.553 seconds, Fetched 10 row(s)

element_at 函数

根据下标在 数组 或 map 中获取对应下标的元素

案例一：官方示例

> SELECT element_at(array(1, 2, 3), 2);
 2
 
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
 b

案例二：

有一张表 ratio_A 和一张 ratio_B 表，我们需要 ratio_B 表关联这 ratio_A 表的 ppp 字段， ratio_A 表如下

hive的hex函数 hive 函数大全_unix_24

当 policy 字段的值为 1 时，我们需要使用字段 r1 的值
当 policy 字段的值为 2 时，我们需要使用字段 r2 的值
当 policy 字段的值为 3 时，我们需要使用字段 r3 的值
当 policy 字段的值为 4 时，我们需要使用字段 r4 的值
当 policy 字段的值为 5 时，我们需要使用字段 r5 的值
当 policy 字段的值  >6 时，我们需要使用字段 r6_ 的值

select a.*,
       element_at(array(r1,r2,r3,r4,r5,r6_), least(b.policy_year,6))*ppp_  expense,
from ratio_A a
join ratio_B b on a.ppp = b.ppp

Hive 表生成函数

表生成函数介绍
所有的表生成函数，包括自定义的和内置的，都被称为用户自定义表生成函数 (UDTF) 表生成函数，接受零个或多个输入，然后产生多列或多行输出

stack 函数

该函数可以快速生成小表，可以将一些经常用到的数据快速生成为一张轻量级的小表

示例一：官方示例

-- 生成一张表，表的数据只有 2 行，剩下的数据分成两行
spark-sql> SELECT stack(2, 1, 2, 3);
1       2
3       NULL
Time taken: 0.093 seconds, Fetched 2 row(s)

示例二：快速生成男女性别表

-- 生成一张表，表的数据只有 2 行，列名为 gender，两行数据分别为 'MALE', 'FMALE'
spark-sql> SELECT stack(2, 'MALE', 'FMALE') as gender;
MALE
FMALE
Time taken: 0.093 seconds, Fetched 2 row(s)

## explode 函数介绍

```sql
-- 将hive一列中复杂的array或者map结构拆分成多行。
explode(col)

-- 列表中的每个元素生成一行
explode(ARRAY)

-- map中每个key-value对，生成一行，key为一列，value为一列
explode(MAP)

数据准备

10,CLARK|KING|MILLER
20,SMITH|JONES|SCOTT|ADAMS|FORD
30,ALLEN|WARD|MARTIN|BLAKE|TURNER|JAMES

建表语句

create table emp1(
deptno int,
names array<string>
)
row format delimited fields terminated by ','
collection items terminated by '|';

插入数据

load data local inpath "/opt/software/emp1.txt" into table emp1;

数据查询

select * from emp1;

-- 将所有的员工姓名展示为行
select explode(names) as name from emp1;

hive的hex函数 hive 函数大全_unix_25

hive的hex函数 hive 函数大全_hive_26

LATERAL VIEW侧视图

LATERAL VIEW 函数介绍

-- 用法
 LATERAL VIEW udtf(expression) tableAlias AS columnAlias
 -- 解释
 用于和split, explode等UDTF一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合

列转行

select deptno,name from emp lateral view explode(names) tmp_tb as name;

hive的hex函数 hive 函数大全_unix_27

Reflect函数

Reflect 函数介绍

该函数可以支持在 sql 中调用 java 中的自带函数，即：UDF （User-Defined Function）用户自定义函数

案例一：使用 java.lang.Math 当中的Max求两列中最大值

① 建表语句

-- 创建hive表
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';

② 准备数据

-- 准备数据 test_udf.txt
1,2
4,3
6,4
7,5
5,6

③ 加载数据

-- 加载数据
load data local inpath '/root/hivedata/test_udf.txt'  into table test_udf;

④ 获取结果

-- 使用java.lang.Math当中的Max求两列当中的最大值
select reflect("java.lang.Math","max",col1,col2) from test_udf;

hive的hex函数 hive 函数大全_字符串_28

案例二：不同记录执行不同的 java 内置函数

① 建表语句

--创建hive表
create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

② 准备数据

-- 准备数据 test_udf2.txt
java.lang.Math,min,1,2
java.lang.Math,max,2,3

③ 加载数据

-- 加载数据
load data local inpath '/opt/software/test_udf2.txt' into table test_udf2;

④ 获取结果

-- 执行查询
select reflect(class_name,method_name,col1,col2) from test_udf2;

hive的hex函数 hive 函数大全_大数据_29

Hive 开窗函数

窗口函数(1) NTILE,ROW_NUMBER,RANK,DENSE_RANK

函数的区别

函数	区别
NTILE
ROW_NUMBER	同一个分区，如果数据的排序相同，都会保留，数据依次递增/递减
RANK	同一个分区，如果数据的排序相同，都会保留，但是会留下空位
DENSE_RANK	同一个分区，如果数据的排序相同，都会保留，但是不会留下空位

数据准备

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7

建表语句

CREATE TABLE net_his (
cookieid string,
createtime string,   --day 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
stored as textfile;

加载数据

-- 加载数据：
load data local inpath '/opt/software/net_his.dat' into table net_his;

ROW_NUMBER 函数

根据窗口分区中行的顺序，为每一行返回一个唯一的顺序号（从1开始），它不能用做于学生成绩的排名等需求，一般多用于分页查询，比如求 TopN

根据 ROW_NUMBER 函数的特点，我们可以实现去重功能，如果 PARTITION BY 的字段内容相同，那 rn 的值就会开始排列，此时我们直接取 rn=1 就可以

-- 按照 cookied 分组，然后对分组后的数据各自按照 pv 进行排序，排序后的名次即为 rn
SELECT 
  cookieid,
  createtime,
  pv,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn 
FROM net_his;

-- 不分组，只排序，并显示出排序后的序号
SELECT 
  cookieid,
  createtime,
  pv,
  ROW_NUMBER() OVER(ORDER BY pv desc) AS rn 
FROM net_his;

-- 获取每个店铺订单金额 TOP10 的订单
-- 按照 store_id 分组，并对分组后的数据各自降序排序
select *
from (
         select order_amount, store_id,
                ROW_NUMBER() OVER(partition by store_id order by order_amount desc) rn
         from yp_dwb.dwb_order_detail
     )
where rn <=10;

hive的hex函数 hive 函数大全_字符串_30

RANK 和 DENSE_RANK 函数

SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM net_his 
WHERE cookieid = 'cookie1';

hive的hex函数 hive 函数大全_hive_31

Hive分析窗口函数(2) SUM,AVG,MIN,MAX

数据准备

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4

建表语句

--建表语句:
create table net_his1(
cookieid string,
createtime string,   --day 
pv int
) row format delimited 
fields terminated by ',';

加载数据

--加载数据：
load data local inpath '/opt/software/net_his1.dat' into table net_his1;

开启智能本地模式

-- 本地模式：对数据量比较小的操作，就可以在本地执行，这样要比提交任务到集群执行效率要快很多
SET hive.exec.mode.local.auto=true;

SUM（结果和ORDER BY相关,默认为升序）

-- 如果不指定rows between,默认为从起点到当前行;
-- 如果不指定order by，则将分组内所有值累加;
-- 关键是理解rows between含义,也叫做window子句：
  -- preceding：往前
  -- following：往后
  -- current row：当前行
  -- unbounded：起点
  -- unbounded preceding 表示从前面的起点
  -- unbounded following：表示到后面的终点

--pv1: 分组内从起点到当前行的pv累积，如，11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1 
from net_his1;

--pv2: 同pv1
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from net_his1;

hive的hex函数 hive 函数大全_字符串_32

--pv3: 分组内(cookie1)所有的pv累加
--如果每天 order by 排序语句，默认把分组内的所有数据进行sum操作
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from net_his1;

hive的hex函数 hive 函数大全_Hive_33

--pv4: 分组内当前行+往前3行，如，11号=10号+11号， 12号=10号+11号+12号，
	                       13号=10号+11号+12号+13号， 14号=11号+12号+13号+14号
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from net_his1;

hive的hex函数 hive 函数大全_unix_34

--pv5: 分组内当前行+往前3行+往后1行，如，14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from net_his1;

hive的hex函数 hive 函数大全_unix_35

--pv6: 分组内当前行+往后所有行，如，13号=13号+14号+15号+16号=3+2+4+4=13，
							 14号=14号+15号+16号=2+4+4=10
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from net_his1;

hive的hex函数 hive 函数大全_Hive_36

AVG，MIN，MAX 函数，和 SUM 用法相同

-- 计算到当前天的平均 pv 量
select cookieid,createtime,pv,
avg(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from net_his1;

hive的hex函数 hive 函数大全_大数据_37

-- 计算到当前天的最大 pv 量
select cookieid,createtime,pv,
max(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from net_his1;

hive的hex函数 hive 函数大全_unix_38

-- 计算到当前天的最小 pv 量
select cookieid,createtime,pv,
min(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from net_his1;

hive的hex函数 hive 函数大全_Hive_39

Hive分析窗口函数(3) LAG,LEAD,FIRST_VALUE,LAST_VALUE

注意：如果不指定ORDER BY，则进行排序混乱，会出现错误的结果

数据准备

① 准备数据

cookie1,2018-04-10 10:00:02,url2
cookie1,2018-04-10 10:00:00,url1
cookie1,2018-04-10 10:03:04,1url3
cookie1,2018-04-10 10:50:05,url6
cookie1,2018-04-10 11:00:00,url7
cookie1,2018-04-10 10:10:00,url4
cookie1,2018-04-10 10:50:01,url5
cookie2,2018-04-10 10:00:02,url22
cookie2,2018-04-10 10:00:00,url11
cookie2,2018-04-10 10:03:04,1url33
cookie2,2018-04-10 10:50:05,url66
cookie2,2018-04-10 11:00:00,url77
cookie2,2018-04-10 10:10:00,url44
cookie2,2018-04-10 10:50:01,url55

② 建表语句

CREATE TABLE net_his2 (
cookieid string,
createtime string,  --页面访问时间
url STRING       --被访问页面
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
stored as textfile;

③ 加载数据

--加载数据：
load data local inpath '/opt/software/net_his2.dat' into table net_his2;

LAG 函数介绍

返回本行往前数第 offset 行的 字段1 ，如果它不存在则返回指定的默认值 default

-- 语法
-- 注意：offset 默认是1，default 默认是null，这2个可以省略
lag(字段1 [, offset[, default]]) over(partition by 字段2 order by 字段3 )

使用示例

-- 创建一个临时表
> cache table lag_test select stack(5,1,2,3,4,5,6,7,8,9,10);
  1       2
  3       4
  5       6
  7       8
  9       10
-- 使用 lag 函数在计算 clo0 列的值为上一行的结果 + 1，并把它命名为新的列 col2
> select *,if(col0 > 1,lag(col0 + 1) over (order by col0),col0) col2 from lag_test;
1       2       1
3       4       2
5       6       4
7       8       6
9       10      8

LEAD 函数介绍

与 LAG 相反 LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值第一个参数为列名，第二个参数为往下第n行（可选，默认为1），第三个参数为默认值（当往下第n行为 NULL 时候，取默认值，如不指定，则为 NULL ）

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
FROM net_his2;

hive的hex函数 hive 函数大全_unix_40

FIRST_VALUE 函数

取分组内排序后，截止到当前行，第一个值

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
FROM net_his2;

hive的hex函数 hive 函数大全_hive_41

LAST_VALUE 函数

-- 取分组内排序后，截止到当前行，最后一个值
SELECT cookieid,
 createtime,
 url,
 ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
 LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
FROM net_his2;

-- 如果想要取分组内排序后最后一个值，则需要变通一下
SELECT cookieid,
  createtime,
  url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
FROM net_his2 
ORDER BY cookieid,createtime;

hive的hex函数 hive 函数大全_Hive_42

hive的hex函数 hive 函数大全_大数据_43

窗口累加：order by 的不同效果

总结：在每个分组中，使用 order by 可以出现累加的效果

-- 求 （ 当前年度和后续年度的 ci_cx_ * db1 的和 ） / dx
select 
*, 
(sum(ci_cx_*db1) over (partition by age_buy,ppp,sex order by policy_year desc))/dx pvdb1 
from rsv_src0  ;

① 有 order by 且 order by 的字段不是用于分组的字段

分组后按照年排序，例如：1，2，3，4，5，6…如果要计算每年与之后的年的累加值，使用 desc 则可以完成累加的效果，例如: …6,5,4,3,21，使用order by desc 后，正好就是我们要从当前年朝后加的效果

-- 求 （ 当前年度和后续年度的 ci_cx_ * db1 的和 ） / dx
select 
*, 
(sum(ci_cx_*db1) over (partition by age_buy,ppp,sex order by policy_year desc))/dx pvdb1 
from rsv_src0  ;

② 有 order by 且 order by 的字段是用于分组的字段

此时会将整个 partition 的数据相加

③ 有 partition by 无 order by

此时会将整个 partition 的数据相加，和 ② 中的效果相同

多个分组的数据聚合：窗口函数 + 分组

使用开窗函数对每个分组的和求和

注意：如果相对各个分组后的数据进行聚合的话，就需要使用 over() ，而且必须得对所有数据开窗

select p.insur_code,
       sum(if(c.claim_item like '%sgbxj%', 1, 0)) sg_cnt,
       sum(count(1)) over ()  all_cnt
group by p.insur_code,p.insur_name, age;

聚合函数中的判断条件写法

like 语法
我们在分组中统计某个字段的值时，需要模糊匹配

-- 将符合模糊匹配的字段的数量求和
sum(if(c.claim_item like '%sgbxj%', 1, 0))

rows between 设置窗口范围

使用介绍

可以根据自己的需求任意的控制窗口函数的范围

① UNBOUNDED ：不受控的，无限的

② PRECEDING ：在…之前

③ FOLLOWING：在…之后

-- 组合示例
-- 前面所有行
unbounded preceding
-- 后面所有行
unbounded following 
-- 当前行
current row
-- 后面n行
n following
-- 前面n行
n preceding

使用示例

-- 按照商品类型分区，按照支付费用排序，计算当前行后的(下一行)所有订单的价格，当前行除外
sum(order_pay) over (partition by type order by order_pay rows between 1 following and unbounded following )

Hive 自定义函数

自定义 UDF 说明

Hive 自带了一些函数，比如：max/min 等，但是数量有限，自己可以通过自定义 UDF 来方便的扩展，当Hive提供的内置函数无法满足你的业务处理需要时，此时就可以考虑使用用户自定义函数（UDF：user-defined function）

自定义 UDF 的类型

① UDF（User-Defined-Function）：一进一出

② UDAF（User-Defined Aggregation Function）：聚集函数，多进一出，如：count/max/min

③ UDTF（User-Defined Table-Generating Functions）：一进多出，如：lateral view explore()

自定义函数流程

① 创建 Maven 工程，导入依赖

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>

② 继承 UDF ，并重载 evaluate 方法

package org.example.hive.function.udf;

import java.util.Objects;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;

/**
 * @author Kyle White
 * @date Jun 19,2021
 */
public class HasTextFunction extends UDF {

    /**
     * 判断字符串是否为空
     *
     * @param text 字符串
     * @return 空:false 反之:true
     */
    public BooleanWritable evaluate(final Text text) {

        BooleanWritable result = new BooleanWritable(false);

        if (Objects.isNull(text) || StringUtils.isEmpty(text.toString())) {
            return result;
        }

        result.set(true);

        return result;
    }
}

③ 将项目打包为 jar ，并上传到 hive 的 lib 目录下

# 进入 lib 目录
cd /opt/software/hive-2.1.0/lib

# jar 包名字太长，我重命名下（可选操作）
mv ./hive-function-1.0.0-SNAPSHOT.jar ./hive-function.jar

④ hive 的客户端添加我们的 jar 包

add jar /opt/software/hive-2.1.0/lib/hive-function.jar;

hive的hex函数 hive 函数大全_hive_44

⑤ 设置函数与我们的自定义函数关联

create temporary function has_text as 'org.example.hive.function.udf.HasTextFunction';

hive的hex函数 hive 函数大全_hive_45

⑥ 使用自定义函数

select has_text('');

hive的hex函数 hive 函数大全_unix_46

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：switch java 字符串 switch case java 字符串

下一篇：java 获取用户关闭页面 java获取useragent

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯