hive median函数会剔除空值吗 hive删除函数

转载

mob64ca140d2323 2024-06-10 12:29:43

文章标签 hive median函数会剔除空值吗 hive hadoop 数据仓库数据 文章分类 Hive 大数据

Hive

函数

概述

1，Hive作为数据分析工具，提供了丰富的函数

2，通过show functions;查看所有函数

3，Hive中函数不能单独使用，必须结合select或者其他关键字构成语句

4，Hive中还支持自定义函数，包括定义临时函数和永久函数

入门案例

--案例一：给定日期，提取年份
--使用year函数
select year('2024-04-05');

--案例二：给定日期提取年份
-- year函数在使用的时候，，要求年月日之间必须用-间隔
select year(regexp_replace('2024/04/24','/','-');

--或者
select cast(split('2024/04/24','/')[0] as int);

常用函数

nvl

1，nvl（s1，s2）：如果s1不为null就返回s1，如果s1为null就返回s2，如果都为null就返回null

2，案例

-- 原始数据
1 Billy 800
2 Peter 1000
3 Balley
4 Danny 600
5 Vincent 500
6 David 400
7 Thomas
8 Tony 800
9 Jane 500
10 John 600

-- 建表
create table rewards (
    id     int,
    name   string,
    reward double
) row format delimited
    fields terminated by ' ';

-- 加载数据
load data local inpath '/opt/hive_demo/rewards' into table rewards;

-- 查询数据
select * from rewards tablesample (5 rows);

-- 计算平均每一个人的奖金是多少
-- avg是聚合函数，聚合函数在计算的时候会自动的跳过null值
select avg(reward) from rewards;

-- 虽然有2人没有发到奖金，也应该属于被平均的基数
select avg(if(reward is not null, reward, 0)) from rewards;

-- 或者
select avg(nvl(reward, 0)) from rewards;

case when

1，case when类似Java中的switch case结构，用于对数据进行分支或者判断

2，案例

-- 原始数据
1 财务部 bill 男
2 技术部 lily 女
3 技术部 gate 男
4 财务部 lucy 女
5 财务部 hack 男
6 财务部 john 男
7 技术部 alex 男
8 技术部 jane 女
9 技术部 jack 男
10 技术部 mark 男
11 财务部 mike 男
12 技术部 tony 男
-- 建表
create table employers
(
    id         int,
    department string,
    name       string,
    gender     string
) row format delimited
    fields terminated by ' ';

-- 加载数据
load data local inpath '/opt/hive_demo/employers' into table employers;

-- 查询数据
select *
from employers tablesample (3 rows);

-- 需求：统计每一个部门中男生和女生的人数
-- 统计结果不好看，当部门比较多的时候，这个结构看起来就会比较混乱
select department, gender, count(*)
from employers
group by department, gender;

-- sum-if方式
select department                   as `部门`,
       sum(if(gender = '男', 1, 0)) as `男生人数`,
       sum(if(gender = '女', 1, 0)) as `女生人数`
from employers
group by department;

-- if可以用case-when代替
select department                                   as `部门`,
       sum(case gender when '男' then 1 else 0 end) as `男生人数`,
       sum(case gender when '女' then 1 else 0 end) as `女生人数`
from employers
group by department;

exploded

1，explode函数，在使用时，需要传入一个数组或者映射

如果传入的时数组，那么会将数组中的每个元素拆分出来，形成单独一行数据

如果传入的时映射，那么会将映射中的键值拆分成两列数据

2，案例

-- 统计每一个单词出现的次数
-- 建表
create table words (
    line array<string>
) row format delimited
    collection items terminated by ' ';

-- 加载数据
load data local inpath '/opt/hive_demo/words.txt' into table words;

-- 查询数据
select * from words;

-- 先将数组中的每一个元素拆分出来，形成单独的一行
select explode(line) as w from words;

-- 统计xxx出现的次数，语法结构：
-- select xxx, count(xxx) from tableName group by xxx;

-- 统计每一个单词出现的次数，在上一次拆分的基础上来统计
-- 将explode(line)的结果起别名为w
-- 将子查询的结果放入临时表ws中
select w, count(w)
from (
  select explode(line) as w from words
) ws group by w;

列转行

1，将一列数据进行拆分，拆分成多行数据，必然使用explode

2，案例

-- 原始数据
哈尔的移动城堡	爱情/动画/奇幻/冒险
你想活出怎样的人生	动画/奇幻/冒险
大“反”派	剧情/喜剧
沙丘2	剧情/动作/科幻/冒险
白日之下	剧情/犯罪
破墓	悬疑/惊悚/恐怖
第二十条	剧情/喜剧/家庭
-- 建表
create table movies (
    name  string,
    kinds array<string>
) row format delimited
    fields terminated by '\t'
    collection items terminated by '/';

-- 加载数据
load data local inpath '/opt/hive_demo/movies' into table movies;

-- 查询数据
select * from movies;

-- 需求：查询所有的喜剧片
-- 语法： lateral view functionName(expression) tableAlias as columnAlias
-- 将一列数据拆分成多行，这个过程称之为"炸列"
select name, k
from movies lateral view explode(kinds) ks as k
where k = '喜剧';

3，案例

-- 原始数据
bill 开朗,活泼 打篮球,打游戏
lucy 幽默,开朗 打游戏,看电影,听音乐
peter 活泼,外向 打篮球,看电影,旅游
david 活泼,幽默 打游戏,听音乐
-- 建表
drop table if exists students;
create table students (
    name       string,
    characters array<string>,
    hobbies    array<string>
) row format delimited
    fields terminated by ' '
    collection items terminated by ',';

-- 加载数据
load data local inpath '/opt/hive_demo/students' into table students;

-- 查询数据
select * from students;

-- 查询性格活泼，喜欢打游戏的学生
select name, c, h
from students
         lateral view explode(characters) cs as c
         lateral view explode(hobbies) hs as h
where c = '活泼' and h = '打游戏';

行转列

1，将多行数据汇聚到同一列上，一般使用collect_list 或者collect_set，这俩函数将数据聚合成一个数组返回。不同的是，collect_list允许元素重复，collect_set不允许元素重复；如果需要将数据聚合成struct结构，那么需要使用named_struct函数

2，案例

-- 原始数据
1 1 burt
1 1 carl
1 1 perl
1 2 cindy
1 2 kathy
1 2 david
1 3 grace
1 3 henry
1 3 wendy
2 1 evan
2 1 john
2 1 jack
2 2 mark
2 2 mike
2 2 lucy
-- 建表
drop table if exists students;
create table students (
    grade int,
    class int,
    name  string
) row format delimited
    fields terminated by ' ';

-- 加载数据
load data local inpath '/opt/hive_demo/students' into table students;

-- 查询数据
select * from students;

-- 需求：将同年级同班级的学生放到一起
select grade, class, collect_list(name)
from students
group by grade, class;

-- 元素拼接
select grade                              as `年级`,
       class                              as `班级`,
       concat_ws(',', collect_list(name)) as `学生`
from students
group by grade, class;

窗口函数

概述

1，窗口函数又称开窗函数，用于限制数据处理范围

2，基本语法结构

分析函数 over（partition by 字段 order by 字段 [desc/asc] rows between 起始范围 and 结束范围）

over表示使用窗口来进行限制
partition by对数据进行分类
order by对数据进行排序
rows between 起始范围 and 结束范围：指定数据的处理范围

关键字	解释
preceding	往前
following	往后
unbounded	无边界
current row	当前行

3，分析函数：用于对每一行数据进行处理分析的函数，可以分为三类：

聚合函数，例如min、max、sum、avg等
移位函数，包含lag、lead和ntil

lag(col, n)：表示基于这一列的当前行，处理第前n行的数据
lead(col, n)：表示基于这一列的当前行，处理第后n行的数据
ntil(n)：要求将数据排序之后，平均放入n个桶中

排序函数，包含row_number、rank和dense_rank

案例

--原始数据
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

-- 建表
drop table if exists orders;
create table orders (
    name       string,
    order_date string,
    cost       double
) row format delimited
    fields terminated by ',';

-- 加载数据
load data local inpath '/opt/hive_demo/orders' into table orders;

-- 查询数据
select * from orders;

--需求一 ：查询每一位顾客的消费明细以及到消费日期为止的总计消费金额
-- 思路：
-- 1. 要获取每一位顾客的明细，所以需要根据顾客来分组，每一位顾客就是一组数据
-- 2. 需要获取到消费日期为止的金额，那么需要根据日期来排序
-- 3. 到消费日期为止，那么就意味着，是从开始消费，到当前处理行的数据求和
select *,
       sum(cost) over (partition by name order by order_date rows between unbounded preceding and current row ) as total
from orders;

--需求二 ：查询每一位顾客的消费明细，以及上一次的购买时间
-- 思路：
-- 1. 要获取每一位顾客的信息，所以需要根据顾客分组
-- 2. 要获取顾客上一次的购买日期，那么就需要先将日期来排序
-- 3. 上一次的购买时间，那么需要获取的就是当前日期的上一行数据
select *,
       lag(order_date, 1) over (partition by name order by order_date) as last_order_date
from orders;

--需求三 ：获取最早进店消费的前20%的顾客名单
-- 思路：
-- 1. 按照日期来进行排序
-- 2. 20% = 1/5，所以需要将数据分成5给个桶
-- 注意：窗口函数的结果不能直接用where过滤
select name
from (
  select *,
         ntile(5) over (order by order_date) as n
  from orders
) t where n = 1;

--需求四 ：获取连续进店消费三天的顾客名单
--思路：
-- 1. 先将每一位顾客的消费明细分开 - 按照顾客名单进行分类
-- 2. 需要按照消费日期，对每一位顾客的消费明细进行排序
-- 3. 如果产生了连续的三天消费，那么意味着消费日期之差一定是2
select name
from (
    select name, order_date,
           lag(order_date, 2) over (partition by name order by order_date) as last_date
    from orders
) t where datediff(order_date, last_date) = 2;

排序函数

row_number()：无论数据是否相同，都会顺次排序
rank()：如果数据相同，则排序相同，但是会产生空位
dense_rank()：如果数据相同，则排序相同，但是不会产生空位

案例

--原始数据
alex chinese 75
bob chinese 76
cindy chinese 75
david chinese 85
eden chinese 85
grace chinese 84
alex math 75
bob math 76
cindy math 75
david math 85
eden math 85
grace math 84
alex english 75
bob english 76
cindy english 75
david english 85
eden english 85
grace english 84

-- 建表
drop table if exists scores;
create table scores (
    name    string,
    subject string,
    score   int
) row format delimited
    fields terminated by ' ';

-- 加载数据
load data local inpath '/opt/hive_demo/scores' into table scores;

-- 查询数据
select * from scores;

--需求一：排序
-- 排序
select *,
       row_number() over (partition by subject order by score desc)   as `row_number`,
       rank() over (partition by subject order by score desc)         as `rank`,
       dense_rank() over ( partition by subject order by score desc ) as `dense_rank`
from scores;

--需求二 ：获取各科前三名
select *
from (
   select *, rank() over (partition by subject order by score desc) as n 
   from scores
) t where n <= 3;

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。