hive包含语句 hiveload语句

转载

数据挖掘者 2023-09-07 10:50:51

文章标签 hive包含语句 hive 大数据数据仓库数据 文章分类 Hive 大数据

我们在前面的文章中介绍了Hive的基础语法和高阶查询，本篇将继续带大家学习Hive的高阶语句。

装载数据：Load

我们在前面的文章中也有使用过load的案例，这里再详细给大家介绍一下。Load用于在Hive中移动数据，可以实现数据的导入，上传。
语法格式：
load data [local] inpath ‘路径’ insert [overwrite] into table 表名 [partition()]
local表示文件地址在本地，如果不加则表名是从HDFS上进行传输。overwrite表示覆盖现有数据，如果不加这是追加。
我们以下面的employee数据为例，分别是姓名、地区、个人信息、科目得分、工种

Michael|Montreal,Toronto|Male,30|DB:80|Product:DeveloperLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead Shelley|New
York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89|Sales:Lead

create table employee(
name string,
address array<string>,
personnalInfo array<string>,
technol map<string,int>,
jobs map<string,string>)
row format delimited
fields terminated by '|' 
collection items terminated by ',' 
map keys terminated by ':' 
lines terminated by '\n';

load data local inpath '/opt/employee.txt'
overwrite into table employee;

用select查看结果

hive包含语句 hiveload语句_hive包含语句

在用追加的方式：

load data local inpath '/opt/employee.txt'
into table employee;

数据在原有的基础上又增加了四行

hive包含语句 hiveload语句_数据仓库_02

load也支持分区时插入，这里就不再演示。

insert插入

当我们想实现把一个表中的数据直接插入到另一张表的时候，就可以用的本节所介绍的内容，当然，两个表对应的结构要一致。insert插入也支持overwrite和into。insert的用法还是挺广泛的，下面我们一一来介绍。

多插入

格式：
from 调用数据表的表名
insert overwrite/into [table]插入的表1 select *
insert overwrite/into [table]插入的表2 select *
…
需要注意的是，当我们使用overwrite时，table不能省略，使用into时可以省略。
我们还是以上面的数据为例。这里我直接复制了employee表，建立了两个新表firstEmp和secondEmp.然后把employee表中的数据导入到这两个表中：

from employee
insert overwrite table firstEmp select *
insert into secondEMp select *;

查询结果：

hive包含语句 hiveload语句_hive_03

插入到分区

格式：
*from 调用数据表的表名
insert overwrite/into [table]插入的表 partition(分区名 [类型])
select ,‘分区名’;
我们先对上面的数据做个分区表:

create table employee_partition(
name string,
address array<string>,
personnalInfo array<string>,
technol map<string,int>,
jobs map<string,string>)
partitioned by (year string,month int)
row format delimited
fields terminated by '|' 
collection items terminated by ',' 
map keys terminated by ':' 
lines terminated by '\n';

开启分区：

#开启动态分区，默认是false
set hive.exec.dynamic.partition=true;
#开启允许所有分区都是动态的，否则必须要有静态分区才能使用。
set hive.exec.dynamic.partition.mode=nonstrict;

把数据插入到分区：

from employee
insert overwrite table employee_partition partition(year,month)
select * ,'2020',9;

插入或导出到文件

文件插入只支持overwrite.
把表格中的数据插入到本地文件：
格式：
*from 表名 insert overwrite local directory ‘本地路径’ select ;

from employee
insert overwrite local directory '/opt/b.txt'select *;

执行完成后，会在本地创建一个文件夹，数据内容就在这个文件夹下的文件里。即使最后路径命名为.txt也不会自动创建一个文件：

hive包含语句 hiveload语句_hive包含语句_04

数据上传到HDFS：

格式：

*from 表名 insert overwrite directory ‘HDFS路径’ select ;

from employee
insert overwrite directory '/data/outer'select *;

执行结果：

hive包含语句 hiveload语句_数据_05

导出到文件：

*from 表名1 insert overwrite table 表名2 select ;

from employee
insert overwrite table newEmp select *;

插入完成后查看表信息：

hive包含语句 hiveload语句_数据_06

指定格式插入数据：

insert overwrite directory '/data/outer2'
row format delimited
fields terminated by ','
select * from employee;

执行结果：

hive包含语句 hiveload语句_hive包含语句_07

数据交换

数据交换包括Import数据导入和export数据导出，常用于数据迁移，除了数据库，可导入导出所有数据和元数据。
导出数据：
数据是导出到HDFS上

export table employee to '/data/emp';

导出数据后，会在目录下生产一个_metadata文件和data文件夹，而我们copy的数据就在data文件夹下。

hive包含语句 hiveload语句_数据仓库_08

导入数据：

导入数据就是把我们导入到HDFS上的数据导入到一张新表中，文件新表再导入前是不存在的，无法从本地导入数据到一张表中：

import table e from '/data/emp';

查询结果：

hive包含语句 hiveload语句_hive包含语句_09

hive包含语句 hiveload语句_数据_10

数据交换也可以实现分区的导入和导出，例如：

export table employee_partition partition(year='2021',month=9) to '/data/partition';

数据排序

Hive的数据排序主要有四种，order by、sort by 、distribute by和cluster by。
order by

Hive中的order by类似于标准SQL，对全局进行排序。我们以下列数据为例，这是一个employye_hr表：

McGrirl|1|945-639-8596|2011-11-24 Gabriela
Feldheim|2|706-232-4166|2017-12-16 Billy
O’Driscoll|3|660-841-7326|2017-02-17 Kevina
Rawet|4|955-643-0317|2012-01-05 Patty
Entreis|5|571-792-2285|2013-06-11 Claudetta
Sanderson|6|350-766-4559|2016-11-04 Bentley
Oddie|7|446-519-0975|2016-05-02 Theressa
Dowker|8|864-330-9976|2012-09-26 Jenica
Belcham|9|347-248-4379|2011-05-02 Reube
Preskett|10|918-740-2357|2015-03-26 Mary
Skeldon|11|361-159-8710|2016-03-09 Ethelred
Divisek|12|995-145-7392|2016-10-18 Field
McGraith|13|149-133-9607|2015-10-06 Andeee
Wiskar|14|315-207-5431|2012-05-10 Lloyd
Nayshe|15|366-495-5398|2014-06-28 Mike
Luipold|16|692-803-9373|2011-05-14 Tallie
Swaine|17|570-709-6561|2011-08-06 Worth
Ledbetter|18|905-586-2348|2012-09-25 Reine
Leyborne|19|322-644-5798|2015-01-05 Norby
Bellson|20|736-881-5785|2012-12-31 Nellie
Jewar|21|551-505-3957|2017-06-18 Hoebart
Deeth|22|780-240-0213|2011-09-19

创建表：

create  table emp_hr(
name string,
id int,
iphone string,
date string)
partitioned by (year int,month int)
row format delimited
fields terminated by '|'
lines terminated by '\n';
stored as textfile
location '/tmp/data/employee';

我们以id列做排序，order by 默认是正序排序，如果想要倒叙需要在order by id后加desc，即 order by id desc;

select * from employee_hr order by id;

排序结果：

hive包含语句 hiveload语句_数据_11

order by的特点是只使用一个Reduce执行全局数据排序，它还支持使用case when或表达式，支持按位置编号排序。

例如我们想把id为15的人排在第一位，其他不变就可以用下列代码实现：

set hive.groupby.orderby.position.alias=true;

select * from employee_hr order by case when id=15 then 1 else 0 end desc;

按照倒叙排序，把id为15的人放第一位

hive包含语句 hiveload语句_数据仓库_12

因为order by是对全局进行排序，所以它的速度较慢，在做排序前，应提前做好数据过滤。

sort by

sort by和order by最大的不同在于，它是局部排序，而且排序列必须在select 的列表中。

select name,id,iphone from employee_hr sort by id;

hive包含语句 hiveload语句_hive包含语句_13

如果select的列表中不包含id则会报错：

hive包含语句 hiveload语句_大数据_14

distribute by

distribute by类似于SQL中的group by

不过group by 可以使用聚合函数，distribute by是按照一定的分组把数据分到不同的reduce中。我们直到，当数据进入到Reduce端的时候是有顺序的，但是ditribute by本身并不会对各个Reduce的输出进行排序所以其一般和sort by结合使用，用在sort by前。

我们用下面的数据来做个演示：

id course score
2016001 PE 91
2016001 YY 60
2016001 SX 88
2016001 YW 91
2016002 SX 77
2016002 YW 33

select * from score distribute by id sort by scores ;

hive包含语句 hiveload语句_hive包含语句_15

cluster by
cluster by=distribyte by+sort by。

select * from score cluster by scores;

hive包含语句 hiveload语句_大数据_16

可以看到执行结果和用cluster by、sort by的结果一样。

注意：cluster by 不支持desc/asc，为了充分利用所有的Reducer来执行全局程序，可以先使用cluster by，然后使用order by.

区别

sort by ：局部排序，在进入Reduce前完成排序，只在单节点上排序
order by：全局排序，只有一个reduce；
distribute by:按照分组把数据分到不同的reduce中，通常使用的sort by语句前；
cluster by:除了distribute by的功能外，还加入了sort by局部排序，分组的同时排序；

聚合运算

Group by

group by用于分组，常与聚合函数一起使用，如max,min,count,sum等。如果没有指定group by子句则默认聚合整个表。除聚合函数外，所选的其他列也必须包含在group by中。同样，group by也支持case when语句和按位置编号分组。
还是以上面的数据为例：

select id,count(1) from score group by id;

以id为分组，就可以查出每个id各有多少条数据

hive包含语句 hiveload语句_大数据_17

使用case when表达式：

select case when id='2016001' then 1 else 0 end as ids,max(scores)
from score group by id;

把id为2016001的表示为1，其他的为2，求出他们各自所得分数的最大值：

hive包含语句 hiveload语句_大数据_18

Having

Having 的作用是对聚合的结果条件过滤。它和where不同。where是在语句执行完后，对最终的结果进行条件筛选。

以上面的数据为列，我们要想实现筛选出考试科目大于2门的学生id：

select id,count(1)cnt from score group by id having cnt>2;

在查询完数据后，筛选出数据条数大于2的。

hive包含语句 hiveload语句_数据仓库_19

高级聚合

Grouping sets 实现对同一数据集进行多重group by操作，其本质是多个group by 进行union all 操作：

hive包含语句 hiveload语句_数据_20

例如：

select id,course,sum(scores) from score
group by id,course
grouping sets((id,course),id);

结果为按id和科目分组，计算出各个科目的总分，再加上以id分组的总分数。

hive包含语句 hiveload语句_hive包含语句_21

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：actn架构 tnga架构

下一篇：kubeflow kubernetes 版本 kubernetes workload

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯