Hive内部表如何删除 hive 删除外部表

转载

技术博主 2023-11-06 12:24:52

文章标签 Hive内部表如何删除 hive 数据分区表 文章分类 Hive 大数据

一、表的创建（外部表和内部表）

1、内部表与外部表的区别

外部表：hive中删除外部表时，数据不会被删除，即hive不会控制外部表数据的生命周期

内部表（管理表）：hive中删除内部表时，数据会被删除掉，即hive会控制内部表数据的生命周期。

查询表的详情

desc formatted student5;

删除表：drop table student;

对于一个公司来说，数据非常重要，原始数据会放在外部表还是管理表中？外部表

2、创建表的方式（三种）

方式一：（普通创建表）（最常用的一种）

create [external] table student(

id int,

name string

)

row format delimited fields terminated by '\t'

stored as textfile //存储类型

location '/user/zkpk/input'; //存储位置

方式二：（根据查询结果创建表）关键字：as select

create table student2 as select id,name from student;

方式三：（复制表的结构，不复制数据）关键字：like

create table student3 like student;

二、分区表

1、创建分区表,关键字：partitioned by

语法：create table student(

id int,

name string

)

partitioned by(month string)

row format delimited fields terminated by '\t';

2、分区表中导入数据

load data local inpath '/home/zkpk/datas/student.txt' into table student partition(month='202203');

3、查询分区表

select * from student where month='202203';

联合查询（查询二月份和三月份的数据）关键字：union

select * fron student where month='202202'

union select * from student where month='202203';

4、创建分区表的意义：减少查询数据，提高查询效率。

2022年3月3日内容：

一、分区表的操作

5、修改分区表（增加分区），关键字：alter...add

增加一个分区：

alter table dept_partition add partition(month='202204');

增加多个分区，分区间用空格隔开

alter table dept_partition add partition(month='202205') partition(month='202206');

6、删除分区，关键字：alter...drop

删除一个分区

alter table dept_partition drop partition(month='202206');

删除多个分区，删除的分区间用逗号隔开

alter table dept_partition drop partition(month='202204'),partition(month='202205');

7、查询分区

show partitions dept_partition;

二、创建二级分区：

create table dept_partition2(

deptno int,

dname string,

loc string

)

partitioned by(month string,day string)

row format delimited fields terminated by '\t';

二级分区表中导入数据

load data local inpath '/home/zkpk/datas/dept.txt' into table dept_partition2

partition(month='202203',day='02');

三、分区表关联数据（三种）

方式一：上传数据后修复分区 msck repair table +表名

1、创建目录

hive(default)>dfs -mkdir -p /user/hive/warehouse/student2/month=202203/day=03;

2、hive中创建表student2

hive(default)>create table student2(

id int,

name string

)

partitioned by(month string,day string)

row format delimited fields terminated by '\t';

3、上传数据

hive(default)>dfs -put /home/zkpk/datas/student.txt /user/hive/warehouse/student2/month=202203/day=03;

4、查询数据(查不到)

hive(default)>select * from student2;

5、修复分区

hive(default)>msck repair table student2;

方式二：上传数据后添加分区 alter table +表名+add partition(month string)

1、创建目录

[zkpk@master ~]$hadoop fs -mkdir -p /user/hive/warehouse/dept_partition3/month=202203/day=03

2、hive中创建表dept_partition3

hive(default)>create table dept_partition3(

deptno int,

dname string,

loc string

)

partitioned by(month string,day string)

row format delimited fields terminated by '\t';

3、上传数据

[zkpk@master ~]$hadoop fs -put /home/zkpk/datas/dept.txt /user/hive/warehouse/dept_partition3/month=202203/day=03

4、查询数据，查不到

hive(default)>select * from dept_partition3;

5、添加分区后可以查到数据

hive(default)>alter table dept_partition3 add partition(month='202203',day='03');

hive (default)> dfs -put /home/zkpk/datas/student.txt /user/zkpk/input;

[zkpk@master ~]$ hadoop fs -put /home/zkpk/datas/student.txt /user/zkpk/input

上述两条命令结果一样

方式三：通过load加载数据，不需要修复或者添加分区 load data local inpath +'本地路径'

1、创建目录

hive(default)>dfs -mkdir -p /user/hive/warehouse/dept_partition4/month=202203/day=03;

2、hive中创建表dept_partition4

hive(default)>create table dept_partition4(

deptno int,

dname string,

loc string

)

partitioned by(month string,day string)

row format delimited fields terminated by '\t';

3、通过load加载数据

hive(default)>load data local inpath '/home/zkpk/datas/dept.txt' into table dept_partition4

partition(month='202203',day='03');

4、可以直接查询数据

hive(default)>select * from dept_partition4;

四、数据的导入（五种）

1、通过load向hive表中导入数据

1)在hdfs上创建目录

dfs -mkdir -p /user/zkpk/input;

2)上传数据到hdfs

dfs -put /home/zkpk/datas/student.txt /user/zkpk/input

3)hive中创建student3

create table student3(

id int,

name string

)

row format delimited fields terminated by '\t';

4)把hdfs目录/user/zkpk/input下的数据加载到hive表中(hdfs上数据会被剪切走)

load data inpath '/user/zkpk/input' into table student3;

把本地数据加载到hive表中（本地数据复制走，不会被剪切）

load data local inpath '/home/zkpk/datas/student.txt' into table student3;

加载数据时加上overwrite表示原有hive表中的数据会被覆盖，如果不加则表示追加

load data local inpath '/home/zkpk/datas/student.txt' overwrite into table student3;

2、通过查询语句向表中国插入数据，关键字：insert（三种）

1）向表中插入一条数据

insert [overwrite] into table student3 values(1004,"lisi");(普通表)

insert [overwrite] into table student2 partition(month='202203') values(1004,"lisi");(分区表)

2）通过查询表向表中插入数据（表是提前创建好的）

create table student1(

id int,

name string

)

row format delimited fields terminated by '\t';

3)基本查询插入数据：向表中插入数据,如果加上overwrite则表示覆盖

insert into table student4 select id,name from student3;（普通表）

insert overwrite table student4 select id,name from student3;

insert into table student2 partition(month='202201') select id,name from student;(分区表)

4）多模式插入数据

from student3

insert [overwrite] into table student1 select id,name (普通表)

insert [overwrite] into table student4 select id,name;

一张表中插入数据，另一张表中是覆盖原有数据

from student3

insert [overwrite] into table student1 select id,name

insert [overwrite] overwrite table student4 select id,name;

from student

insert overwrite table student2 partition(month='202201') select id,name

insert into table student2 partition(month='202202') select id,name;(分区表)

3、查询语句中创建表并加载数据，关键字：as select

create table student4 as select id,name from student;

注意：student是分区表，但是student4是普通表

4、通过location指定加载路径（location指定的位置可以是hdfs的任意位置）

1）创建表

create table if not exists student6(

id int,

name string

)

row format delimited fields terminated by '\t'

location '/student6';

2)上传数据

dfs -put /home/zkpk/datas/student.txt /student6

3)在hive中查询

select * from student6;

5、通过import导入数据

import导入的数据必须是export导出的数据

二、导出数据（四种）

1、insert导出数据

将hive中表student4的数据导出到本地

insert overwrite local directory '/home/zkpk/datas/export/student'

select * from student4;

将导出数据格式化

insert overwrite local directory '/home/zkpk/datas/export/student'

row format delimited fields terminated by '\t'

select * from student4;

通过insert将数据导出到hdfs

insert overwrite directory '/user/zkpk/input/student4'

row format delimited fields terminated by '\t'

select * from student;

注意：1）insert既可以导入数据也可以导出数据。

2）insert导入数据可以是覆盖也可以是追加（insert overwrite或者insert into）

但是insert导出数据时只有覆盖，没有追加（insert overwrite）。

3)导出到hdfs的数据不能格式化。

2、通过hadoop命令导出数据

dfs -get /user/hive/warehouse/student /home/zkpk/datas/export/student1;

dfs -get /user/hive/warehouse/student4 /home/zkpk/datas/export/student4;

3、通过hive shell 命令导出数据

bin/hive -e 'select * from student4;' > /home/zkpk/datas/export/student2(普通表)

bin/hive -e 'select * from student;' > /home/zkpk/datas/export/student3(分区表)

4、通过export导出数据

export table student4 to '/user/zkpk/student4';(普通表)

import table student2 from '/user/zkpk/student4';

注意：1）student2 不用需要提前创建。

2）导入数据的表和导出数据的表必须一致。

import table student3 from '/student7';

FAILED: SemanticException [Error 10027]: Invalid path

注意：1)import导入的数据必须是export导出的数据，否则会报错

2)导出数据和导入数据的表必须一致

3）导出数据export...to ,导入数据import...from

export table student to '/user/zkpk/student';（分区表）

import table student3 from '/user/zkpk/student';

一、基本查询

创建emp表(比较规范的SQL语句写法)

create table if not exists emp(

empno int,

ename string,

job string,

mgr int,

hiredate string,

sal double,

comm double,

deptno int

)

row format delimited fields terminated by '\t';

导入数据：

load data local inpath '/home/zkpk/datas/emp.txt' overwrite into table emp;

1、全表查询：select * from student;

2、指定列查询：select id from student;

注意：查询的列之间用逗号隔开

3、起别名，关键字：as 或者空格

select ename as name,deptno dn from emp;

二、常用函数（计算总行数count、最大值max、最小值min、平均值avg、求和sum）

1、计算emp表总数据的总行数

select count(*) cnt from emp;

2、计算emp表中工资的最大值

select max(sal) max_sal from emp;

3、计算emp表中工资的最小值

select min(sal) min_sal from emp;

4、计算emp表中员工的平均工资

select avg(sal) avg_sal from emp;

5、计算emp表中员工工资的总和

select sum(sal) sum_sal from emp;

6、limit子句：用于返回数据的前几行

返回emp表中数据的前5行

select * from emp limit 5;

7、where语句，用于from后面，过滤掉不符合条件（过滤查询）

查询emp表中工资大于等于3000 的员工的姓名、部门和入职日期

select ename,sal,hiredate,deptno from emp where sal >= 3000;

三、比较运算符

1、查询薪水等于3000的员工的信息

select * from emp where sal = 3000;

2、查询工资在1000到3000之间的员工的信息

select * from emp where sal between 1000 and 3000;

注意：包含1000和3000的数值

3、查询绩效工资不为null的员工信息

select * from emp where comm is not null;

4、查询工资是1500和3000 的员工信息

select * from emp where sal in(1500,3000,5000);

5、查询工资不在2000到3000之间的员工信息（所有信息）(包括2000和3000这两个数)

select * from emp where sal not between 2000 and 3000;

6、like和Rlike

注意：% 代表零个或者多个字符；

_ 代表一个字符。

like和Rlike后面的正则表达式用单引号括起。

1）查询工资以2开头的员工信息

select * from emp where sal like '2%';

2)查询工资包含2的员工信息

select * from emp where sal Rlike '[2]';

3)查询工资的第二位是2的员工信息

select * from emp where sal like '_2%';

练习：

1、查询工资在1500到3000之间的员工编号、名字、入职日期和部门id

select empno,ename,sal,hiredate,deptno from emp where sal between 1500 and 3000;

2、查询工资是800、1500、3000和5000 的员工编号、名字、绩效工资和部门id

select empno,ename,sal,comm,deptno from emp where sal in (800,1500,3000,5000);

3、查询工资不在1500到3000之间的员工编号、名字、工种和部门id(不包含1500和3000这两个数据)

select empno,ename,sal,job,deptno from emp where sal not between 1500 and 3000;

注意：查询条件的列信息一定要显示出来。

四、逻辑运算符（and、or、not）

and表示两个条件并列

or表示两个条件只要其中一个满足即可

not表示否定

1、查询工资大于2000，部门是30的员工信息(两个条件必须同时成立)

select * from emp where sal > 2000 and deptno = 30;

2、查询工资大于2000或者部门是30的员工信息（两个条件成立其一即可）

select * from emp where sal > 2000 or deptno = 30;

3、查询不在部门20 和30 的员工信息

select * from emp where deptno not in (20,30);

五、分组（group by）

注意：1)group by 一般和聚合函数一起使用。

2）先分组后聚合

1、计算emp表中每个部门的平均工资

select deptno,avg(sal) avg_sal from emp group by deptno;

2、having语句

注意：1）having针对查询结果的列起作用，where是针对表中的列起作用。

2）where后面不能跟group by，而having只适用group by后面

案例：求emp表中部门平均工资大于2000的部门

分两步：1）求emp表中部门的平均工资

select deptno,avg(sal) avg_sal from emp group by deptno;

2）找出平均工资大于2000的部门

select deptno,avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

排序和join语句

一、join语句：多个表查询

注意：（1）hive支持是等值join，不支持非等值

（2）关键字：join ... on(等值条件)

案例：根据员工表和部门表中的部门ID相等，查询emp表员工编号、员工名称和dept表的部门ID。

select e.empno,e.ename,d.deptno from emp e join dept d on e.deptno=d.deptno;

练习：根据部门表dept和location表的loc相等，查询部门表的部门名称、部门的id和location表中的地名。

1）创建表：

create table location(

loc string,

lname string

)

row format delimited fields terminated by '\t';

2)导入数据

load data local inpath '/home/zkpk/datas/location.txt' into table location;

3）联合查询

select d.dname,d.deptno,l.lname from dept d join location l on d.loc=l.loc;

1、内连接：(1)关键字：join ... on

(2) 内连接也叫连接，是最早的一种连接。

（3）只显示满足连接条件的数据，容易丢失数据。

案例：根据员工表和部门表中的部门ID相等，查询emp表员工编号、员工名称和dept表的部门ID。

select e.empno,e.ename,d.deptno from emp e join dept d on e.deptno=d.deptno;

将查询结果导出到本地并格式化。

insert overwrite local directory '/home/zkpk/datas/export/result'

row format delimited fields terminated by '\t'

select e.empno,e.ename,d.deptno from emp e join dept d on e.deptno=d.deptno;

2、左外连接：

（1）关键字：left join ... on

（2）join左边的表为左表，join右边的表为右表。

（3）显示左表中的全部数据，右表显示满足条件的数据。

案例：根据员工表和部门表中的部门ID相等，使用左外连接查询emp表员工编号、员工名称和dept表的部门ID。

select e.empno,e.ename,d.deptno from emp e left join dept d on e.deptno=d.deptno;

练习：把dept作为左表，emp为右表

3、右外连接：

（1）关键字：right join...on

（2）显示右表中的所有数据，左表显示满足连接条件的数据

案例：根据员工表和部门表中的部门ID相等，使用右外连接查询emp表员工编号、员工名称和dept表的部门

select e.empno,e.ename,d.deptno from emp e right join dept d on e.deptno=d.deptno;

4、满外连接：

（1）关键字：full join... on

（2）显示两个表中的所有数据

案例：根据员工表和部门表中的部门ID相等，使用右外连接查询emp表员工编号、员工名称和dept表的部门

select e.empno,e.ename,d.deptno from emp e full join dept d on e.deptno=d.deptno;

5、多表连接：连接n个表，需要n-1个条件

案例：查询emp表中的empno、enamel和dept表中dname、deptno以及location表中的lname

select e.empno,e.ename,d.dname,d.deptno,l.lname from emp e join dept d

on e.deptno=d.deptno join location l on d.loc=l.loc;(内连接)

练习：查询emp表中的empno、enamel和dept表中dname、deptno以及location表中的lname

要求：第一个连接使用左外连接，第二连接使用右外连接。截图发钉钉群

select e.empno,e.ename,d.dname,d.deptno,l.lname from emp e

left join dept d on e.deptno=d.deptno

right join location l on d.loc=l.loc;

二、排序

1、全局排序：（1）一个MR程序

（2）order by 跟在from后面

（3）默认升序：asc ,降序：desc

（4）可以按照别名进行排序

（5）可以多个列排序，列名之间用逗号隔开即可

案例：查询员工信息按员工工资升序排序：select * from emp order by sal asc;

练习：查询员工信息按员工工资降序排序：select empno,sal from emp order by sal desc;

案例：查询员工编号、姓名，并按工资的两倍进行降序排序

select empno,ename,2*sal two_sal from emp order by two_sal desc;

查询员工编号、姓名，按照部门升序和工资降序进行排序

select empno,ename,sal,deptno from emp order by deptno desc,sal desc;

2、内部排序：sort by

（1）每个MapReduce内部有序，全局不一定有序

（2）适用于多个reduce

（3）为查看方便输出结果导出到本地

案例：查询员工编号、姓名，并按工资的降序排序，并将输出结果导出到本地进行格式化；

insert overwrite local directory '/home/zkpk/datas/export/result'

row format delimited fields terminated by '\t'

select empno,ename,sal from emp sort by sal desc;

作业：1、查询员工编号、姓名和部门ID，按照员工编号全局排序并将结果

导出到hdfs的/user/zkpk/input目录下。

2、开启4个reduce，查询员工编号、姓名和部门ID，按照员工工资的两倍内部排序，

将输出结果导出到本地目录/home/zkpk/datas/export

分桶

1.为社么对数据进行分桶

当对数据进行分区存储，但是没有达到合理的效果时，即某个分区数据过多，则需要对数据进行分桶

分区是针对数据存储目录，分桶是针对数据本身的

2、创建分桶表：关键字：clustered by

Create table if not exists stu_buck(

Id int,

name string

)

Clustered by(id) into 4 buckets //按id对数据分为4桶

Row formate delimited fields terminated by ‘\t’;

3.对于分桶表，导入数据之前需要做什么？

Reduce个数与分桶数量进行匹配

Set hive.enforce.bucketing=true;

4.导入数据，通过子查询的方式进行导入数据,关键字：cluster by

Insert into table stu_buck select * from stu cluster by(id);

5.对分桶数据进行抽样：关键字：tablesample x=2开始 y=总桶数/取桶数

Select * from stu_buck tablesample(bucket x out of y on id);

练习：创建emp表分桶表，分桶字段为deptno，分3桶，

并进行抽样查询，抽取1桶数据，从第2桶开始抽取。

1、创建分桶表

create table if not exists emp_buck(

empno int,

ename string,

job string,

mgr int,

hiredate string,

sal double,

comm double,

deptno int

)

clustered by(deptno) into 3 buckets

row format delimited fields terminated by '\t';

2、设置分桶属性：set hive.enforce.bucketing=true;

3、通过子查询的方式导入数据：

insert into table emp_buck select * from emp cluster by(deptno);

4、抽样查询：抽一桶，从第2桶开始抽样说明x=2,y=总桶数/抽取桶数=3/1=3

select * from emp_buck tablesample(bucket 2 out of 3 on deptno);

Hived的函数：hive的内置函数和hive的自定义函数

hive的函数：hive的内置函数和hive的自定义函数

一、hive内置函数

1、hive内置函数即hive系统自带的函数，分为四大类：

1）简单函数（9类）

2）聚合函数（6类）

3）集合函数（3类）

4）特殊函数（3类）

2、hive内置函数的查看和使用

1）查看hive内置函数：show functions;(216)

2)查看具体函数的用法：desc function max;

3)查看具体函数的详细用法：desc function extended lower;

二、hive的自定义函数

1、hive自定义函数即用户自己定义的函数：分为三大类

1）UDF：一进一出。比如字母的大小写转换（一对一）

2）UDAF：聚集函数，多进一出，比如求最大最小值。（多对一）

3）UDTF：一进多出，比如拆分单词，（一对多）

2、自定义UDF函数（要求熟悉自定义函数的流程）

案例：大小写字母的转换

环境准备：

3、自定函数的过程：

1）自定义的类要继承UDF

2）实现evaluate函数，业务逻辑代码写入此函数

3）代码写完保存，打jar包

4）把jar包导入Linux系统

5）在hive的命令行窗口创建函数：

a）添加jar包到class path：add jar /home/zkpk/udf.jar

b）创建临时函数或者是永久函数

create [temporary] function +自定义函数名字+as +‘jar包的全类名’；

6）使用创建的函数。

7）销毁自定义的临时函数

drop [temporary] function +函数名字；

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：怎么开启javaflightrecorder 怎么开启微信授权登录

下一篇：python 的queue完成取号叫号 python queue.put

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯