Hive partition queries: querying data in Hive partitioned tables

Reposted

charlesc 2023-07-12 10:58:51

Partitioned tables

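Hive stores its data on HDFS. When Hive creates a table, it actually creates a directory on HDFS, and a query reads every file under that directory. With very large data sets such a full scan is slow. To speed up lookups, the data can be partitioned or bucketed: rows are stored separately by partition, and a query only needs to read the partition or bucket that holds the data it asks for.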

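1) Each partition of a partitioned table corresponds to a separate directory on HDFS.

2) That directory holds all the data files of the partition.

3) Partitioning in Hive means splitting into directories: a large data set is divided into smaller data sets according to business needs.

4) At query time, an expression in the WHERE clause selects only the partitions the query needs, which makes the query much more efficient.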

① Creating a partitioned table

1) Syntax:

create table dept_partition(
deptno int,
dname string,
loc string
)
partitioned by (day string)
row format delimited 
fields terminated by '\t';

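Note: the partition column must not be a column that already exists in the table. It can be thought of as a pseudo-column.

2) When loading data into a partitioned table, the target partition must be specified as well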

load data local inpath '/opt/module/hive/datas/dept.txt' into table dept_partition partition(day='20201222');

3) Querying data in a partitioned table

(1) Single-partition query

select * from dept_partition where day='20201222';
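One way to check that the WHERE clause really restricts the query to a single partition is to ask Hive which inputs the query depends on. The sketch below assumes a Hive version that supports EXPLAIN DEPENDENCY. The exact shape of the output varies by version, so it is not reproduced here.

```sql
-- Lists the tables and partitions the query would read, without running it.
-- With the filter on day, only the day=20201222 partition should appear.
explain dependency
select * from dept_partition where day='20201222';
```

If the output lists every partition of the table, the filter is not being used for partition pruning.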

(2) Querying several partitions together

Method 1:

select * from dept_partition where day='20211220'
    union
    select * from dept_partition where day='20201221'
    union
    select * from dept_partition where day='20201222';

Method 2:

select * from dept_partition 
where day='20211220' 
or
day='20201221' 
or 
day='20201222' ;
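The two methods are not strictly equivalent. In Hive 1.2.0 and later a plain UNION removes duplicate rows, while the OR filter returns every row of the selected partitions. The following sketch shows two variants that keep all rows. It reuses the table and partition values from the queries above.

```sql
-- Same three partitions as Method 2, written as an IN list.
select * from dept_partition
where day in ('20211220', '20201221', '20201222');

-- Method 1 with UNION ALL: duplicate rows are kept instead of removed.
select * from dept_partition where day='20211220'
union all
select * from dept_partition where day='20201221'
union all
select * from dept_partition where day='20201222';
```

The IN list is usually the simplest choice, since it scans the table once and still filters on the partition column.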

4) Adding partitions

(1) Add a single partition

alter table dept_partition add partition(day='20201223') ;

(2) Add several partitions at once

alter table dept_partition add partition(day='20201224') partition(day='20201225');

5) Dropping partitions

(1) Drop a single partition

alter table dept_partition drop partition (day='20201221');

(2) Drop several partitions

alter table dept_partition drop partition (day='20201222'), partition(day='20201223');

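Note: when adding several partitions at once, separate the partition clauses with a space; when dropping several partitions at once, separate them with a comma.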
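Adding a partition that already exists, or dropping one that does not, normally raises an error. A minimal sketch of the guarded forms, assuming the same table and partition values as above:

```sql
-- Add partitions only if they are missing (clauses separated by spaces).
alter table dept_partition add if not exists
  partition(day='20201224') partition(day='20201225');

-- Drop partitions only if they are present (clauses separated by commas).
alter table dept_partition drop if exists
  partition(day='20201224'), partition(day='20201225');
```

These forms are convenient in scripts that may be run more than once.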

6) Listing the partitions of a table

show partitions dept_partition;

7) Viewing the structure of a partitioned table

desc formatted dept_partition;
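Both commands also accept a partition spec, which helps when a table has many partitions. The sketch below assumes the day=20201222 partition loaded earlier still exists.

```sql
-- List only the partitions that match the given partition spec.
show partitions dept_partition partition(day='20201222');

-- Show the metadata of one partition, including its location on HDFS.
desc formatted dept_partition partition(day='20201222');
```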

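② Second-level partitions

If the data volume is still large after first-level partitioning and queries remain slow, the range can be narrowed further with a second-level partition.

1) Syntax: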

create table dept_partition2(
deptno int,
dname string,
loc string
)
partitioned by (day string, hour string)
row format delimited 
fields terminated by '\t';

2) Load the data

load data local inpath '/opt/module/hive/datas/dept.txt' into table dept_partition2 partition(day='20211220', hour='11');

3) Query the partition data

select * from dept_partition2 where day='20211220' and hour='11';
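A query does not have to filter on both levels. As a small sketch using the same table, filtering on day alone reads every hour directory under that day, and a partial partition spec lists which hours exist.

```sql
-- Filter on the first-level partition only: all hours of that day are read.
select * from dept_partition2 where day='20211220';

-- List the second-level partitions that exist under one day.
show partitions dept_partition2 partition(day='20211220');
```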

③ Three ways to associate data with a partitioned table

1) Method 1: upload the data, then repair the table

(1) Upload the data

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211220/hour=12;
dfs -put /opt/module/hive/datas/dept.txt /user/hive/warehouse/dept_partition2/day=20211220/hour=12;

(2) Query the data (the newly uploaded data is not found, because the new directory is not yet registered as a partition in the metastore)

select * from dept_partition2 where day='20211220' and hour='12';

(3) Repair the table with msck repair

msck repair table dept_partition2;
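To see what the repair changes, the partition list can be compared before and after it. This is a sketch of that check, assuming the hour=12 directory was uploaded as in step (1).

```sql
-- Before the repair, the uploaded directory is not listed as a partition.
show partitions dept_partition2;

msck repair table dept_partition2;

-- After the repair, day=20211220/hour=12 should be listed.
show partitions dept_partition2;

-- The query from step (2) should now return the uploaded rows.
select * from dept_partition2 where day='20211220' and hour='12';
```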

2) Method 2: upload the data, then add the partition

(1) Upload the data

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211220/hour=13;
dfs -put /opt/module/hive/datas/dept.txt /user/hive/warehouse/dept_partition2/day=20211220/hour=13;

(2) Add the partition

alter table dept_partition2 add partition(day='20211220',hour='13');

(3) Query the data

select * from dept_partition2 where day='20211220' and hour='13';

3) Method 3: create the directory, then load data into the partition

(1) Create the directory

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211220/hour=14;

(2) Load the data

load data local inpath '/opt/module/hive/datas/dept.txt' into table
 dept_partition2 partition(day='20211220',hour='14');

(3) Query the data

select * from dept_partition2 where day='20211220' and hour='14';

④ Dynamic partitions

1) Enable dynamic partitioning

set hive.exec.dynamic.partition=true;

2) Set the dynamic partition mode to non-strict. The default, strict, requires at least one partition column to be static; nonstrict allows all partition columns to be dynamic.

set hive.exec.dynamic.partition.mode=nonstrict;

3) The maximum total number of dynamic partitions that can be created across all nodes running MR. The default is 1000.

set hive.exec.max.dynamic.partitions=1000;

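4) The maximum number of dynamic partitions that can be created on each node running MR. The default is 100.

Set this parameter according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, the parameter must be at least 365; with the default of 100 the job fails.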

set hive.exec.max.dynamic.partitions.pernode=100;
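For the one-year case described above, the settings could be adjusted as in the sketch below. The value 400 is a hypothetical choice that simply leaves some headroom above 365; it is not a recommendation from the original article.

```sql
-- SET with a key and no value prints the current setting.
set hive.exec.max.dynamic.partitions.pernode;

-- Hypothetical values for a source with 365 distinct day values.
set hive.exec.max.dynamic.partitions.pernode=400;
-- The overall limit must also be large enough (1000 is the default).
set hive.exec.max.dynamic.partitions=1000;
```

These SET commands only affect the current session.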

5) The maximum number of HDFS files that can be created in the whole MR job. The default is 100000.

set hive.exec.max.created.files=100000;

6) Whether to throw an exception when an empty partition is generated. This usually does not need to be set. The default is false.

set hive.error.on.empty.partition=false;

7) Example:

Requirement: insert the rows of the dept table into the matching partitions of the target table dept_partition_dy, partitioned by region (the loc column).

(1) Create the Hive table

create table dept_partition_dy(
id int,
name string
)
partitioned by (loc int)
row format delimited 
fields terminated by '\t';

(2) Insert data into dept_partition_dy using dynamic partitioning

insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
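Static and dynamic partition columns can also be mixed in one insert, which works even in strict mode because at least one partition column is static. The sketch below targets the two-level table dept_partition2 from earlier. The day value and the literal '15' used for hour are placeholders chosen for illustration, since the dept table as used in this article has no hour column.

```sql
-- day is static (fixed in the PARTITION clause), hour is dynamic.
-- Static partition columns must come before dynamic ones.
-- The dynamic value is taken from the last column of the SELECT list.
insert into table dept_partition2 partition(day='20211221', hour)
select deptno, dname, loc, '15' as hour from dept;
```

In a real job the last column would normally be an expression derived from the source data, so that each row lands in the partition it belongs to.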

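(3) View the partitions of the target table

show partitions dept_partition_dy;

Addendum:

As of Hive 3.0 the partition clause can be omitted, as long as the selected columns line up with the columns declared when the table was created (regular columns first, partition column last).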

insert into table dept_partition_dy select deptno, dname, loc from dept;

Then check the partitions of dept_partition_dy again.
