Hive分区表分桶表的认识与区别

转载

mob604756fb3b48 2021-07-22 18:57:00

Hive 分区

分区表实际上是在表的目录下在以分区命名，建子目录

作用：进行分区裁剪，避免全表扫描，减少MapReduce处理的数据量，提高效率

一般在公司的hive中，所有的表基本上都是分区表，通常按日期分区、地域分区

分区表在使用的时候记得加上分区字段

分区也不是越多越好，一般不超过3级，根据实际业务衡量

建立分区表：

create table students_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

增加一个分区：

alter table students_pt add partition(pt='20210622');

删除一个分区：

alter table students_pt drop partition(pt='20210112');

查看某个表的所有分区

show partitions students_pt; // 推荐这种方式（直接从元数据中获取分区信息）

select distinct pt from students_pt; // 不推荐

往分区中插入数据：

insert into table students_pt partition(pt='20210101') select * from students;
load data local inpath '/usr/local/soft/data/students.txt' into table students_pt partition(pt='20210111');

查询某个分区的数据：

// 全表扫描，不推荐，效率低
select count(*) from students_pt;

// 使用where条件进行分区裁剪，避免了全表扫描，效率高
select count(*) from students_pt where pt='20210101';

// 也可以在where条件中使用非等值判断
select count(*) from students_pt where pt<='20210112' and pt>='20210110';

Hive动态分区

有的时候我们原始表中的数据里面包含了 ''日期字段 dt''，我们需要根据dt中不同的日期，分为不同的分区，将原始表改造成分区表。

hive默认不开启动态分区

动态分区：根据数据中某几列的不同的取值划分不同的分区

开启Hive的动态分区支持

# 表示开启动态分区
hive> set hive.exec.dynamic.partition=true;
# 表示动态分区模式：strict（需要配合静态分区一起使用）、nostrict
# strict： insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nostrict;
# 表示支持的最大的分区数量为1000，可以根据业务自己调整
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

建立原始表并加载数据

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

建立分区表并加载数据

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

使用动态分区插入数据

// 分区字段需要放在 select 的最后，如果有多个分区字段 同理，它是按位置匹配，不是按名字匹配
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;
// 比如下面这条语句会使用age作为分区字段，而不会使用student_dt中的dt作为分区字段
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;

多级分区

create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;

Hive分桶

分桶实际上是对文件（数据）的进一步切分

Hive默认关闭分桶

作用：在往分桶表中插入数据的时候，会根据 clustered by 指定的字段进行hash分组对指定的buckets个数进行取余，进而可以将数据分割成buckets个数个文件，以达到是数据均匀分布，方便我们取抽样数据，提高Map join效率

分桶字段需要根据业务进行设定可以解决数据倾斜问题

开启分桶开关

hive> set hive.enforce.bucketing=true;

建立分桶表

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

往分桶表中插入数据

// 直接使用load data 并不能将数据打散
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

// 需要使用下面这种方式插入数据，才能使分桶表真正发挥作用
insert into students_buks select * from students;

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：PageRank

下一篇：spark优化总结

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯

Hive分区表分桶表的认识与区别

Hive分区表分桶表的认识与区别

Hive 分区

建立分区表：

增加一个分区：

删除一个分区：

查看某个表的所有分区

往分区中插入数据：

查询某个分区的数据：

Hive动态分区

开启Hive的动态分区支持

建立原始表并加载数据

建立分区表并加载数据

使用动态分区插入数据

多级分区

Hive分桶

开启分桶开关

建立分桶表

往分桶表中插入数据

51CTO博客