Hive Bucketing: Operations and Principles

Reposted

mob6454cc692b0f 2023-07-12 17:07:24


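I. Review: Partitioned Tables

Why partition?
As a system keeps running, the data volume of a table grows. A Hive query normally scans the whole table, so a large amount of irrelevant data is read and query efficiency drops sharply. Partitioning was introduced to avoid this: when a query filters on the partition column (partition_column = '...'), Hive reads only the matching partitions instead of scanning the full table.
A Hive partition is essentially a subdirectory created under the table directory. The partition column is a pseudo-column that is not physically stored in the data files. A table can have one or more partitions, and partitions can be nested, so a partition can itself contain one or more sub-partitions.
Partitions are declared with PARTITIONED BY (col_name data_type). Note that a Hive partition column is declared outside the table's regular column list.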
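As a minimal sketch of the points above, the following HiveQL declares a partitioned table and queries a single partition. The table access_log, its columns, and the date value are hypothetical names chosen for illustration; they do not come from the original article.

```sql
-- Hypothetical table: dt is the partition column and is NOT repeated
-- in the regular column list.
create table if not exists access_log(
uid int,
url string
)
partitioned by (dt string)
row format delimited
fields terminated by ',';

-- Filtering on the partition column lets Hive read only the
-- dt=2023-07-12 subdirectory instead of scanning the whole table.
select uid, url from access_log where dt = '2023-07-12';
```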

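II. Why Bucket?

When the data in a single partition or table keeps growing and partitioning cannot divide it any more finely, bucketing is used to split and manage the data at a finer granularity.
How it is declared: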
[CLUSTERED BY (col_name, col_name, …)
[SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]
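Keyword: BUCKETS
What bucketing is for:
1. It preserves the bucket structure of bucketed query results (the data has already been hash-distributed on the bucketing column).
2. Sampling and JOINs on bucketed tables can make the MapReduce job more efficient.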
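To make the hash distribution in point 1 concrete, the sketch below computes a bucket index by hand. It assumes the legacy scheme in which the bucket is the hash of the column modulo the number of buckets. Newer Hive versions may use a different hash function, so the file a row actually lands in can differ. hash and pmod are Hive built-in functions; the literal uid value and the bucket count of 4 are only for illustration.

```sql
-- Illustration only: approximate, zero-based bucket index for uid = 7
-- in a table declared with "into 4 buckets".
-- Assumes a Hive version that allows SELECT without a FROM clause.
select pmod(hash(7), 4) as bucket_index;
```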

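III. Using Bucketed Tables

1. Create a table with a bucketing definition (a bucketed table)

-- Create a bucketed table: specify the bucketing column, no sort order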
create table if not exists buc1(
uid int,
uname string,
uage int
)
clustered by (uid) into 4 buckets
row format delimited
fields terminated by ','
;
-- Create a bucketed table: specify the bucketing column and a sort order
create table if not exists buc2(
uid int,
uname string,
uage int
)
clustered by (uid) 
sorted by (uid desc) into 4 buckets
row format delimited
fields terminated by ','
;

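2. Load data:

-- Option 1: load the contents of a file directly into the bucketed table
load data local inpath '/usr/local/hive/test/3.txt' into table buc1;
-- Option 2: load with insert into / insert overwrite. This requires an existing buc_temp (an ordinary table) with the same number of columns, and the select on buc_temp must cluster by the bucketing column
insert overwrite table buc1 select uid,uname,uage from buc_temp cluster by (uid);
-- Option 3: also insert into / insert overwrite, but with an explicit sort order: either cluster by (same as option 2) or distribute by ... sort by ..., where the two columns may be the same or different and asc or desc can be chosen
-- Note: buc3 is never created in this article; it is assumed to be a bucketed table defined like buc2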
insert overwrite table buc2 select uid,uname,uage from buc_temp cluster by (uid);
insert overwrite table buc3 select uid,uname,uage from buc_temp distribute by (uid) sort by (uid asc);
insert overwrite table buc3 select uid,uname,uage from buc_temp distribute by (uid) sort by (uid desc);
insert overwrite table buc3 select uid,uname,uage from buc_temp distribute by (uid) sort by (uage desc);
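One caveat about the load options above: in older Hive releases, LOAD DATA typically only moves the file into the table directory and does not redistribute rows into buckets, so the insert ... select route is the safer way to get properly bucketed files. The sketch below assumes a Hive 0.x/1.x installation, where bucketing on insert has to be enabled explicitly (the property is not needed from Hive 2.x onward), and assumes the default warehouse path, which may differ on your cluster.

```sql
-- Assumption: Hive 0.x/1.x. Makes inserts honor the table's bucket
-- definition by using one reducer per bucket.
set hive.enforce.bucketing = true;

-- Populate the 4-bucket table buc1 from the ordinary table buc_temp.
insert overwrite table buc1 select uid, uname, uage from buc_temp;

-- Optional check from the Hive CLI: the table directory should hold
-- one file per bucket. The path is an assumed default location.
dfs -ls /user/hive/warehouse/buc1;
```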

3. Querying a bucketed table

-- Query everything:
select * from buc2;
select * from buc2 tablesample(bucket 1 out of 1);
-- Query a specific bucket:
select * from buc3 tablesample(bucket 1 out of 4 on uid);                            -- rows where uid % 4 = 0
select * from buc3 tablesample(bucket 1 out of 2 on uid);

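Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).
y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling fraction. For example, if the table has 4 buckets, y=2 samples 4/2 = 2 buckets of data, and y=8 samples 4/8 = 1/2 of one bucket.
x is the bucket to start from. If more than one bucket is sampled, each subsequent bucket number is the previous one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).
Note: x must be less than or equal to y.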
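The rules for x and y can be sketched against the 4-bucket table buc1 defined earlier. The comments state which buckets each query is expected to cover according to the rules above; they are not captured output.

```sql
-- y = 4 (equal to the bucket count): one whole bucket, the 2nd.
select * from buc1 tablesample(bucket 2 out of 4 on uid);

-- y = 2 (a factor of 4): 4/2 = 2 buckets, the 2nd and the 4th (x, x+y).
select * from buc1 tablesample(bucket 2 out of 2 on uid);

-- y = 8 (a multiple of 4): 4/8 = half of the 1st bucket.
select * from buc1 tablesample(bucket 1 out of 8 on uid);
```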

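Bucketing summary:

1. Definition:
clustered by (uid) -- specifies the bucketing column
sorted by (uid desc) -- specifies the sort order; declares that the data is expected to be stored sorted by this column in this order
2. Loading data:
cluster by (uid) -- specifies the column that getPartition hashes on; the same column is also the sort column, in ascending order by default
distribute by (uid) -- specifies the column that getPartition hashes on
sort by (uid asc) -- specifies the sort column and the sort order
-- The more flexible form: the getPartition column and the sort column can be specified separately
cluster by (uid) and distribute by (uid) sort by (uid asc) produce the same result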
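The equivalence in the summary can be written out as follows, using the buc_temp and buc2 tables from earlier. Because cluster by always sorts ascending, a table declared with sorted by (uid desc), such as buc2, needs the split form to match its declared order.

```sql
-- These two queries distribute and sort rows the same way.
select uid, uname, uage from buc_temp cluster by uid;
select uid, uname, uage from buc_temp distribute by uid sort by uid asc;

-- cluster by cannot sort descending, so to match buc2's declared
-- "sorted by (uid desc)" the split form is used.
insert overwrite table buc2
select uid, uname, uage from buc_temp distribute by uid sort by uid desc;
```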

Example

For example, partition by sex (1 = male, 2 = female) and, within each partition, bucket by whether uid is odd or even:

uid	uname	usex
1	铁拐李	1
2	张果老	1
3	汉钟离	1
4	韩湘子	1
5	吕洞宾	1
6	蓝采和	2
7	何仙姑	2
8	曹国舅	1

1. Create a staging table:

create table if not exists baxian_temp(
uid int,
uname string,
usex int
)
row format delimited 
fields terminated by ' '
;

2. Load the data:

load data local inpath '/usr/local/hivedata/stu.dat' into table baxian_temp;

3. Create the partitioned and bucketed table:

create table if not exists baxian(
uid int,
uname string
)
partitioned by (sex int)
clustered by (uid) into 2 buckets
row format delimited 
fields terminated by ' '
;
-- Load data with insert into:
insert into table baxian partition(sex) select uid,uname,usex from baxian_temp cluster by (uid);
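The insert above uses a fully dynamic partition, partition(sex) with no fixed value, which Hive rejects under its default strict mode. A minimal sketch of the session settings usually needed first is shown below; the last property applies only to older Hive releases, which is an assumption about your environment.

```sql
-- Allow dynamic partitions, including ones with no static partition key.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
-- Assumption: Hive 0.x/1.x only; not needed from Hive 2.x onward.
set hive.enforce.bucketing = true;

insert into table baxian partition(sex)
select uid, uname, usex from baxian_temp cluster by (uid);
```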

Requirement:
Find the immortals who are female and whose uid is odd:

select * from baxian tablesample(bucket 2 out of 2 on uid) where sex=2;

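Problem: this query does not currently return the expected rows.
Notes:
1. Partitioning uses a column outside the table's columns; bucketing uses a column inside the table.
2. Bucketing divides and manages data at a finer granularity and is mostly used for data sampling and JOIN operations.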
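The article does not say why the sampling query returns nothing. The checks below are one possible way to narrow it down, under the assumption that the cause is either a delimiter mismatch in the staging table or partitions that were never created. They are a troubleshooting sketch, not a confirmed diagnosis.

```sql
-- If the columns come back NULL, the file is probably not delimited by
-- the single space declared for baxian_temp.
select * from baxian_temp limit 3;

-- Confirm that the dynamic-partition insert actually created partitions.
show partitions baxian;

-- Compute the expected rows directly, without relying on bucket files.
select * from baxian where sex = 2 and uid % 2 = 1;
```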

