Hive: Creating Partitioned Tables and Partitions

Reposted

网络小墨 2023-07-07 18:40:01


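Hive implements partitioning through the PARTITIONED BY clause of CREATE TABLE. The partitioning dimension is not a column stored in the actual data; the partition a row belongs to is specified when the data is inserted. To query a particular partition, filter on the partition column in a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(
...
)
PARTITIONED BY (dt STRING,country STRING)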

1. Creating partitions

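Hive partitioned tables have no elaborate partition types (range, list, hash, or composite partitioning). A partition column is also not a field stored with the table's data; it is one or more pseudo-columns. In other words, the table's data files do not actually hold the partition columns' values.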

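Note that the columns defined in the PARTITIONED BY clause are full-fledged columns of the table, known as partition columns.
The data files, however, do not contain values for these columns, because the values are derived from directory names.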

Create a simple partitioned table.

hive> create table partition_test
(member_id string,
 name string
) 
partitioned by (stat_date string,province string) 
row format delimited 
fields terminated by ',';

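This example defines stat_date and province as the partition columns. A partition is often created explicitly before it is used (an INSERT ... PARTITION statement, as in section 2, also creates its target partition if it does not yet exist). For example: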

hive> alter table partition_test add partition (stat_date='2016-04-28',province='beijing');

This creates one partition. Hive also creates a corresponding folder in HDFS storage.

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2016-04-28
/user/hive/warehouse/partition_test/stat_date=2016-04-28/province=beijing    <- the partition just created

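Every partition has its own folder. In this example stat_date is the top level and province is the second level. All partitions with stat_date='20150128' but different province values are placed under

/user/hive/warehouse/partition_test/stat_date=20150128

while partitions with different stat_date values are placed under

/user/hive/warehouse/partition_test/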

For example:

$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 2 items
drwxr-xr-x - admin supergroup 0 2015-01-28 19:46 /user/hive/warehouse/partition_test/stat_date=20150126
drwxr-xr-x - admin supergroup 0 2015-01-29 09:53 /user/hive/warehouse/partition_test/stat_date=20150128
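The folder layout described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Hive's own code: the helper name `partition_dir` is hypothetical, and the warehouse root is simply the path used in this article.

```python
def partition_dir(table_root, partition_spec):
    """Join key=value segments in the order the partition columns were declared."""
    segments = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_root.rstrip("/")] + segments)


root = "/user/hive/warehouse/partition_test"

# Two partitions that share stat_date end up as sibling folders
# beneath the same stat_date=... directory.
beijing = partition_dir(root, [("stat_date", "20150128"), ("province", "beijing")])
zhejiang = partition_dir(root, [("stat_date", "20150128"), ("province", "zhejiang")])
print(beijing)
print(zhejiang)
```

Because stat_date is declared first, it forms the outer directory level and province the inner one.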

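Note that because partition column values are turned into folder paths, any special character in a value, such as '%', ':', '/', or '#', is escaped as % followed by the two-digit hexadecimal ASCII code of the character. For example: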

hive> alter table partition_test add partition (stat_date='2011/07/28',province='zhejiang');
      OK
      Time taken: 4.644 seconds

$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 3 items
drwxr-xr-x - admin supergroup 0 2015-01-29 10:06 /user/hive/warehouse/partition_test/stat_date=2011%2F07%2F28
drwxr-xr-x - admin supergroup 0 2015-01-28 19:46 /user/hive/warehouse/partition_test/stat_date=20150126
drwxr-xr-x - admin supergroup 0 2015-01-29 09:53 /user/hive/warehouse/partition_test/stat_date=20150128
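The escaping rule can be sketched as follows. This is a simplified model that assumes only the four characters named in this article need escaping; Hive's real list is longer. The function name `escape_partition_value` is hypothetical.

```python
# Characters this article names as needing escaping. Hive's real list is
# longer, so treat this set as an illustrative assumption.
SPECIAL_CHARS = set("%:/#")


def escape_partition_value(value):
    """Replace each special character with '%' plus its two-digit hex code."""
    return "".join(
        f"%{ord(ch):02X}" if ch in SPECIAL_CHARS else ch
        for ch in value
    )


escaped = escape_partition_value("2011/07/28")
print("stat_date=" + escaped)  # stat_date=2011%2F07%2F28
```

The '/' character has ASCII code 0x2F, which is why the date 2011/07/28 becomes a single folder name rather than three nested directories.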

2. Inserting data

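A helper non-partitioned table, partition_test_input, is used to prepare data for insertion into partition_test. The steps are as follows.
1) Inspect the structure and data of partition_test_input:

hive> desc partition_test_input;  -- table structure
hive> select * from partition_test_input;  -- table data

2) Insert data into a partition of partition_test: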

insert overwrite table partition_test 
partition(stat_date='2016-04-28',province='jiangsu')
select member_id,name from partition_test_input 
where stat_date='2016-04-28' 
and province='jiangsu';

To insert into several partitions in one statement, name the source table once in a leading FROM clause and omit it from each SELECT:

hive> from partition_test_input
insert overwrite table partition_test partition(stat_date='2016-04-28',province='jiangsu') 
select member_id,name where stat_date='2016-04-28' and province='jiangsu'
insert overwrite table partition_test partition(stat_date='2016-04-28',province='sichuan') 
select member_id,name where stat_date='2016-04-28' and province='sichuan'
insert overwrite table partition_test partition(stat_date='2016-04-28',province='beijing') 
select member_id,name where stat_date='2016-04-28' and province='beijing';

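Pay special attention here. In other databases, the system generally checks that rows inserted into a partitioned table actually belong to the target partition and raises an error if they do not. In Hive, what data goes into a given partition is entirely up to the user, because the partition keys are pseudo-columns that are not stored in the files. For example: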

hive> desc partition_test_input;
OK
stat_date string
member_id string
name string
province string

hive> select * from partition_test_input;
OK
20110526 1 liujiannan liaoning
20110526 2 wangchaoqun hubei
20110728 3 xuhongxing sichuan
20110728 4 zhudaoyong henan
20110728 5 zhouchengyu heilongjiang

Next, insert data into partitions of partition_test:

hive> insert overwrite table partition_test partition(stat_date='20110728',province='henan') 
select member_id,name from partition_test_input 
where stat_date='20110728' and province='henan';

Total MapReduce jobs = 2
...
1 Rows loaded to partition_test
OK

hive> insert overwrite table partition_test partition(stat_date='20110527',province='liaoning') select member_id,name from partition_test_input;
Total MapReduce jobs = 2
...
5 Rows loaded to partition_test
OK

hive> select * from partition_test where stat_date='20110527' and province='liaoning';
OK
1 liujiannan 20110527 liaoning
2 wangchaoqun 20110527 liaoning
3 xuhongxing 20110527 liaoning
4 zhudaoyong 20110527 liaoning
5 zhouchengyu 20110527 liaoning

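The 5 rows in partition_test_input have different stat_date and province values, but after being inserted into partition(stat_date='20110527',province='liaoning') all 5 rows show the same stat_date and province. These two columns are read from the folder names, not from the data files.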
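The behaviour above can be imitated with a small Python model. It is a sketch under stated assumptions, not Hive's reader: the function names are hypothetical, and the data file is assumed to hold only the comma-separated member_id and name fields, as the table definition in section 1 implies.

```python
def parse_partition_path(path):
    """Extract {column: value} from the key=value segments of a partition path."""
    spec = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            spec[key] = value
    return spec


def read_partition(path, file_lines):
    """Append the partition values taken from the path to every row of the file."""
    spec = parse_partition_path(path)
    rows = []
    for line in file_lines:
        member_id, name = line.split(",")
        rows.append((member_id, name, spec["stat_date"], spec["province"]))
    return rows


path = "/user/hive/warehouse/partition_test/stat_date=20110527/province=liaoning"
data_file = ["1,liujiannan", "2,wangchaoqun", "3,xuhongxing"]  # no partition columns stored
result = read_partition(path, data_file)
for row in result:
    print(row)
```

Every row comes back with stat_date 20110527 and province liaoning, whatever values the source rows had, because nothing in the data file records them.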

3. Dynamic partitioning

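With the approach above, a large data source needs one insert statement per partition, which is tedious. Dynamic partitioning solves this problem: each row returned by the query is placed automatically into the matching partition.

Dynamic partitioning is enabled with the following settings: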

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;

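Dynamic partitioning is simple to use. Suppose data is to be inserted under the partition stat_date='2016-04-28', and Hive should decide which province sub-partition each row goes into. Here stat_date is the static partition column and province is the dynamic partition column.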

hive> insert overwrite table partition_test partition(stat_date='2016-04-28',province)
select member_id,name,province from partition_test_input where stat_date='2016-04-28';

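Note that dynamic partitioning does not allow the top-level partition column to be dynamic while a lower-level one is static; static partition columns must come before dynamic ones. Otherwise every top-level partition would have to create the sub-partition defined by the static column.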
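How a dynamic partition column routes rows can be sketched in Python. This is a simplified model rather than Hive's implementation: the function name `route_dynamic` and the sample rows are hypothetical, and the last field of each row is assumed to carry the dynamic partition value, as the trailing select column does in the statement above.

```python
from collections import defaultdict


def route_dynamic(rows, static_stat_date):
    """Group (member_id, name, province) rows by province under one static stat_date."""
    partitions = defaultdict(list)
    for member_id, name, province in rows:
        # The trailing field supplies the dynamic partition value;
        # it is used for routing and is not kept in the stored row.
        partitions[(static_stat_date, province)].append((member_id, name))
    return dict(partitions)


# Hypothetical sample rows, not data from the article.
rows = [("1", "a", "jiangsu"), ("2", "b", "sichuan"), ("3", "c", "jiangsu")]
result = route_dynamic(rows, "2016-04-28")
for spec, data in sorted(result.items()):
    print(spec, data)
```

One partition appears per distinct province value, all under the same static stat_date.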

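hive.exec.max.dynamic.partitions.pernode:
the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).

hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that one DML statement may create (default 1000).
hive.exec.max.created.files: the maximum number of files that all mappers and reducers of a MapReduce job may create (default 100000).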

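Try to keep rows with the same partition column values in the same reducer, so that each reducer creates as few new folders as possible. DISTRIBUTE BY can be used to bring rows with equal partition column values together, as follows.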

insert overwrite table partition_test 
partition(stat_date,province)
select member_id,name,stat_date,province 
from partition_test_input 
distribute by stat_date,province;
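The effect of distributing by the partition columns can be illustrated with a toy Python model. It is a sketch only: the reducer assignment functions are hypothetical stand-ins (Hive's actual hashing differs), and the rows are reduced to their partition key.

```python
import zlib


def count_output_dirs(rows, num_reducers, assign):
    """Count distinct (reducer, partition) pairs, i.e. folders the reducers must write."""
    pairs = set()
    for index, key in enumerate(rows):
        reducer = assign(index, key) % num_reducers
        pairs.add((reducer,) + key)
    return len(pairs)


def round_robin(index, key):
    # Ignores the partition key, so equal keys scatter across reducers.
    return index


def by_partition_key(index, key):
    # A stable hash of the key, so equal keys always reach the same reducer.
    return zlib.crc32("/".join(key).encode("utf-8"))


# Each row is represented only by its (stat_date, province) partition key.
rows = [
    ("20110526", "liaoning"), ("20110526", "liaoning"),
    ("20110728", "sichuan"), ("20110728", "sichuan"),
]
scattered = count_output_dirs(rows, 2, round_robin)
grouped = count_output_dirs(rows, 2, by_partition_key)
print(scattered, grouped)
```

When rows are assigned by partition key, the number of folders written equals the number of distinct partitions. Scattering the same rows makes several reducers write into the same partition, which counts against limits such as hive.exec.max.dynamic.partitions.pernode.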
