Hive分区表关键字 hive分区表建表语句

转载

mob64ca13fbd761 2023-09-07 21:00:19

文章标签 Hive分区表关键字大数据移动开发 hive 数据 文章分类 Hive 大数据

一、分区表创建与说明

必须在表定义时创建partition

a、单分区建表语句：create table day_table (id int, content string) partitioned by (dt string);单分区表，按天分区，在表结构中存在id，content，dt三列。

以dt为文件夹区分

b、双分区建表语句：create table day_hour_table (id int, content string) partitioned by (dt string, hour string);双分区表，按天和小时分区，在表结构中新增加了dt和hour两列。

先以dt为文件夹，再以hour子文件夹区分

添加分区表语法（表已创建，在此基础上添加分区）：

ALTER TABLE table_name ADD
 partition_spec [ LOCATION 'location1' ]
 partition_spec [ LOCATION 'location2' ] ... ALTER TABLE day_table ADD
 PARTITION (dt='2008-08-08', hour='08')
 location '/path/pv1.txt'

删除分区语法：

ALTER TABLE table_name DROP
 partition_spec, partition_spec,...

用户可以用 ALTER TABLE DROP PARTITION 来删除分区。分区的元数据和数据将被一并删除。例：

ALTER TABLE day_hour_table DROP PARTITION (dt='2008-08-08', hour='09');

数据加载进分区表中语法：

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

例：

LOAD DATA INPATH '/user/pv.txt' INTO TABLE day_hour_table PARTITION(dt='2008-08- 08', hour='08'); LOAD DATA local INPATH '/user/hua/*' INTO TABLE day_hour partition(dt='2010-07- 07');当数据被加载至表中时，不会对数据进行任何转换。Load操作只是将数据复制至Hive表对应的位置。数据加载时在表下自动创建一个目录 
 
基于分区的查询的语句 
 ：SELECT day_table.* FROM day_table WHERE day_table.dt>= '2008-08-08';

查看分区语句：

hive> show partitions day_hour_table; OK dt=2008-08-08/hour=08 dt=2008-08-08/hour=09 dt=2008-08-09/hour=09

二、静态与动态分区表

partition是hive提供的一种机制：用户通过指定一个或多个partition key，决定数据存放方式，进而优化数据的查询
一个表可以指定多个partition key，每个partition在hive中以文件夹的形式存在。

1、静态分区（static partition）：
编辑文件：

/home/work/data/test3.txt; /home/work/data/test4.txt;
 $ cat /home/work/data/test3.txt
 1,zxm
 2,ljz
 3,cds
 4,mac
 5,android
 6,symbian
 7,wp

 $ cat /home/work/data/test4.txt
 8,zxm
 9,ljz
 10,cds
 11,mac
 12,android
 13,symbian
 14,wp

建表：

hive> create table student_tmp(id INT, name STRING)
       > partitioned by(academy STRING, class STRING)
       > row format delimited fields terminated by ',';
 OK
 Time taken: 6.505 seconds

id，name是真实列，partition列academy和class是伪列

load数据：(此处直接load数据进partition，在hive 0.6之前的版本，必须先创建好partition，数据才能导入)

hive> load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='computer', class='034');
 Copying data from file:/home/work/data/test3.txt
 Copying file: file:/home/work/data/test3.txt
 Loading data to table default.student_tmp partition (academy=computer, class=034)
 OK
 Time taken: 0.898 seconds
 hive>  load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='physics', class='034'); 
 Copying data from file:/home/work/data/test3.txt
 Copying file: file:/home/work/data/test3.txt
 Loading data to table default.student_tmp partition (academy=physics, class=034)
 OK
 Time taken: 0.256 seconds

查看hive文件结构：

$ hadoop fs -ls  /user/hive/warehouse/student_tmp/
 Found 2 items
 drwxr-xr-x   - work supergroup          0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer
 drwxr-xr-x   - work supergroup          0 2012-07-30 19:00 /user/hive/warehouse/student_tmp/academy=physics
 $ hadoop fs -ls /user/hive/warehouse/student_tmp/academy=computer
 Found 1 items
 drwxr-xr-x   - work supergroup          0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer/class=034

查询数据：

hive> select * from student_tmp where academy='physics';
 OK
 1       zxm     physics 034
 2       ljz     physics 034
 3       cds     physics 034
 4       mac     physics 034
 5       android physics 034
 6       symbian physics 034
 7       wp      physics 034
 Time taken: 0.139 seconds

以上是static partition的示例，static partition即由用户指定数据所在的partition，在load数据时，指定partition(academy='computer', class='034');
static partition常适用于使用处理时间作为partition key的例子。
但是，我们也常常会遇到需要向分区表中插入大量数据，并且插入前不清楚数据归宿的partition，此时，我们需要dynamic partition。
使用动态分区需要设置hive.exec.dynamic.partition参数值为true。
可以设置部分列为dynamic partition列，例如：partition(academy='computer', class);
也可以设置所有列为dynamic partition列，例如partition(academy, class);
设置所有列为dynamic partition列时，需要设置hive.exec.dynamic.partition.mode=nonstrict
需要注意的是，主分区为dynamic partition列，而副分区为static partition列是不允许的，例如partition(academy, class=‘034’);是不允许的
2、动态分区（dynamic partition）：

建表
 hive> create table student(id INT, name STRING)                                                                            
     > partitioned by(academy STRING, class STRING)
     > row format delimited fields terminated by ',';
 OK
 Time taken: 0.393 seconds

 设置参数
 hive> set hive.exec.dynamic.partition.mode=nonstrict; 
 hive> set hive.exec.dynamic.partition=true;

 导入数据：
 hive> insert overwrite table student partition(academy, class)      
     > select id,name,academy,class from student_tmp                 
     > where class='034';                                      
 Total MapReduce jobs = 2
 .........
 OK
 Time taken: 29.616 seconds

 查询数据： hive> select * from student where academy='physics'; 
 OK
 1       zxm     physics 034
 2       ljz     physics 034
 3       cds     physics 034
 4       mac     physics 034
 5       android physics 034
 6       symbian physics 034
 7       wp      physics 034
 Time taken: 0.165 seconds

 查看文件：
 $ hadoop fs -ls  /user/hive/warehouse/student/
 Found 2 items
 drwxr-xr-x   - work supergroup          0 2012-07-30 19:22 /user/hive/warehouse/student/academy=computer
 drwxr-xr-x   - work supergroup          0 2012-07-30 19:22 /user/hive/warehouse/student/academy=physics

3、总结：
hive partition是通过将数据拆分成不同的partition放入不同的文件，从而减少查询操作时数据处理规模的手段。
例如，Hive Select查询中，如果没有建partition，则会扫描整个表内容，这样计算量巨大。如果我们在相应维度做了partition，则处理数据规模可能会大大减少。
|
4、附partition相关参数：hive.exec.dynamic.partition（缺省false）：设置为true允许使用dynamic partition
hive.exec.dynamic.partition.mode（缺省strick）：设置dynamic partition模式（nostrict允许所有partition列都为dynamic partition，strict不允许）
hive.exec.max.dynamic.partitions.pernode （缺省100）：每一个mapreduce job允许创建的分区的最大数量，如果超过了这个数量就会报错hive.exec.max.dynamic.partitions （缺省1000）：一个dml语句允许创建的所有分区的最大数量
hive.exec.max.created.files （缺省100000）：所有的mapreduce job允许创建的文件的最大数量