How to insert data into a Hive partitioned table: loading data into Hive partitions

Reposted

mob6454cc7aec82 2023-07-14 16:18:41


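In offline batch processing, how do we get data into a Hive table? A typical case: BI staff moving from traditional relational databases to big data find that a table is short one column and patch it with INSERT statements, which leaves a pile of small files behind. Hive does support INSERT, but every statement that inserts individual records writes new files, so inserting one record at a time is a poor fit for batch workloads. Downloading the table from HDFS is also not recommended; Sqoop can be used instead.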

I. Partitioned tables (partition table)

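For example, every operation a user performs is written to an operation log so that it can be traced. When we call 10086, pressing 1, 2, or 3 takes us to different menus, and the call is answered by different staff depending on the level of the number.
These log records are stored in a relational database (RDBMS).
A single table cannot cope with all of these records; it is only a matter of time before it falls over.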

Call records, log records, RDBMS

The record table is therefore split into one table per day:
call_record_20190411
call_record_20190412
call_record_20190413

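Hive works the same way and calls this a partitioned table. Each partition is a directory under the table's location:
/user/hive/warehouse/emp/d=20190412
/user/hive/warehouse/emp/d=20190413
A query for one day's records then looks like this:
select xxx from table where d='20190412'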

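Suppose we want the log entries for a single day from these call records.
Without partitioning, all the data sits in one table,
so the query first scans everything and only then applies the WHERE condition d='20190412'.
With partitioning, the query reads only the matching partition, and the performance gain is very large.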

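In big data, a great many bottlenecks are I/O bottlenecks: (1) disk I/O and (2) network I/O.
Partitioning reduces disk I/O because the query is pointed at a specific directory.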
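
As a minimal sketch of this difference, assume a hypothetical call_record table partitioned by a string column d (the table and its columns are illustrative and do not come from the original). Filtering on the partition column lets Hive read only the d=20190412 directory, and EXPLAIN DEPENDENCY is one way to check which partitions a query would touch without running it.

```sql
-- Hypothetical table for illustration; all names here are assumptions.
CREATE TABLE IF NOT EXISTS call_record (
  caller    STRING,
  operation STRING,
  call_time STRING
)
PARTITIONED BY (d STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Filter on the partition column: only the d=20190412 directory is read.
SELECT caller, operation
FROM call_record
WHERE d = '20190412';

-- Lists the tables and partitions the query depends on, without running it.
EXPLAIN DEPENDENCY
SELECT caller, operation
FROM call_record
WHERE d = '20190412';
```

If the WHERE clause filtered only on call_time instead of d, every partition directory would have to be scanned.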

II. Creating a partitioned table (partition table)

The only real difference is in how the table is created.
First, create a partitioned table:

1. Create the order partitioned table:
create table order_partition(
order_no string,
event_time string
)PARTITIONED BY(event_month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

2. Load data into it:
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_partition PARTITION (event_month = '2019-04');

3. Query the data:
select * from order_partition;
hive (ruozeg6)> select * from order_partition;
OK
10703430736748  2018-04-01 06:01:15     2019-04
10703130731234  2018-04-02 07:01:15     2019-04
10703230736723  2018-04-03 08:01:15     2019-04
10703540736722  2018-04-04 09:01:15     2019-04
10703230736734  2018-04-05 10:01:15     2019-04
Time taken: 0.348 seconds, Fetched: 5 row(s)
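
When the source data already lives in another Hive table, a bulk INSERT ... SELECT into a fixed partition is a common alternative to inserting records one at a time. The following is a minimal sketch, assuming a hypothetical staging table order_staging with the same two columns; the partition value '2019-06' and the filter are illustrative only.

```sql
-- order_staging is a hypothetical unpartitioned table (an assumption).
CREATE TABLE IF NOT EXISTS order_staging (
  order_no   STRING,
  event_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Static partition insert: the partition value is fixed in the statement,
-- so the SELECT list supplies only the non-partition columns.
INSERT OVERWRITE TABLE order_partition PARTITION (event_month = '2019-06')
SELECT order_no, event_time
FROM order_staging
WHERE event_time LIKE '2019-06%';
```

Unlike creating the directory by hand, this form registers the partition in the metastore as part of the statement.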

4. Test whether Hive can see the data if we create a partition directory directly on HDFS and upload a file into it:

4.1. [hadoop@hadoop004 data]$ hadoop fs -mkdir -p /user/hive/warehouse/ruozeg6.db/order_partition/event_month=2019-05

4.2. [hadoop@hadoop004 data]$ hadoop fs -put order.txt /user/hive/warehouse/ruozeg6.db/order_partition/event_month=2019-05

4.3. hive (ruozeg6)> select * from order_partition;
OK
10703430736748  2018-04-01 06:01:15     2019-04
10703130731234  2018-04-02 07:01:15     2019-04
10703230736723  2018-04-03 08:01:15     2019-04
10703540736722  2018-04-04 09:01:15     2019-04
10703230736734  2018-04-05 10:01:15     2019-04
Time taken: 0.348 seconds, Fetched: 5 row(s)

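Question: with the internal (managed) table used earlier, a file uploaded to HDFS could be read straight away, so why can't the data be read here after we create the directory? Because the partitioned table's metadata does not yet know about the manually created directory.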

5. Command for refreshing the partition information
Add Partitions syntax:

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
 
partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

For our table:
alter table order_partition add if not exists partition (event_month = '2019-05');

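6. After the refresh, the partition information can be found both in MySQL and in Hive:
To view the partition information in MySQL (the metastore database): select * from PARTITIONS;
select * from TBLS \G

The new partition's data is also visible in Hive: select * from order_partition;
Use show partitions order_partition; to list the table's partitions.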
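
When many directories have been created by hand, adding them one ALTER TABLE at a time gets tedious. As an alternative sketch (not used in the original walk-through), MSCK REPAIR TABLE asks Hive to scan the table's location and register any key=value directories that are missing from the metastore.

```sql
-- Register every partition directory under the table location that the
-- metastore does not know about yet (for example event_month=2019-05).
MSCK REPAIR TABLE order_partition;

-- Confirm which partitions are now registered.
SHOW PARTITIONS order_partition;
```

On tables with a very large number of partitions this scan can be slow, so the explicit ALTER TABLE ... ADD PARTITION form is often preferable when the new partition is known.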

III. Creating multi-level partitions

1. Build a multi-level partitioned table:
create table order_mult_partition(
order_no string,
event_time string
)
PARTITIONED BY(event_month string, step string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. Load the data:
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_mult_partition PARTITION (event_month='2014-05', step='1');

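3. View the contents in Hive:
Recommendation: when writing SQL, specify the partition filter all the way down to the lowest, innermost partition directory.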
select * from order_mult_partition where event_month='2014-05' and step = 1;

hive (ruozeg6)> select * from order_mult_partition where event_month='2014-05' and step = 1;
OK
10703430736748  2018-04-01 06:01:15     2014-05 1
10703430736748  2018-04-01 06:01:15     NULL    2014-05 1
10703430736748  2018-04-01 06:01:15     NULL    2014-05 1
10703430736748  2018-04-01 06:01:15     NULL    2014-05 1
Time taken: 0.13 seconds, Fetched: 4 row(s)
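
The partition commands shown earlier for the single-level table carry over to multi-level tables, with every partition column spelled out. A minimal sketch, where the value '2014-06' is hypothetical and chosen only for illustration:

```sql
-- Both partition columns must be given when adding a partition.
ALTER TABLE order_mult_partition
  ADD IF NOT EXISTS PARTITION (event_month = '2014-06', step = '1');

-- List all partitions; each one appears as event_month=.../step=...
SHOW PARTITIONS order_mult_partition;

-- A partial spec narrows the listing to one first-level partition.
SHOW PARTITIONS order_mult_partition PARTITION (event_month = '2014-06');
```

On HDFS each level becomes one more directory, so this partition lives under event_month=2014-06/step=1 inside the table's location.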

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions (see the Add Partitions section)

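Summary: part of Hive's data lives on HDFS and part of it (the metadata) lives in a relational database. If the metadata has no record of a partition, queries will never return its data.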
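
To see the metadata side directly, the two metastore tables mentioned above can be joined. This is a sketch that assumes a MySQL-backed metastore with the standard schema, where TBLS and PARTITIONS share a TBL_ID column; names may differ between metastore versions.

```sql
-- Run in the MySQL database that backs the Hive metastore, not in Hive.
-- TBLS holds one row per table, PARTITIONS one row per registered partition.
SELECT t.TBL_NAME, p.PART_NAME
FROM TBLS t
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'order_partition';
```

A directory that exists on HDFS but has no row here is exactly the case where Hive returns no data for that partition.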


LOAD syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
 
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
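
As a short illustration of the optional clauses, the two statements below load into the order_partition table from earlier. The partition values and the HDFS path '/tmp/order.txt' are hypothetical.

```sql
-- With LOCAL: the file is copied from the client's local filesystem.
LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'
INTO TABLE order_partition PARTITION (event_month = '2019-06');

-- Without LOCAL: the path is on HDFS and the file is moved into the
-- partition directory. OVERWRITE replaces what the partition already holds.
LOAD DATA INPATH '/tmp/order.txt'
OVERWRITE INTO TABLE order_partition PARTITION (event_month = '2019-07');
```

In both cases the partition value is written out in the statement itself, which is what makes these static-partition loads.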

Everything above the divider is static partitioning: