Questions to consider:
1. What are the five ways to load data into Hive?
2. What is the difference between a load command with and without the LOCAL keyword?
3. What do the OVERWRITE and INTO keywords do, and how do they differ?
4. What restrictions apply to the INPATH path?
5. What is dynamic partition insertion?
6. What settings does dynamic partition insertion require?
Content:
Since Hive has no row-level insert, update, or delete operations, the only way to get data into a table is through some form of "bulk" load operation. We will use a file in the following format to demonstrate five ways of loading data into Hive:
Tom 24 NanJing Nanjing University
Jack 29 NanJing Southeast China University
Mary Kake 21 SuZhou Suzhou University
John Doe 24 YangZhou YangZhou University
Bill King 23 XuZhou Xuzhou Normal University
The fields are separated by \t and represent: name, age, address, and school.
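If you want to reproduce the examples, the sample file can be generated with a short script (a sketch; the output path test1.txt is an assumption, adjust it to wherever you keep your test data):

```python
# Write the sample data set used throughout this article as a
# tab-delimited file matching the schema: name \t age \t address \t school.
rows = [
    ("Tom", 24, "NanJing", "Nanjing University"),
    ("Jack", 29, "NanJing", "Southeast China University"),
    ("Mary Kake", 21, "SuZhou", "Suzhou University"),
    ("John Doe", 24, "YangZhou", "YangZhou University"),
    ("Bill King", 23, "XuZhou", "Xuzhou Normal University"),
]
with open("test1.txt", "w") as f:
    for name, age, address, school in rows:
        f.write(f"{name}\t{age}\t{address}\t{school}\n")
```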
1. Loading data from the local file system
(1) Create the test1 table
hive> CREATE TABLE test1(name STRING,age INT, address STRING,school STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE ;
OK
Time taken: 0.078 seconds
(2) Load data from the local file system
hive> LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt"
    > INTO TABLE test1;
Copying data from file:/home/hadoopUser/data/test1.txt
Copying file: file:/home/hadoopUser/data/test1.txt
Loading data to table hive.test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.364 seconds
(3) Check the result
hive> select * from test1;
OK
Tom 24 NanJing Nanjing University
Jack 29 NanJing Southeast China University
Mary Kake 21 SuZhou Suzhou University
John Doe 24 YangZhou YangZhou University
Bill King 23 XuZhou Xuzhou Normal University
Time taken: 0.533 seconds, Fetched: 5 row(s)
Note:
The LOCAL keyword here means the data is loaded from the local file system into Hive. Since there is no OVERWRITE keyword, the new file is simply added to the table's directory without deleting the existing data. With OVERWRITE, the existing data in the target directory is deleted first.
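As a sketch of the difference (using the same file path as above):

```sql
-- INTO appends: running the same load twice leaves two copies of the rows
-- in the table's directory.
LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt" INTO TABLE test1;

-- OVERWRITE replaces: existing files under the table's directory are
-- removed before the new file is loaded, so only one copy remains.
LOAD DATA LOCAL INPATH "/home/hadoopUser/data/test1.txt" OVERWRITE INTO TABLE test1;
```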
2. Loading data from HDFS into Hive
(1) Empty the table
hive> insert overwrite table test1 select * from test1 where 1=0;  -- empties the table; generally not recommended
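Before the load below can run, the file must already exist in HDFS. A typical upload step looks like the following (a sketch; the /input directory is an assumption carried over from the INPATH used below):

```shell
# Upload the local sample file into HDFS so LOAD DATA INPATH can find it.
hadoop fs -mkdir -p /input
hadoop fs -put /home/hadoopUser/data/test1.txt /input/
```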
(2) Load data from HDFS
hive> LOAD DATA INPATH "/input/test1.txt"
    > OVERWRITE INTO TABLE test1;
Loading data to table hive.test1
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://secondmgt:8020/hive/warehouse/hive.db/test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.355 seconds
(3) Check the result
hive> select * from test1;
OK
Tom 24.0 NanJing Nanjing University
Jack 29.0 NanJing Southeast China University
Mary Kake 21.0 SuZhou Suzhou University
John Doe 24.0 YangZhou YangZhou University
Bill King 23.0 XuZhou Xuzhou Normal University
Time taken: 0.054 seconds, Fetched: 5 row(s)
Note: there is no LOCAL keyword here, so the path is an HDFS path; this is the main difference from the first method. Also, because OVERWRITE was specified, the log shows a Deleted operation: the previously stored data is removed before the load. One further caveat about INPATH: without LOCAL, Hive moves the source file from its original HDFS location into the table's directory rather than copying it, so /input/test1.txt no longer exists after the load.
3. Inserting data from a query
hive> CREATE TABLE test4(name STRING,age FLOAT,address STRING,school STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE ;
OK
Time taken: 0.251 seconds
Creating the table is essentially the same as before, so we won't repeat the details.
hive> INSERT INTO TABLE test4 SELECT * FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0032, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0032/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0032
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 10:49:25,409 Stage-1 map = 0%, reduce = 0%
2014-12-29 10:49:36,900 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
MapReduce Total cumulative CPU time: 2 seconds 830 msec
Ended Job = job_1419317102229_0032
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-29_10-49-10_009_3039854442660229613-1/-ext-10000
Loading data to table hive.test4
Table hive.test4 stats: [numFiles=1, numRows=5, totalSize=211, rawDataSize=206]
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.83 sec HDFS Read: 415 HDFS Write: 278 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 830 msec
OK
Time taken: 28.495 seconds
The number of columns in the new table must match the number of columns returned by the SELECT, and the data types must be compatible. For example, test4 has four columns (name, age, address, and school), so the SELECT must return those same four columns.
hive> select * from test4;
OK
Tom 24.0 NanJing Nanjing University
Jack 29.0 NanJing Southeast China University
Mary Kake 21.0 SuZhou Suzhou University
John Doe 24.0 YangZhou YangZhou University
Bill King 23.0 XuZhou Xuzhou Normal University
Time taken: 0.066 seconds, Fetched: 5 row(s)
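Hive can also scan the source table once and feed several inserts from it, which avoids rereading the data. A sketch (test5 is a hypothetical table that must already exist with the same schema as test4):

```sql
-- Multi-insert: one scan of test1 feeds several INSERT clauses.
-- test5 is hypothetical; create it with the same four columns first.
FROM test1
INSERT OVERWRITE TABLE test4 SELECT name, age, address, school
INSERT INTO TABLE test5 SELECT name, age, address, school WHERE age > 23;
```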
4. Partition inserts
There are two kinds of partition inserts: static and dynamic. When static and dynamic partitions are mixed, the static partition columns must come before the dynamic ones. Both kinds are introduced below.
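To illustrate the ordering rule, here is a sketch using a hypothetical table partitioned by two columns, city and age (this assumes dynamic partitioning has been enabled as shown in the dynamic-partition settings later in this section):

```sql
-- Hypothetical table test_two_parts, PARTITIONED BY (city STRING, age FLOAT):
-- the static partition column (city) must be listed before the dynamic one (age).
INSERT INTO TABLE test_two_parts PARTITION (city='NanJing', age)
SELECT name, address, school, age FROM test1 WHERE address = 'NanJing';
```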
(1) Static partition insert
① Create a partitioned table
hive> CREATE TABLE test2(name STRING,address STRING,school STRING)
    > PARTITIONED BY(age float)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE ;
OK
Time taken: 0.144 seconds
This creates a partitioned table test2, partitioned by age.
② Load data from a query
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT * FROM test1;
FAILED: SemanticException [Error 10044]: Line 1:19 Cannot insert into target table because column number/types are different ''24'': Table insclause-0 has 3 columns, but query has 4 columns.
This fails because test2 is partitioned by age and therefore has only three regular columns, while SELECT * returns four columns. The correct statement is:
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT name,address,school FROM test1;
③ Check the result
hive> select * from test2;
OK
Tom NanJing Nanjing University 24.0
Jack NanJing Southeast China University 24.0
Mary Kake SuZhou Suzhou University 24.0
John Doe YangZhou YangZhou University 24.0
Bill King XuZhou Xuzhou Normal University 24.0
Time taken: 0.079 seconds, Fetched: 5 row(s)
The query shows that every record has age 24.0, so the insert succeeded.
(2) Dynamic partition insert
① Use a partitioned table
We reuse the partitioned table test2 created above.
② Set the required parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Note:
Dynamic partitioning is disabled by default. Once enabled, it runs in "strict" mode by default, which requires at least one static partition column; this helps prevent a badly designed query from generating a huge number of partitions. Since we want no static partition columns here, we set the mode to nonstrict.
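Two related limits are also worth knowing about (a sketch; the values shown are Hive's usual defaults, not recommendations):

```sql
-- Cap the total number of dynamic partitions one statement may create,
-- and the number any single node may create.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```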
③ Run the insert
hive> insert into table test2 partition (age) select name,address,school,age from test1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  mapreduce.job.reduces=<number>
Starting Job = job_1419317102229_0029, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0029/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0029
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-12-28 20:45:07,996 Stage-1 map = 0%, reduce = 0%
2014-12-28 20:45:21,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.67 sec
2014-12-28 20:45:32,926 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.32 sec
MapReduce Total cumulative CPU time: 7 seconds 320 msec
Ended Job = job_1419317102229_0029
Loading data to table hive.test2 partition (age=null)
  Loading partition {age=29.0}
  Loading partition {age=23.0}
  Loading partition {age=21.0}
  Loading partition {age=24.0}
Partition hive.test2{age=21.0} stats: [numFiles=1, numRows=1, totalSize=35, rawDataSize=34]
Partition hive.test2{age=23.0} stats: [numFiles=1, numRows=1, totalSize=42, rawDataSize=41]
Partition hive.test2{age=24.0} stats: [numFiles=1, numRows=2, totalSize=69, rawDataSize=67]
Partition hive.test2{age=29.0} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.32 sec HDFS Read: 415 HDFS Write: 375 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 320 msec
OK
Time taken: 41.846 seconds
Note:
The age column in the SELECT must come last, so that it lines up with the partition column; otherwise the results will be wrong.
④ Check the result
hive> select * from test2;
OK
Mary Kake SuZhou Suzhou University 21.0
Bill King XuZhou Xuzhou Normal University 23.0
John Doe YangZhou YangZhou University 24.0
Tom NanJing Nanjing University 24.0
Jack NanJing Southeast China University 29.0
5. Creating a table and loading data in a single statement
In practice, a query's output may be too large to display conveniently on the console. In such cases it is handy to store the query result directly in a new table; this is known as CTAS (CREATE TABLE ... AS SELECT).
(1) Create the table and load the data
hive> CREATE TABLE test3
    > AS
    > SELECT name,age FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0030, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0030/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-28 20:59:59,375 Stage-1 map = 0%, reduce = 0%
2014-12-28 21:00:10,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1419317102229_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-28_20-59-45_494_6763514583931347886-1/-ext-10001
Moving data to: hdfs://secondmgt:8020/hive/warehouse/hive.db/test3
Table hive.test3 stats: [numFiles=1, numRows=5, totalSize=63, rawDataSize=58]
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.68 sec HDFS Read: 415 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
Time taken: 26.583 seconds
(2) Check the result
hive> select * from test3;
OK
Tom 24.0
Jack 29.0
Mary Kake 21.0
John Doe 24.0
Bill King 23.0
Time taken: 0.045 seconds, Fetched: 5 row(s)
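One caveat worth noting: in Hive, CTAS cannot target a partitioned table or an external table. If you only need to copy a table's schema without its data, CREATE TABLE ... LIKE can be used instead (a sketch; test1_copy is a hypothetical table name):

```sql
-- Copy the schema of test1 (columns, delimiters, storage format)
-- without copying any data.
CREATE TABLE test1_copy LIKE test1;
```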
That concludes this article!