下面是hive基本练习,持续补充中。
Hive导出数据有几种方式,如何导出数据
1.insert
# 分为导出到本地或者hdfs,还可以格式化输出,指定分隔符
# 导出到本地
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu;
INFO : Compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 0.107 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
INFO : Query ID = hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : Starting Job = job_1573910690864_0002, Tracking URL = http://node01:8088/proxy/application_1573910690864_0002/
INFO : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job -kill job_1573910690864_0002
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2019-11-16 22:19:40,957 Stage-1 map = 0%, reduce = 0%
INFO : 2019-11-16 22:19:42,002 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.51 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 510 msec
INFO : Ended Job = job_1573910690864_0002
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Copying data to local directory /kkb/install/hivedatas/stu3 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-19-32_776_5008666227900564137-1/-mr-10000
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 1.51 sec HDFS Read: 3381 HDFS Write: 285797 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 seconds 510 msec
INFO : Completed executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 10.251 seconds
INFO : OK
No rows affected (10.383 seconds)
# 查看本地文件
[hadoop@node01 /kkb/install/hivedatas/stu3]$ cat 000000_0
1clyang
# 导出到hdfs
0: jdbc:hive2://node01:10000> insert overwrite directory '/kkb/stu' select * from stu;
INFO : Compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 0.173 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
INFO : Query ID = hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852
INFO : Total jobs = 3
INFO : Launching Job 1 out of 3
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : Starting Job = job_1573910690864_0003, Tracking URL = http://node01:8088/proxy/application_1573910690864_0003/
INFO : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job -kill job_1573910690864_0003
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2019-11-16 22:24:13,962 Stage-1 map = 0%, reduce = 0%
INFO : 2019-11-16 22:24:15,018 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.46 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 460 msec
INFO : Ended Job = job_1573910690864_0003
INFO : Starting task [Stage-6:CONDITIONAL] in serial mode
INFO : Stage-3 is selected by condition resolver.
INFO : Stage-2 is filtered out by condition resolver.
INFO : Stage-4 is filtered out by condition resolver.
INFO : Starting task [Stage-3:MOVE] in serial mode
INFO : Moving data to: hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000 from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10002
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Moving data to: /kkb/stu from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 1.46 sec HDFS Read: 3315 HDFS Write: 286719 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 seconds 460 msec
INFO : Completed executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 9.044 seconds
INFO : OK
# 查看hdfs
[hadoop@node01 /kkb/install/hivedatas/stu3]$ hdfs dfs -cat /kkb/stu/000000_0
19/11/16 22:26:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1clyang
# 可以指定导出本地格式化分隔符,以导出到本地为例
0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu;
INFO : Compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 0.128 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
INFO : Query ID = hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : Starting Job = job_1573910690864_0005, Tracking URL = http://node01:8088/proxy/application_1573910690864_0005/
INFO : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job -kill job_1573910690864_0005
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2019-11-16 22:31:27,083 Stage-1 map = 0%, reduce = 0%
INFO : 2019-11-16 22:31:28,139 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.93 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 930 msec
INFO : Ended Job = job_1573910690864_0005
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Copying data to local directory /kkb/install/hivedatas/stu4 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-31-20_415_1737902713220629568-1/-mr-10000
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 1.93 sec HDFS Read: 3526 HDFS Write: 286073 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 seconds 930 msec
INFO : Completed executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 8.707 seconds
INFO : OK
# 查看本地文件,发现以@分隔
[hadoop@node01 /kkb/install/hivedatas/stu4]$ cat 000000_0
1@clyang
2.hadoop命令
数据使用hive保存后存在于hdfs,也可以直接从hdfs将数据拉到本地,使用get命令。
hdfs dfs -get /user/hive/warehouse/student/student.txt /opt/bigdata/data
3.bash shell覆盖追加导出
使用bin/hive -e sql语句或者bin/hive -f sql脚本,将数据覆盖或者追加导出,这里以前者为例,另外sql脚本本质上主要还是sql语句。
# 覆盖写
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' > /kkb/install/hivedatas/student2.txt
ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-16 22:37:46,342 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/16 22:37:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
OK
Time taken: 6.966 seconds, Fetched: 1 row(s)
You have new mail in /var/spool/mail/root
# 查看结果
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id stu.name
1 clyang
# 追加写
[hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' >> /kkb/install/hivedatas/student2.txt
ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-11-16 22:39:03,442 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/16 22:39:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
OK
Time taken: 6.056 seconds, Fetched: 1 row(s)
You have new mail in /var/spool/mail/root
# 查看追加写后结果
[hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
stu.id stu.name
1 clyang
stu.id stu.name
1 clyang
4.export导出到hdfs
# 导出
0: jdbc:hive2://node01:10000> export table stu to '/kkb/studentexport';
INFO : Compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.126 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
INFO : Starting task [Stage-0:COPY] in serial mode
INFO : Copying data from file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata to hdfs://node01:8020/kkb/studentexport
INFO : Copying file: file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata
INFO : Starting task [Stage-1:COPY] in serial mode
INFO : Copying data from hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu to hdfs://node01:8020/kkb/studentexport/data
INFO : Copying file: hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu/000000_0
INFO : Completed executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.604 seconds
INFO : OK
# 查看数据
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport
19/11/17 20:29:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rwxr-xr-x 3 anonymous supergroup 1330 2019-11-05 09:43 /kkb/studentexport/_metadata
drwxr-xr-x - anonymous supergroup 0 2019-11-05 09:43 /kkb/studentexport/data
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport/data
19/11/17 20:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 3 anonymous supergroup 9 2019-11-05 09:43 /kkb/studentexport/data/000000_0
You have new mail in /var/spool/mail/root
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -cat /kkb/studentexport/data/000000_0
19/11/17 20:29:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1clyang
分区和分桶的区别
分区是文件夹范畴的,就是按照文件夹区分,来存储文件,分桶是文件范畴的,将一个文件根据某个字段按hash取余,拆分为几个文件片段保存,它们都有各自的应用场景:
(1)分区用在按照日期,按照天,或者小时来保存数据,后面查询可以根据需求快速定位到数据,避免了速度慢的全表扫描查询。
(2)分桶则是更加细粒度的存储,可以指定桶的个数n,这样一份文件保存会划分为n份,如果想快速查找可以用tablesample(bucket x out of y)来指定要抽样查询的桶表。
另外分区表里面可能有分桶表。
将数据直接上传到分区目录(hdfs)上,让分区表和数据产生关联有哪些方式?
当创建分区表并将数据导入到分区后,发现导入的数据就保存在对应的分区目录下,并且可以正常查询表内容。如果先将数据导入到事先准备好的分区,然后再创建分区表,是查不到数据的,因为还没有建立分区表数据和hive表的映射关系,需要使用命令来修复,此外还有2种方法。
方法1 msck repair table 表名
提前准备好分区,并将数据上传。
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /mystudentdatas/month=11/
19/11/17 12:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 3 hadoop supergroup 199 2019-11-17 12:36 /mystudentdatas/month=11/student.csv
创建表格
0: jdbc:hive2://node01:10000> create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas';
INFO : Compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas'
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.149 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/mystudentdatas'
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.271 seconds
INFO : OK
修复表格,使用msck,修复后就可以查看到表的数据了,映射关系建立。
# 修复表格
0: jdbc:hive2://node01:10000> msck repair table student_partition_me;
INFO : Compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.011 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.263 seconds
INFO : OK
No rows affected (0.311 seconds)
# 查询,最后字段为分区字段month
0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_me;
INFO : Compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.133 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
INFO : Completed executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.0 seconds
INFO : OK
+-----+-------+-------------+---------+--------+--+
| id | name | year | gender | month |
+-----+-------+-------------+---------+--------+--+
| 01 | 赵雷 | 1990-01-01 | 男 | 11 |
| 02 | 钱电 | 1990-12-21 | 男 | 11 |
| 03 | 孙风 | 1990-05-20 | 男 | 11 |
| 04 | 李云 | 1990-08-06 | 男 | 11 |
| 05 | 周梅 | 1991-12-01 | 女 | 11 |
| 06 | 吴兰 | 1992-03-01 | 女 | 11 |
| 07 | 郑竹 | 1989-07-01 | 女 | 11 |
| 08 | 王菊 | 1990-01-20 | 女 | 11 |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.214 seconds)
方法2 alter table 表名 add partition(col=xxx)
将数据上传到hdfs
# 注意这里hdfs数据目录换成studentdatas
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /studentdatas/month=12/
19/11/17 16:51:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 3 hadoop supergroup 199 2019-11-17 16:51 /studentdatas/month=12/student.csv
创建表格
0: jdbc:hive2://node01:10000> create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas';
INFO : Compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas'
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.011 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '\t' location '/studentdatas'
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.097 seconds
INFO : OK
使用alter table指定分区
0: jdbc:hive2://node01:10000> alter table student_partition_pa add partition(month='12');
INFO : Compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.051 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.116 seconds
INFO : OK
查询数据,ok
0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_pa;
INFO : Compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.092 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
INFO : Completed executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.001 seconds
INFO : OK
+-----+-------+-------------+---------+--------+--+
| id | name | year | gender | month |
+-----+-------+-------------+---------+--------+--+
| 01 | 赵雷 | 1990-01-01 | 男 | 12 |
| 02 | 钱电 | 1990-12-21 | 男 | 12 |
| 03 | 孙风 | 1990-05-20 | 男 | 12 |
| 04 | 李云 | 1990-08-06 | 男 | 12 |
| 05 | 周梅 | 1991-12-01 | 女 | 12 |
| 06 | 吴兰 | 1992-03-01 | 女 | 12 |
| 07 | 郑竹 | 1989-07-01 | 女 | 12 |
| 08 | 王菊 | 1990-01-20 | 女 | 12 |
+-----+-------+-------------+---------+--------+--+
8 rows selected (0.162 seconds)
方法3 load data inpath 'hdfs文件路径' into table 表名 partition(col名='xxx')
数据上传到hdfs
[hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /
19/11/17 17:21:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 13 items
# 上传person.txt到hdfs
-rw-r--r-- 3 hadoop supergroup 68 2019-11-17 17:09 /person.txt
创建表格
0: jdbc:hive2://node01:10000> create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas';
INFO : Compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas'
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.023 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '\t' collection items terminated by ',' location '/persondatas'
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.101 seconds
INFO : OK
将hdfs上文件加载到分区目录下
0: jdbc:hive2://node01:10000> load data inpath '/person.txt' into table person_partition partition(age='25');
INFO : Compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.082 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table myhive.person_partition partition (age=25) from hdfs://node01:8020/person.txt
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Partition myhive.person_partition{age=25} stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
INFO : Completed executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.382 seconds
INFO : OK
查询数据,ok
0: jdbc:hive2://node01:10000> select * from person_partition;
INFO : Compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:person_partition.name, type:string, comment:null), FieldSchema(name:person_partition.citys, type:array<string>, comment:null), FieldSchema(name:person_partition.age, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.099 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
INFO : Completed executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.001 seconds
INFO : OK
+------------------------+----------------------------------------------+-----------------------+--+
| person_partition.name | person_partition.citys | person_partition.age |
+------------------------+----------------------------------------------+-----------------------+--+
| yang | ["beijing","shanghai","tianjin","hangzhou"] | 25 |
| messi | ["changchu","chengdu","wuhan"] | 25 |
+------------------------+----------------------------------------------+-----------------------+--+
分桶表是否可以通过直接load将数据导入?
桶表需要根据某个字段进行hash取余然后拆分数据保存为不同的文件保存到hdfs,需要通过普通中间表的字段值来计算拆分,因此不能直接load导入,直接导入到hdfs只有一个文件。另外通过桶表的文件类型可以看出,它不是原来的格式了,是一个mr计算后的文件,因此也说明不能用hdfs直接导入。
hive中分区可以提高查询效率,分区是否越多越好,为什么?
hive查询本质上是执行MapReduce任务,如果分区太多,同样体量的数据会产生更多的小文件block块,则会产生的更多的元数据(块的位置、大小等信息),这样对namenode来说压力很大。
另外hive sql会转化为mapreduce任务,分区的一个小文件会对应一个的task,一个task对应一个JVM实例,过多的分区会产生大量的JVM实例,导致JVM频繁的创建与销毁,会降低系统整体性能。