mysql导入到hbase mysql导入到hive翻几倍

转载

mob64ca14193248 2023-08-21 15:38:56

文章标签 mysql导入到hbase hive hdfs mysql 文章分类 MySQL 数据库

文章目录

一、说明
二、Hive参数

2.1 单表导入

2.1.1 默认参数
2.1.2 根据数据筛选条件导入
2.1.3 设置导入后的表名
2.1.4 更改表的某个字段类型
2.1.5 导入指定库
2.1.6 增量导入

2.1.6.1 Append模式
2.1.6.2 lastModify方式（经测试，导入hive不支持时间戳，导入HDFS时可以）

2.1.7 根据指定字段导入
2.1.8 根据SQL查询结果导入

2.1.8.1 两表链接导入

2.2 全表导入

2.2.1 全表导入，并排除某张表

2.3 sqoop脚本

https://zhuanlan.zhihu.com/p/39113492

一、说明

MySQL数据导入Hive方法与导入HDFS类似：，Sqoop只是将数据导入HDFS后再转成Hive表

二、Hive参数

Hive除了自己的参数，同样也支持import的参数，见导入HDFS文章

--hive-home <dir>			Hive 的安装目录，可以通过该参数覆盖掉默认的 hive 目录
--hive-import				将表导入Hive
--hive-overwrite			覆盖Hive表中的现有数据
--create-hive-table			默认是 false,如果目标表已经存在了，那么创建任务会失败
--hive-table <table-name>	设置导入hive后的表名
--hive-drop-import-delims	导入到Hive时从字符串字段中 删除\ n，\ r和\ 01
--hive-delims-replacement	导入到Hive时，使用用户定义的字符串 替换字符串字段中的\ n，\ r和\ 01
--map-column-hive <arg>		生成 hive 表时，可以更改生成字段的数据类型

参考模版

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--columns "id,fname,lname" \
--hive-import \
--hive-overwrite  \
--hive-partition-key dt  \
--hive-partition-value %(yesterday)s  \
--hive-table f_fuwu.%(hive_table)s  \
--hive-drop-import-delims  \
--fields-terminated-by '%(fields_terminated)s'  \
--lines-terminated-by '\n'  \
--null-string '\\\\N'  \
--null-non-string '\\\\N'  \
-m 1

2.1 单表导入

2.1.1 默认参数

命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import

HDFS目录
默认在hive的家目录/user/hive/warehouse，有一个以源表名命名的目录，同样没有设置-m参数，生成了4个文件

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/person
Found 4 items
-rwxrwxrwx   3 hdfs hive   62664632 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00000
-rwxrwxrwx   3 hdfs hive   63776345 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00001
-rwxrwxrwx   3 hdfs hive   63778151 2019-08-27 15:38 /user/hive/warehouse/person/part-m-00002
-rwxrwxrwx   3 hdfs hive   63776312 2019-08-27 15:39 /user/hive/warehouse/person/part-m-00003

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881
11戌狗240
12亥猪780
13311
14llkkbllkkbd300
15llkkllkk540
16llkkbllkkbd131
17llkkllkk71
18llkkllkk120
19llkkcllkkc841
20llkkfllkkf471
cat: Unable to write to output stream.

Hive表内容
生成了相同名称的表

hive> show tables;
OK
person
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive> select * from person where id <13;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
11		戌狗		24		0
12		亥猪		78		0
Time taken: 16.497 seconds, Fetched: 12 row(s)

2.1.2 根据数据筛选条件导入

导入id小于11的数据
命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-overwrite \
-m 1 \
--where "id<11"

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/person
Found 2 items
-rwxrwxrwx   3 hdfs hive          0 2019-08-27 16:34 /user/hive/warehouse/person/_SUCCESS
-rwxrwxrwx   3 hdfs hive        185 2019-08-27 16:34 /user/hive/warehouse/person/part-m-00000

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881

Hive表内容

hive> select * from person;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
Time taken: 0.357 seconds, Fetched: 10 row(s)

2.1.3 设置导入后的表名

设置表名为person_hive： --hive-table person_hive命令

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-overwrite \
-m 1 \
--where "id<11" \
--hive-table person_hive

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 2 items
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:18 /user/hive/warehouse/person
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:04 /user/hive/warehouse/person_hive

HDFS文件内容

[root@master ~]# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/person/part-m-00000 | head -n 20
1甲子鼠180
2乙丑牛null1
3丙寅虎640
4丁卯兔490
5戊辰龙null0
6己巳蛇660
7庚午马491
8辛未羊141
9壬申猴450
10癸酉鸡881

Hive表内容

hive> show tables;
OK
person
person_hive
Time taken: 0.131 seconds, Fetched: 2 row(s)
hive> select * from person_hive;
OK
1	甲	子鼠		18		0
2	乙	丑牛		NULL	1
3	丙	寅虎		64		0
4	丁	卯兔		49		0
5	戊	辰龙		NULL	0
6	己	巳蛇		66		0
7	庚	午马		49		1
8	辛	未羊		14		1
9	壬	申猴		45		0
10	癸	酉鸡		88		1
Time taken: 0.066 seconds, Fetched: 10 row(s)

2.1.4 更改表的某个字段类型

更改sex字段的类型为STRING

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--map-column-hive sex=STRING \
--hive-import

2.1.5 导入指定库

使用--hive-database参数

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table person \
--hive-import \
--hive-database liuli \
-m 1

2.1.6 增量导入

--check-column：指定增量导入的依赖字段，通常为自增的主键id或者时间戳
--incremental：指定导入的模式（append或lastmodified）
--last-value：指定导入的上次最大值也就是这次开始的值

2.1.6.1 Append模式

指定一个自增字段，根据指定的自增字段值增量导入(经测试，指定超前的ID不会覆盖)

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--incremental append \
--check-column id \
--last-value 5 \
-m 1

2.1.6.2 lastModify方式（经测试，导入hive不支持时间戳，导入HDFS时可以）

指定一个时间戳字段，根据指定的时间戳增量导入

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--incremental lastmodified  \
--check-column dtime \
--last-value "2014-11-09 21:00:00" \
-m 1

2.1.7 根据指定字段导入

指定导入id,dname

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--table cars \
--hive-import \
--hive-database liuli \
--hive-overwrite \
--columns id,dname \
-m 1

2.1.8 根据SQL查询结果导入

使用了自定义sql就不能指定--table
自定义sql语句的where条件中必须包含字符串$CONDITIONS，$CONDITIONS是一个变量，用于给多个map任务划分任务范围
使用自定义sql时，如果通过参数-m指定多个map任务，由于自定义sql中可能存在多表查询，因此必须使用参数“--split-by 表名.字段名”指定多个map任务分割数据的根据字段，如--split-by users.id
提示必须要加--target-dir 参数，但是加上之后有没有导入指定的目录，暂时未解

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-database liuli \
--hive-table cars \
--query 'SELECT id,dname,cname FROM cars where id < 8 AND $CONDITIONS' \
--target-dir /usr/1 \
-m 1

2.1.8.1 两表链接导入

会将查询出的数据导入一张表
注意给字段取别名，不然会报错字段重复

sudo -u hdfs \
sqoop import \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-database liuli \
--hive-table cars \
--query 'SELECT a.id AS aid,a.dname AS adname,a.cname AS acname,b.id AS bid,b.dname AS bdname,b.cname AS bcname FROM cars a LEFT JOIN cars2 b ON a.id = b.id  AND $CONDITIONS' \
--target-dir /usr/1 \
-m 1

2.2 全表导入

满足三个条件：
1、每个表必须都只有一个列作为主键；
2、必须将每个表中所有的数据导入，而不是部分；
3、你必须使用默认分隔列，且WHERE子句无任何强加的条件

不能使用的参数

--table
--split-by
--columns
--where

--exclude-tables：可以用来排除导入某个表

命令

sudo -u hdfs \
sqoop import-all-tables \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-overwrite \
-m 1

HDFS目录

[root@master ~]# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:17 /user/hive/warehouse/cars
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:18 /user/hive/warehouse/person
drwxrwxrwx   - hdfs hive          0 2019-08-27 17:04 /user/hive/warehouse/person_hive

Hive表内容

hive> show tables;
OK
cars
person
person_hive
Time taken: 0.955 seconds, Fetched: 3 row(s)

2.2.1 全表导入，并排除某张表

命令
如需排除多表，用英文逗号隔开

sudo -u hdfs \
sqoop import-all-tables \
--connect jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false \
--username root \
--password xxxxxxxx \
--hive-import \
--hive-overwrite \
--exclude-tables person \
-m 1

2.3 sqoop脚本

使用--options-file参数可以执行已经编写好的脚本

脚本示例

import
--connect
jdbc:mysql://192.168.0.20:23306/test_mysql?tinyInt1isBit=false
--username
root
--password
xxxxxxxx
--table
person
--hive-import
--hive-overwrite
-m
1
--hive-table
person_hive

命令示例

一定确定文件权限HDFS用户可用

sudo -u hdfs sqoop --options-file /hive/hive.txt

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：成为python程序员要学哪些学python能当程序员吗

下一篇：python爬虫怎么做 python爬虫简单

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯