Sqoop import from MySQL (0900_bin collation, garbled Chinese): importing data from MySQL into HDFS with Sqoop

Reposted

网络小墨 2023-08-30 13:33:15

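import: importing data from MySQL into HDFS

Import from a traditional relational database into HDFS, Hive, HBase, and so on.

MySQLToHDFS

Write the script and save it as MySQLToHDFS.conf

There are two ways to run a Sqoop job. The first is to type the whole command directly at the command line. The second is to put the options into a script file and run that file with a separate command.

Method 1: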
sqoop import \
--append \
--connect jdbc:mysql://master:3306/student \
--username root \
--password 123456 \
--table student \
--m 4 \
--split-by age \
--target-dir /shujia/bigdata17/student1/ \
--fields-terminated-by '\t'

MyExercise:

sqoop import \
--append \
--connect jdbc:mysql://master:3306/students \
--username root \
--password 123456 \
--table student \
--m 2 \
--split-by age \
--target-dir /shujia/bigdata17/student1/ \
--fields-terminated-by '\t'



Method 2 (contents of the options file):
import
--append
--connect
jdbc:mysql://master:3306/student
--username
root
--password
123456
--table
student
--m
4
--split-by
age
--target-dir
/shujia/bigdata17/student21/
--fields-terminated-by
','

Method 2, MyExercise:
import 
--append
--connect
jdbc:mysql://master:3306/students
--username
root
--password
123456
--table
student
--m
2
--split-by
age
--target-dir
/shujia/bigdata17/student2/
--fields-terminated-by
'\t'

Run the script

sqoop --options-file MySQLToHDFS.conf
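
The one-token-per-line layout of the options file can be produced mechanically from the command-line form. The following is a minimal sketch, not part of Sqoop: to_options_file is a hypothetical helper name, and it simply prints every argument it receives on its own line.

```shell
#!/bin/sh
# Hypothetical helper (not a Sqoop command): print each command-line token
# on its own line, which is the layout used by the options files above.
to_options_file() {
    for token in "$@"; do
        printf '%s\n' "$token"
    done
}

# Demo with a few of the options from Method 1 (password omitted here).
to_options_file import --append \
    --connect jdbc:mysql://master:3306/student \
    --username root \
    --table student
```

Redirecting the output to MySQLToHDFS.conf gives a file that can be passed to sqoop --options-file. Because each shell argument becomes exactly one line, an option name and its value can never end up on the same line.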

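Notes:

1. --m (documented as -m or --num-mappers) sets the number of Map tasks to launch. More is not always better, because the MySQL server can only handle a limited load.

2. When the number of Map tasks is greater than 1, also pass --split-by to name the split column, which determines the portion of the data each map task reads. Prefer a numeric column, ideally the primary key or another evenly distributed column, so that the map tasks process similar amounts of data.

3. If the values of the split column are unevenly distributed, the import may suffer from data skew.

4. The split column should preferably be of an integer type such as int or bigint.

5. In an options file, an option's value must not be on the same line as the option name. For example, for --username:

--username root  // wrong

// should be split across two lines
--username
root

6. The run may log an InterruptedException. This is a known issue in Hadoop 2.7.6 and can be ignored.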

21/01/25 14:32:32 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:716)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:476)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:652)

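7. Sqoop reads MySQL data over JDBC, so throughput is limited when the data volume is large.

8. Under the hood, Sqoop runs imports and exports as MapReduce jobs that need only Map tasks and no Reduce tasks, which is why the output files are named like part-m-00000.

9. Each Map task produces one output file.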
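
Notes 2 to 4 and 8 to 9 can be made concrete with a small sketch of range splitting. This is a simplified illustration, not Sqoop's actual splitter: split_ranges is a hypothetical function, the minimum and maximum of the split column are passed in by hand, and the mapping from task number to file name is only indicative. It divides the value range into equal-width slices, one per map task.

```shell
#!/bin/sh
# Simplified illustration of range splitting (Sqoop's real splitter differs
# in detail). Usage: split_ranges COLUMN MIN MAX MAPPERS
split_ranges() {
    col=$1; min=$2; max=$3; n=$4
    # Width of each slice: ceil((max - min + 1) / n) using integer math.
    step=$(( (max - min + n) / n ))
    i=0
    lo=$min
    while [ "$i" -lt "$n" ] && [ "$lo" -le "$max" ]; do
        hi=$(( lo + step ))
        if [ "$hi" -gt "$max" ]; then
            # Last slice: closed at the upper end so MAX itself is included.
            printf 'part-m-%05d: %s >= %d AND %s <= %d\n' "$i" "$col" "$lo" "$col" "$max"
        else
            printf 'part-m-%05d: %s >= %d AND %s < %d\n' "$i" "$col" "$lo" "$col" "$hi"
        fi
        lo=$hi
        i=$(( i + 1 ))
    done
}

# Four map tasks over an age column whose values run from 21 to 24.
split_ranges age 21 24 4
```

The slices are equal in width over the column's values, not equal in row count. If most rows share a few values of the split column, one map task receives most of the data, which is the skew described in note 3.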

MySQLToHive

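Sqoop first exports the MySQL data to a temporary directory on HDFS, by default /user/<username>/<table name>.
It then loads the data into Hive and, once the load has finished, deletes the temporary directory.

Write the script and save it as MySQLToHive.conf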

import
--connect
jdbc:mysql://master:3306/student
--username
root
--password
123456
--table
score
--m
3
--split-by
student_id
--fields-terminated-by
'\t'
--hive-import
--hive-overwrite
--create-hive-table
--hive-database
sqooptest
--hive-table
mysqltoscore

The following option is built in; appending it to the script speeds up the import.
--direct

Run the script

sqoop --options-file MySQLToHive.conf

--direct

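With this option, Sqoop exports the MySQL data with mysqldump, the dump tool that ships with MySQL, which speeds up the export.

The map tasks run on the worker nodes, so /usr/bin/mysqldump on master must be copied to /usr/bin on node1 and node2: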

scp /usr/bin/mysqldump node1:/usr/bin/
scp /usr/bin/mysqldump node2:/usr/bin/
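
With more worker nodes, the same copy can be generated in a loop. This is a small sketch under the assumption that the tool lives at the same path on every node; print_dist_cmds is a hypothetical helper that only prints the scp commands and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one scp command per worker node.
# Usage: print_dist_cmds TOOL_PATH NODE...
print_dist_cmds() {
    tool=$1
    shift
    for node in "$@"; do
        # Copy into the same directory on the remote node.
        printf 'scp %s %s:%s/\n' "$tool" "$node" "$(dirname "$tool")"
    done
}

print_dist_cmds /usr/bin/mysqldump node1 node2
```

Piping the output to sh would execute the copies; printing first makes it easy to review the commands.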

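Using the --e option

When importing, Sqoop can take a SQL query through --e (documented as -e or --query) to restrict the rows it reads. The query must also contain the $CONDITIONS placeholder, which is what allows the MapReduce job to run in parallel.
Whenever --e is used with a query, $CONDITIONS is required, even if there is only one map task.
Sqoop gets its transfer speed from Hadoop's parallelism. To let Sqoop split the query into chunks that can be transferred in parallel, include the $CONDITIONS placeholder in the WHERE clause of the query. Sqoop replaces the placeholder with generated conditions that specify which slice of the data each task transfers. In an options file the placeholder is written as is; on the command line, inside double quotes, it must be escaped as \$CONDITIONS so that the shell does not expand it.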

import
--connect
jdbc:mysql://master:3306/student
--username
root
--password
123456
--m
2
--split-by
student_id
--e
"select * from score where student_id=1500100001 and $CONDITIONS"
--target-dir
/testE
--fields-terminated-by
'\t'
--hive-import
--hive-overwrite
--create-hive-table
--hive-database
sqooptest
--hive-table
mysqltoscore3
--direct
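
The placeholder substitution described above can be sketched in a few lines of shell. This is an illustration only: fill_conditions is a hypothetical function, and the exact text Sqoop generates for each task may differ. It rejects a query without the placeholder and otherwise replaces the first $CONDITIONS with a per-task predicate.

```shell
#!/bin/sh
# Hypothetical illustration of $CONDITIONS substitution (not Sqoop code).
# Usage: fill_conditions QUERY PREDICATE
fill_conditions() {
    query=$1
    predicate=$2
    case $query in
        *'$CONDITIONS'*) ;;
        *)
            echo 'query must contain $CONDITIONS' >&2
            return 1
            ;;
    esac
    # Split the query around the first literal $CONDITIONS.
    prefix=${query%%\$CONDITIONS*}
    suffix=${query#*\$CONDITIONS}
    printf '%s( %s )%s\n' "$prefix" "$predicate" "$suffix"
}

# Single quotes keep $CONDITIONS literal, so the shell does not expand it.
q='select * from score where student_id=1500100001 and $CONDITIONS'
fill_conditions "$q" 'student_id >= 1500100001 AND student_id <= 1500100001'
```

Each map task would run the query with its own predicate, so every task reads a different slice of the result.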

MySQLToHBase

Write the script and save it as MySQLToHBase.conf

Sqoop 1.4.6 can create the HBase table automatically only for HBase versions earlier than 1.0.1

import
--connect
jdbc:mysql://master:3306/student
--username
root
--password
123456
--table
student
--hbase-table
studentsq
--column-family
cf1
--hbase-row-key
id
--m
1
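
The options file above does not pass --hbase-create-table, so the target table and column family have to exist before the import starts. A minimal sketch of preparing them follows; hbase_create_stmt is a hypothetical helper that only builds the HBase shell statement, and the commented pipe assumes the hbase client is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: build the HBase shell statement that creates a table
# with a single column family. Usage: hbase_create_stmt TABLE COLUMN_FAMILY
hbase_create_stmt() {
    printf "create '%s', '%s'\n" "$1" "$2"
}

# Table and column family names are the ones used in MySQLToHBase.conf.
hbase_create_stmt studentsq cf1

# To apply it on a cluster (assumption: hbase is installed and on PATH):
# hbase_create_stmt studentsq cf1 | hbase shell
```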