Tags (space-separated): collaboration framework
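I. DataX Overview
1.1 Introduction to DataX
1.1. What is DataX
DataX is an offline data synchronization tool for heterogeneous data sources, open-sourced by Alibaba. It aims to provide stable and efficient data synchronization between heterogeneous data sources such as relational databases (MySQL, Oracle, etc.),
HDFS, Hive, ODPS, HBase, and FTP.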
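1.2. Design of DataX
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link,
with DataX acting as the intermediate transport that connects the data sources. When a new data source needs to be added,
it only has to be connected to DataX, after which it can synchronize seamlessly with every data source that is already connected.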
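To make the benefit of the star layout concrete, here is a back-of-the-envelope sketch. It assumes that every pair of data sources may need to synchronize in both directions, so a mesh needs one dedicated link per ordered pair, while the star layout needs one reader plugin and one writer plugin per source. The function names are illustrative only and are not part of DataX.

```python
def mesh_links(n_sources: int) -> int:
    """Dedicated links needed when every source syncs directly with every other one, in both directions."""
    return n_sources * (n_sources - 1)


def star_plugins(n_sources: int) -> int:
    """Plugins needed when every source only talks to the hub: one reader plus one writer per source."""
    return 2 * n_sources


if __name__ == "__main__":
    for n in (3, 6, 10):
        print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

Under these assumptions the mesh grows quadratically with the number of sources, while the star grows linearly, which is why adding one more source only costs one more reader/writer pair.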
1.3 DataX architecture
1.4 How it runs
II. Quick Start
2.1. Official links
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source code: https://github.com/alibaba/DataX
mkdir -p /opt/bigdata/
cd /opt/bigdata/
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -zxvf datax.tar.gz
cd /opt/bigdata/datax/bin/
python datax.py /opt/bigdata/datax/job/job.json
----
Error: DataX reports that the plugin.json of a plugin named ._drdsreader does not exist:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00],
Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,
您提供的配置文件[/opt/datax/plugin/reader/._drdsreader/plugin.json]不存在.
请检查您的配置文件.
cd /opt/bigdata/datax/plugin/reader/
---
# reader: delete the stray ._* entries that DataX mistakes for plugins
rm -rf ._hdfsreader
rm -rf ._otsstreamreader
rm -rf ._otsreader
rm -rf ._txtfilereader
rm -rf ._ftpreader
rm -rf ._streamreader
rm -rf ._odpsreader
rm -rf ._cassandrareader
rm -rf ._hbase11xreader
rm -rf ._oraclereader
rm -rf ._postgresqlreader
rm -rf ._mysqlreader
rm -rf ._rdbmsreader
rm -rf ._mongodbreader
rm -rf ._ossreader
rm -rf ._sqlserverreader
rm -rf ._hbase094xreader
rm -rf ._drdsreader
---
cd /opt/bigdata/datax/plugin/writer
---
# writer: delete the stray ._* entries that DataX mistakes for plugins
rm -rf ._hbase11xsqlwriter
rm -rf ._ocswriter
rm -rf ._adswriter
rm -rf ._drdswriter
rm -rf ._hbase11xwriter
rm -rf ._hbase094xwriter
rm -rf ._sqlserverwriter
rm -rf ._osswriter
rm -rf ._mongodbwriter
rm -rf ._rdbmswriter
rm -rf ._mysqlwriter
rm -rf ._postgresqlwriter
rm -rf ._oraclewriter
rm -rf ._cassandrawriter
rm -rf ._odpswriter
rm -rf ._streamwriter
rm -rf ._ftpwriter
rm -rf ._txtfilewriter
rm -rf ._otswriter
rm -rf ._hdfswriter
---
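The per-plugin deletions above can also be done in one pass. The following is a minimal sketch, assuming the same install path as above (/opt/bigdata/datax/plugin) and that every entry whose name starts with ._ directly under reader/ and writer/ is safe to delete. The helper name remove_hidden_entries is hypothetical.

```python
import shutil
from pathlib import Path


def remove_hidden_entries(plugin_root: str) -> list:
    """Delete every entry named ._* directly under reader/ and writer/; return the removed paths."""
    removed = []
    for kind in ("reader", "writer"):
        kind_dir = Path(plugin_root) / kind
        if not kind_dir.is_dir():
            continue  # nothing to clean if this directory is missing
        for entry in sorted(kind_dir.glob("._*")):
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # stray entry is a directory
            else:
                entry.unlink()  # stray entry is a plain file or a symlink
            removed.append(str(entry))
    return removed


if __name__ == "__main__":
    # Install path taken from the steps above; adjust it if DataX lives elsewhere.
    for path in remove_hidden_entries("/opt/bigdata/datax/plugin"):
        print("removed", path)
```

Regular plugin directories such as mysqlreader are left untouched because their names do not match the ._* pattern.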
Run the job again:
cd /opt/bigdata/datax/bin/
python datax.py /opt/bigdata/datax/job/job.json
III. Usage Examples
3.1 Read data from a stream and print it to the console
1) View the configuration template
python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
2) Write a configuration file based on the template
cd /opt/bigdata/datax/job
vim stream2stream.json
----
Add the following content:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
----
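If you need several variants of this job (for example with a different record count), the file can be generated instead of edited by hand. This is only a sketch: the helper name build_stream_job is hypothetical, and the keys and values simply mirror the configuration shown above.

```python
import json


def build_stream_job(record_count: int, channel: int = 1) -> dict:
    """Build a streamreader -> streamwriter job with the same layout as stream2stream.json."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string", "value": "hello,DataX"},
                            ],
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    # Print the job as JSON; redirect the output to a .json file to use it with datax.py.
    print(json.dumps(build_stream_job(10), indent=4))
```

Generating the file through json.dumps also guarantees that the result is syntactically valid JSON, which rules out one common cause of configuration errors.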
Run:
python /opt/bigdata/datax/bin/datax.py /opt/bigdata/datax/job/stream2stream.json
3.2 Read data from MySQL and store it in HDFS
3.2.1 View the official template
python /opt/bigdata/datax/bin/datax.py -r mysqlreader -w hdfswriter
mysqlreader parameter reference:
hdfswriter parameter reference:
3.2.2 Prepare the data
1) Create the student table
mysql> create database datax;
mysql> use datax;
mysql> create table student(id int,name varchar(20));
2) Insert data
mysql> insert into student values(1001,'zhangsan'),(1002,'lisi'),(1003,'wangwu');
mysql> insert into student values(1004,'qq'),(1005,'yy'),(1006,'zz');
3.2.3 Write the configuration file
Run the following on the CDH node:
cd /opt/bigdata/datax/job/
vim mysql2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://172.30.10.11:3306/datax"
                                ],
                                "table": [
                                    "student"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": "flyfish225"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://172.30.10.11:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "student.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
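Before submitting a job like this, it can help to sanity-check that the reader and writer column lists line up. The sketch below assumes that the columns are meant to correspond one-to-one and in the same order, as they do in the configuration above. The helper name check_columns is hypothetical and is not part of DataX.

```python
def check_columns(job: dict) -> list:
    """Return a list of problems; an empty list means reader and writer columns line up."""
    problems = []
    for index, item in enumerate(job["job"]["content"]):
        reader_cols = item["reader"]["parameter"].get("column", [])
        writer_cols = item["writer"]["parameter"].get("column", [])
        if len(reader_cols) != len(writer_cols):
            problems.append(
                f"content[{index}]: reader has {len(reader_cols)} columns, "
                f"writer has {len(writer_cols)}"
            )
            continue
        # Compare names position by position (reader uses plain strings, writer uses objects).
        for pos, (r_col, w_col) in enumerate(zip(reader_cols, writer_cols)):
            if isinstance(r_col, str) and isinstance(w_col, dict) and r_col != w_col.get("name"):
                problems.append(
                    f"content[{index}] column {pos}: reader '{r_col}' vs writer '{w_col.get('name')}'"
                )
    return problems


if __name__ == "__main__":
    # A cut-down copy of the job above, keeping only the column lists.
    sample = {"job": {"content": [{
        "reader": {"parameter": {"column": ["id", "name"]}},
        "writer": {"parameter": {"column": [
            {"name": "id", "type": "int"},
            {"name": "name", "type": "string"},
        ]}},
    }]}}
    print(check_columns(sample) or "columns line up")
```

To check the real file, load it with json.load and pass the resulting dictionary to check_columns.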
3.2.4 Run the job
su - hdfs
cd /opt/bigdata/datax/
python bin/datax.py job/mysql2hdfs.json