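Summary: an introduction to DataX 3.0, Alibaba Cloud's open-source offline synchronization tool.
I. DataX 3.0 overview
DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable, efficient data synchronization between sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
If you are not familiar with it, read this first: https://developer.aliyun.com/article/59373
Source code: https://github.com/alibaba/DataX?spm=a2c6h.12873639.0.0.21084f64hM6IE9
DataX now has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big-data compute systems are all covered. The currently supported data sources are listed in the table below.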
Type | Data source | Reader (read) | Writer (write) | Docs |
RDBMS (relational databases) | MySQL | √ | √ | |
| Oracle | √ | √ | |
| SQLServer | √ | √ | |
| PostgreSQL | √ | √ | |
| DRDS | √ | √ | |
| Generic RDBMS (supports all relational databases) | √ | √ | |
Alibaba Cloud data warehouse storage | ODPS | √ | √ | |
| ADS | | √ | |
| OSS | √ | √ | |
| OCS | √ | √ | |
NoSQL data storage | OTS | √ | √ | |
| Hbase0.94 | √ | √ | |
| Hbase1.1 | √ | √ | |
| Phoenix4.x | √ | √ | |
| Phoenix5.x | √ | √ | |
| MongoDB | √ | √ | |
| Hive | √ | √ | |
| Cassandra | √ | √ | |
Unstructured data storage | TxtFile | √ | √ | |
| FTP | √ | √ | |
| HDFS | √ | √ | |
| Elasticsearch | | √ | |
Time-series databases | OpenTSDB | √ | | |
| TSDB | √ | √ | |
1. The mysql2es job script
test.json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 2
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "datax",
            "password": "123456",
            "where": "updated_at>='${start_time} 00:00:00' and updated_at<='${end_time} 23:59:59'",
            "column": [
              "id",
              "app_id",
              "collection_phone",
              "transaction_number",
              "pay_amount",
              "if(auto_tags is null,'',replace(replace(replace(auto_tags,'[',''),']',''),'\"','')) as auto_tags",
              "if(manual_tags is null,'',replace(replace(replace(manual_tags,'[',''),']',''),'\"','')) as manual_tags",
              "if(latest_days_ordered_at is null,'',replace(replace(latest_days_ordered_at,'[',''),']','')) as latest_days_ordered_at",
              "if(latest_days_paid_at is null,'',replace(replace(latest_days_paid_at,'[',''),']','')) as latest_days_paid_at",
              "if(latest_days_visited_at is null,'',replace(replace(latest_days_visited_at,'[',''),']','')) as latest_days_visited_at",
              "latest_ordered_at",
              "visited_products",
              "ordered_products"
            ],
            "connection": [
              {
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db_user?com.mysql.jdbc.faultInjection.serverCharsetIndex=45"],
                "table": [
                  "user"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "elasticsearchwriter",
          "parameter": {
            "endpoint": "http://127.0.0.1:9200",
            "accessId": "elastic",
            "accessKey": "123456",
            "index": "user",
            "type": "traces",
            "settings": {"index": {"number_of_shards": 5, "number_of_replicas": 1}},
            "batchSize": 5000,
            "splitter": ",",
            "column": [
              {"name": "pk", "type": "id"},
              {"name": "app_id", "type": "keyword"},
              {"name": "collection_phone", "type": "keyword"},
              {"name": "transaction_number", "type": "integer"},
              {"name": "pay_amount", "type": "integer"},
              {"name": "auto_tags", "type": "keyword", "array": true},
              {"name": "manual_tags", "type": "keyword", "array": true},
              {"name": "latest_days_ordered_at", "type": "long", "array": true},
              {"name": "latest_days_paid_at", "type": "long", "array": true},
              {"name": "latest_days_visited_at", "type": "long", "array": true},
              {"name": "latest_ordered_at", "type": "long"},
              {"name": "visited_products", "type": "nested"},
              {"name": "ordered_products", "type": "nested"}
            ]
          }
        }
      }
    ]
  }
}
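The replace() calls in the reader columns and the "splitter" setting in the writer are meant to work as a pair: the SQL flattens array-like text into a comma-separated string, and a column marked "array": true is then split on the splitter. The following is only a rough Python imitation of those two steps to make the data flow visible. The function names are hypothetical and not part of DataX, and the empty-string handling in particular is an assumption about the writer rather than documented behaviour.

```python
def strip_like_reader(raw):
    """Imitate if(col is null,'',replace(replace(replace(col,'[',''),']',''),'"',''))."""
    if raw is None:
        return ""
    return raw.replace("[", "").replace("]", "").replace('"', "")


def split_like_writer(flat, splitter=","):
    """Assumed handling of an "array": true column: split the text on the splitter."""
    return flat.split(splitter) if flat else []


flat = strip_like_reader('["vip","new"]')
print(flat)                     # vip,new
print(split_like_writer(flat))  # ['vip', 'new']
```

One consequence of this design is that a tag containing a comma would be split into two array elements, so the splitter should be a character that never appears in the values.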
2. Run the DataX job
python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"
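The -p option carries the -D pairs that fill in ${start_time} and ${end_time} in test.json. If the job is launched from a scheduler script rather than typed by hand, the same command could be assembled along these lines. This is a sketch only: build_datax_cmd is a hypothetical helper, the paths are the ones used in this article, and nothing is executed here.

```python
import shlex


def build_datax_cmd(job_file, params, datax_py="/usr/local/datax/bin/datax.py"):
    """Return the argv list for the command shown above (nothing is executed)."""
    # All -D pairs travel as ONE argument after -p, which is why the
    # shell command wraps them in a single pair of quotes.
    defines = " ".join("-D{}={}".format(key, value) for key, value in params.items())
    return ["python", datax_py, job_file, "-p", defines]


cmd = build_datax_cmd("./test.json",
                      {"start_time": "2020-09-02", "end_time": "2020-09-02"})
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be handed to subprocess.run(cmd) as is, which avoids shell quoting problems with the space inside the -p value.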
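2.1 Plugins [mysqlreader, elasticsearchwriter] fail to load
The run failed immediately. The log below (DataX prints it in Chinese) reports Framework-12, a plugin initialization error that is usually caused by a faulty DataX installation, and names elasticsearchwriter and mysqlreader as the plugins that could not be loaded: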
2020-09-02 15:49:33.747 [main] WARN ConfigParser - 插件[mysqlreader,elasticsearchwriter]加载失败,1s后重试... Exception:Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
2020-09-02 15:49:34.765 [main] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:142)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
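2.2 Check whether the mysqlreader and elasticsearchwriter plugins are installed
The log says the plugins failed to load, so let's go look and let the facts speak.
mysqlreader is already there!!
Uh-oh, elasticsearchwriter really is missing. Keep it quiet, I'll go install it right away...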
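The check above amounts to looking for one directory per plugin under the DataX plugin folder. A small sketch of that check follows, assuming the plugin/reader/<name> and plugin/writer/<name> layout implied by the copy command later in this article; missing_plugins is a hypothetical helper, not a DataX tool.

```python
import os


def missing_plugins(datax_home, readers, writers):
    """Return the plugin names that have no directory under <datax_home>/plugin."""
    missing = []
    for kind, names in (("reader", readers), ("writer", writers)):
        for name in names:
            if not os.path.isdir(os.path.join(datax_home, "plugin", kind, name)):
                missing.append(name)
    return missing


# /usr/local/datax is the install path used by the run command in this article.
print(missing_plugins("/usr/local/datax", ["mysqlreader"], ["elasticsearchwriter"]))
```

An empty list means every plugin named in the job file has a directory; it does not prove the plugin itself is intact.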
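3. Install the elasticsearchwriter plugin (for newcomers who have never installed a plugin; if you have, skip ahead)
3.1 Fetch the DataX source code onto the server as DataX-master
3.2 Edit pom.xml in the project root, keeping only the modules you need
<!-- The original file lists every module, but you normally build only what you need -->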
<modules>
<module>common</module>
<module>core</module>
<module>transformer</module>
<!-- reader -->
<module>mysqlreader</module>
<module>drdsreader</module>
<module>sqlserverreader</module>
<module>postgresqlreader</module>
<module>oraclereader</module>
<module>odpsreader</module>
<module>otsreader</module>
<module>otsstreamreader</module>
<module>txtfilereader</module>
<module>hdfsreader</module>
<module>streamreader</module>
<module>ossreader</module>
<module>ftpreader</module>
<module>mongodbreader</module>
<module>rdbmsreader</module>
<module>hbase11xreader</module>
<module>hbase094xreader</module>
<module>tsdbreader</module>
<module>opentsdbreader</module>
<module>cassandrareader</module>
<module>gdbreader</module>
<!-- writer -->
<module>mysqlwriter</module>
<module>drdswriter</module>
<module>odpswriter</module>
<module>txtfilewriter</module>
<module>ftpwriter</module>
<module>hdfswriter</module>
<module>streamwriter</module>
<module>otswriter</module>
<module>oraclewriter</module>
<module>sqlserverwriter</module>
<module>postgresqlwriter</module>
<module>osswriter</module>
<module>mongodbwriter</module>
<module>adswriter</module>
<module>ocswriter</module>
<module>rdbmswriter</module>
<module>hbase11xwriter</module>
<module>hbase094xwriter</module>
<module>hbase11xsqlwriter</module>
<module>hbase11xsqlreader</module>
<module>elasticsearchwriter</module>
<module>tsdbwriter</module>
<module>adbpgwriter</module>
<module>gdbwriter</module>
<module>cassandrawriter</module>
<module>clickhousewriter</module>
<!-- common support module -->
<module>plugin-rdbms-util</module>
<module>plugin-unstructured-storage-util</module>
<module>hbase20xsqlreader</module>
<module>hbase20xsqlwriter</module>
</modules>
After the change:
<!-- Trimmed down to the modules this job needs -->
<modules>
<module>common</module>
<module>core</module>
<module>transformer</module>
<!-- reader -->
<module>mysqlreader</module>
<!-- writer -->
<module>elasticsearchwriter</module>
<!-- common support module -->
<module>plugin-rdbms-util</module>
<module>plugin-unstructured-storage-util</module>
<module>hbase20xsqlreader</module>
<module>hbase20xsqlwriter</module>
</modules>
3.3 Build the elasticsearchwriter plugin
mvn clean install -Dmaven.test.skip=true
3.4 Copy the build output into /datax/plugin/, taking care to put readers under reader and writers under writer
cp -r /usr/local/DataX-master/elasticsearchwriter/target/datax/plugin/writer/elasticsearchwriter /usr/local/data/datax/datax/plugin/writer
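If the copy step is scripted, the reader/writer distinction can be made explicit so a plugin cannot land in the wrong folder. A minimal sketch, where install_plugin is a hypothetical helper and the paths in the usage comment are the ones from the cp command above:

```python
import os
import shutil


def install_plugin(built_dir, plugin_root, kind):
    """Copy a built plugin directory into <plugin_root>/<kind>/ and return the target."""
    if kind not in ("reader", "writer"):
        raise ValueError("kind must be 'reader' or 'writer'")
    name = os.path.basename(os.path.normpath(built_dir))
    target = os.path.join(plugin_root, kind, name)
    # copytree refuses to overwrite: it raises FileExistsError if target exists.
    shutil.copytree(built_dir, target)
    return target


# Usage mirroring the cp command above:
# install_plugin(
#     "/usr/local/DataX-master/elasticsearchwriter/target/datax/plugin/writer/elasticsearchwriter",
#     "/usr/local/data/datax/datax/plugin",
#     "writer",
# )
```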
4. Rerun the DataX command: success!!!
python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"
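5. Incremental sync is keyed on time!!! The where clause filters on updated_at, so each run picks up only the rows updated between start_time and end_time.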
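For a job that runs once a day, the two dates can be computed instead of typed. The sketch below assumes the job should cover the previous calendar day, which is a scheduling choice not stated in this article, and uses Python's string.Template only to preview the where clause; in a real run DataX does the substitution from the -D values. daily_window is a hypothetical helper.

```python
from datetime import date, timedelta
from string import Template

# Same where clause as in test.json.
WHERE = Template("updated_at>='${start_time} 00:00:00' and updated_at<='${end_time} 23:59:59'")


def daily_window(today):
    """Return the start_time/end_time pair covering the day before `today`."""
    day = (today - timedelta(days=1)).isoformat()
    return {"start_time": day, "end_time": day}


params = daily_window(date(2020, 9, 3))
print(WHERE.substitute(params))
# updated_at>='2020-09-02 00:00:00' and updated_at<='2020-09-02 23:59:59'
```

Because the filter depends on updated_at, rows whose timestamp is not refreshed on update will be missed by later runs.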