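Introduction: DataX 3.0, Alibaba Cloud's open-source offline synchronization tool

I. DataX 3.0 overview

DataX is an offline synchronization tool for heterogeneous data sources. It is designed to provide stable, efficient data synchronization between sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.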

If you are not familiar with DataX, read this introduction first: https://developer.aliyun.com/article/59373

Source code: https://github.com/alibaba/DataX?spm=a2c6h.12873639.0.0.21084f64hM6IE9

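DataX already has a fairly complete plugin ecosystem covering mainstream RDBMS databases, NoSQL stores, and big-data computing systems. The currently supported data sources are listed below.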

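| Type | Data source | Reader | Writer | Docs |
|---|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ | Read, Write |
| | Oracle | √ | √ | Read, Write |
| | SQLServer | √ | √ | Read, Write |
| | PostgreSQL | √ | √ | Read, Write |
| | DRDS | √ | √ | Read, Write |
| | Generic RDBMS (supports all relational databases) | √ | √ | Read, Write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | Read, Write |
| | ADS | | √ | Write |
| | OSS | √ | √ | Read, Write |
| | OCS | √ | √ | Read, Write |
| NoSQL data stores | OTS | √ | √ | Read, Write |
| | Hbase0.94 | √ | √ | Read, Write |
| | Hbase1.1 | √ | √ | Read, Write |
| | Phoenix4.x | √ | √ | Read, Write |
| | Phoenix5.x | √ | √ | Read, Write |
| | MongoDB | √ | √ | Read, Write |
| | Hive | √ | √ | Read, Write |
| | Cassandra | √ | √ | Read, Write |
| Unstructured data storage | TxtFile | √ | √ | Read, Write |
| | FTP | √ | √ | Read, Write |
| | HDFS | √ | √ | Read, Write |
| | Elasticsearch | | √ | Write |
| Time-series databases | OpenTSDB | √ | | Read |
| | TSDB | √ | √ | Read, Write |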

1. The mysql2es job script

test.json

{
  "job": {
    "setting": {
      "speed": {
        "channel": 2
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "datax",
            "password": "123456",
            "where":"updated_at>='${start_time} 00:00:00' and updated_at<='${end_time} 23:59:59'",
            "column": [
              "id",
              "app_id",        
              "collection_phone",
              "transaction_number",
              "pay_amount",             
              "if(auto_tags is null,'',replace(replace(replace(auto_tags,'[',''),']',''),'\"','')) as auto_tags",
              "if(manual_tags is null,'',replace(replace(replace(manual_tags,'[',''),']',''),'\"','')) as manual_tags",
              "if(latest_days_ordered_at is null,'',replace(replace(latest_days_ordered_at,'[',''),']','')) as latest_days_ordered_at",
              "if(latest_days_paid_at is null,'',replace(replace(latest_days_paid_at,'[',''),']','')) as latest_days_paid_at",
              "if(latest_days_visited_at is null,'',replace(replace(latest_days_visited_at,'[',''),']','')) as latest_days_visited_at",
              "latest_ordered_at",            
              "visited_products",
              "ordered_products"
            ],
            "connection": [
              {
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db_user?com.mysql.jdbc.faultInjection.serverCharsetIndex=45"],
                "table": [
                  "user"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "elasticsearchwriter",
          "parameter": {
            "endpoint": "http://127.0.0.1:9200",
            "accessId": "elastic",
            "accessKey": "123456",
            "index":"user",
            "type":"traces",
            "settings": {"index" :{"number_of_shards": 5, "number_of_replicas": 1}},
            "batchSize": 5000,
            "splitter": ",",
            "column": [
              {"name":"pk","type":"id"},
              {"name":"app_id","type":"keyword"},            
              {"name":"collection_phone","type":"keyword"},
              {"name":"transaction_number","type":"integer"},
              {"name":"pay_amount","type":"integer"},
              {"name":"auto_tags","type":"keyword","array":true},
              {"name":"manual_tags","type":"keyword","array":true},
              {"name":"latest_days_ordered_at","type":"long","array":true},
              {"name":"latest_days_paid_at","type":"long","array":true},
              {"name":"latest_days_visited_at","type":"long","array":true},
              {"name":"latest_ordered_at","type":"long"},           
              {"name":"visited_products","type":"nested"},
              {"name":"ordered_products","type":"nested"}
            ]
          }
        }
      }
    ]
  }
}
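As far as I understand DataX, reader and writer columns are paired by position rather than by name, which is why the reader's `id` ends up in the writer's `pk` field in the job above. The following is a minimal sketch, assuming the job layout shown in test.json, of a pre-flight check that the two column lists line up. The function name `check_column_alignment` is hypothetical and not part of DataX.

```python
def check_column_alignment(job: dict) -> list:
    """Return (reader_expression, writer_field) pairs; raise if the counts differ."""
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    if len(reader_cols) != len(writer_cols):
        raise ValueError(
            f"reader has {len(reader_cols)} columns, "
            f"writer has {len(writer_cols)}"
        )
    # Pair the columns in order; positional matching is an assumption here.
    return [(r, w["name"]) for r, w in zip(reader_cols, writer_cols)]
```

To use it, load the job with `json.load` and print the returned pairs; a mismatch in count raises before the job is ever submitted.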

2. Run the DataX job

python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"

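2.1 Plugins [mysqlreader,elasticsearchwriter] fail to load

The job failed immediately. In the log below, DataX reports a plugin initialization error (Framework-12), says it is usually caused by an incorrect DataX installation, and lists elasticsearchwriter and mysqlreader as not loaded: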


2020-09-02 15:49:33.747 [main] WARN  ConfigParser - 插件[mysqlreader,elasticsearchwriter]加载失败,1s后重试... Exception:Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .].  - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
2020-09-02 15:49:34.765 [main] ERROR Engine -

经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .].  - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
        at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
        at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:142)
        at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
        at com.alibaba.datax.core.Engine.entry(Engine.java:137)
        at com.alibaba.datax.core.Engine.main(Engine.java:204)

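2.2 Check whether the mysqlreader and elasticsearchwriter plugins are installed

Since DataX says the plugins failed to load, let's look at the plugin directories and see for ourselves.

mysqlreader is already there.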

  


Oops, elasticsearchwriter really is missing. Let's quietly go and install it.
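The manual check can also be scripted. This is a minimal sketch, assuming plugins live under `plugin/reader/<name>` and `plugin/writer/<name>` inside the DataX home directory (the layout implied by the copy command in step 3.4). The function name `missing_plugins` is hypothetical.

```python
import os

def missing_plugins(datax_home: str, job: dict) -> list:
    """List the plugin directories the job needs that are absent on disk."""
    content = job["job"]["content"][0]
    wanted = [
        ("reader", content["reader"]["name"]),
        ("writer", content["writer"]["name"]),
    ]
    missing = []
    for kind, name in wanted:
        plugin_dir = os.path.join(datax_home, "plugin", kind, name)
        if not os.path.isdir(plugin_dir):
            missing.append(plugin_dir)
    return missing
```

Called with the DataX home directory and the parsed test.json, it would return the elasticsearchwriter directory in the situation described here, and an empty list once the plugin is installed.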

 


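3. Install the elasticsearchwriter plugin (for anyone who has never installed a plugin; skip this section if you have)

3.1 Pull the DataX source code onto the server, into DataX-master

3.2 Edit pom.xml in the project root and keep only the modules you need

<!-- The original file lists every module; usually you only build the ones you need -->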
<modules>
        <module>common</module>
        <module>core</module>
        <module>transformer</module>

        <!-- reader -->
        <module>mysqlreader</module>
        <module>drdsreader</module>
        <module>sqlserverreader</module>
        <module>postgresqlreader</module>
        <module>oraclereader</module>
        <module>odpsreader</module>
        <module>otsreader</module>
        <module>otsstreamreader</module>
        <module>txtfilereader</module>
        <module>hdfsreader</module>
        <module>streamreader</module>
        <module>ossreader</module>
        <module>ftpreader</module>
        <module>mongodbreader</module>
        <module>rdbmsreader</module>
        <module>hbase11xreader</module>
        <module>hbase094xreader</module>
        <module>tsdbreader</module>
        <module>opentsdbreader</module>
        <module>cassandrareader</module>
        <module>gdbreader</module>

        <!-- writer -->
        <module>mysqlwriter</module>
        <module>drdswriter</module>
        <module>odpswriter</module>
        <module>txtfilewriter</module>
        <module>ftpwriter</module>
        <module>hdfswriter</module>
        <module>streamwriter</module>
        <module>otswriter</module>
        <module>oraclewriter</module>
        <module>sqlserverwriter</module>
        <module>postgresqlwriter</module>
        <module>osswriter</module>
        <module>mongodbwriter</module>
        <module>adswriter</module>
        <module>ocswriter</module>
        <module>rdbmswriter</module>
        <module>hbase11xwriter</module>
        <module>hbase094xwriter</module>
        <module>hbase11xsqlwriter</module>
        <module>hbase11xsqlreader</module>
        <module>elasticsearchwriter</module>
        <module>tsdbwriter</module>
        <module>adbpgwriter</module>
        <module>gdbwriter</module>
        <module>cassandrawriter</module>
        <module>clickhousewriter</module>
        <!-- common support module -->
        <module>plugin-rdbms-util</module>
        <module>plugin-unstructured-storage-util</module>
        <module>hbase20xsqlreader</module>
        <module>hbase20xsqlwriter</module>
    </modules>

After the change:

<!-- Trimmed to the modules needed for a MySQL to Elasticsearch job -->
<modules>
        <module>common</module>
        <module>core</module>
        <module>transformer</module>

        <!-- reader -->
        <module>mysqlreader</module>
        

        <!-- writer -->
       
        <module>elasticsearchwriter</module>
        

        <!-- common support module -->
        <module>plugin-rdbms-util</module>
        <module>plugin-unstructured-storage-util</module>
        <module>hbase20xsqlreader</module>
        <module>hbase20xsqlwriter</module>
    </modules>
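If you would rather not edit pom.xml by hand, the trimming can be scripted. This is a rough sketch, assuming the POM declares the standard Maven namespace `http://maven.apache.org/POM/4.0.0`. The function name `trim_modules` and the `KEEP` set are hypothetical; `KEEP` mirrors the trimmed list above. Note that ElementTree drops XML comments when parsing, so the `<!-- reader -->` style markers will not survive.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"

# Module names kept in the trimmed pom.xml shown above.
KEEP = {
    "common", "core", "transformer",
    "mysqlreader", "elasticsearchwriter",
    "plugin-rdbms-util", "plugin-unstructured-storage-util",
    "hbase20xsqlreader", "hbase20xsqlwriter",
}

def trim_modules(pom_xml: str, keep: set) -> str:
    """Return pom_xml with every <module> whose name is not in `keep` removed."""
    ET.register_namespace("", POM_NS)  # serialize with a default namespace
    root = ET.fromstring(pom_xml)
    modules = root.find(f"{{{POM_NS}}}modules")
    if modules is None:
        raise ValueError("no <modules> element found")
    for module in list(modules):  # copy the list, since we mutate while iterating
        if (module.text or "").strip() not in keep:
            modules.remove(module)
    return ET.tostring(root, encoding="unicode")
```

Read the original pom.xml as text, pass it through `trim_modules(text, KEEP)`, and write the result back after keeping a backup of the original file.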

3.3 Build the elasticsearchwriter plugin

mvn clean install -Dmaven.test.skip=true

3.4 Copy the build output into /datax/plugin/, taking care to put readers under reader and writers under writer

cp -r /usr/local/DataX-master/elasticsearchwriter/target/datax/plugin/writer/elasticsearchwriter /usr/local/data/datax/datax/plugin/writer
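The reader/writer distinction is easy to get wrong when copying by hand. This is a minimal sketch of the same copy step that picks the subdirectory from the plugin name. It assumes the build output sits at `<source_root>/<name>/target/datax/plugin/<kind>/<name>`, as in the `cp` command above. The function name `install_plugin` is hypothetical.

```python
import os
import shutil

def install_plugin(source_root: str, datax_home: str, name: str) -> str:
    """Copy a built plugin into plugin/reader or plugin/writer under datax_home."""
    if name.endswith("reader"):
        kind = "reader"
    elif name.endswith("writer"):
        kind = "writer"
    else:
        raise ValueError(f"cannot tell whether {name!r} is a reader or a writer")
    built = os.path.join(source_root, name, "target", "datax", "plugin", kind, name)
    dest = os.path.join(datax_home, "plugin", kind, name)
    # dirs_exist_ok requires Python 3.8+; it lets a re-install overwrite files.
    shutil.copytree(built, dest, dirs_exist_ok=True)
    return dest
```

For the case in this post, the call would take the DataX-master checkout as `source_root`, the DataX installation directory as `datax_home`, and `"elasticsearchwriter"` as `name`.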

4. Re-run the DataX command: success!

python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"


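5. Incremental sync is driven by time: the reader's where clause selects rows whose updated_at falls between start_time and end_time, so each run only picks up rows modified in that window.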
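For a daily incremental run, the two `-D` parameters just need to carry the date of the day being synchronized. This is a minimal sketch that builds the same command line as in steps 2 and 4 for an arbitrary day. The function names `build_command` and `yesterday` are hypothetical, and the default paths are the ones used earlier in this post.

```python
from datetime import date, timedelta

def build_command(day: date,
                  job_file: str = "./test.json",
                  datax_py: str = "/usr/local/datax/bin/datax.py") -> list:
    """Build the argv for a one-day incremental run covering `day`."""
    stamp = day.isoformat()  # YYYY-MM-DD, the format the where clause expects
    params = f"-Dstart_time={stamp} -Dend_time={stamp}"
    return ["python", datax_py, job_file, "-p", params]

def yesterday(today: date) -> date:
    """Return the day before `today`, the usual target of a nightly job."""
    return today - timedelta(days=1)
```

A scheduler such as cron could then call `subprocess.run(build_command(yesterday(date.today())), check=True)` once a day.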
