Importing external MySQL data into ClickHouse: ClickHouse data migration

Reposted

mob64ca14061c9e 2023-12-08 13:32:26


When working with ClickHouse, you may need to migrate data between clusters. There are several ways to do this:

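DETACH/FREEZE the partitions, copy them with SCP, then ATTACH them
alter table db.table DETACH PARTITION [partition]; -- take the partition offline
alter table db.table FREEZE PARTITION [partition]; -- back up the partition
alter table db.table ATTACH PARTITION [partition]; -- bring the partition online
Use the remote function
insert into ... select * from remote('ip',db.table,'user','password')
The clickhouse-copier tool
This tool ships as part of the standard ClickHouse server distribution. It can work in fully parallel mode and distributes data in the most efficient way.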
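
To make the remote approach concrete: when the source data lives in local tables on several hosts, one INSERT ... SELECT is needed per source host. The following is a minimal Python sketch that only renders those statements and connects to nothing. The function name `remote_copy_statements`, the host list and the credentials are illustrative assumptions, not part of ClickHouse.

```python
def remote_copy_statements(hosts, database, table, user, password):
    """Render one INSERT ... SELECT FROM remote(...) statement per source host."""
    statements = []
    for host in hosts:
        statements.append(
            f"INSERT INTO {database}.{table} "
            f"SELECT * FROM remote('{host}', {database}.{table}, '{user}', '{password}')"
        )
    return statements


if __name__ == "__main__":
    # Placeholder hosts and credentials in the style of this article's examples.
    for sql in remote_copy_statements(
        ["10.10.1.1", "10.10.1.2"], "default", "test1", "user", "password"
    ):
        print(sql)
```

Each printed statement would be run on the target side, for example through clickhouse-client.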

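Pros and cons of the three approaches:

Method	Advantages	Disadvantages
DETACH/FREEZE	Suitable for small tables	Source and target clusters must have the same number of partitions; the procedure is tedious
remote	Suitable for small tables; easy to use	Slow for large tables
clickhouse-copier	Runs in parallel; can change the table name and primary key; can change the partitioning	Configuration is tedious; requires ZooKeeper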

This article focuses on how to use Clickhouse-copier.

Clickhouse-copier is a command-line tool that comes with the ClickHouse installation.

> clickhouse-copier --help

usage: clickhouse-copier --config-file <config-file> --task-path <task-path>
Copies tables from one cluster to another
--daemon	★ Run as a daemon
--umask=mask	Set the daemon's umask
--pidfile=path	Path to the PID file
-C<file>, --config-file=<file>	★ Configuration file (ZooKeeper connection settings, etc.)
-L<file>, --log-file=<file>	Log file
-E<file>, --errorlog-file=<file>	Error log file
-P<file>, --pid-file=<file>	PID file
--task-path=task-path	★ Task path in ZooKeeper
--safe-mode	★ Disable ALTER DROP PARTITION
--copy-fault-probability=copy-fault-probability	Make copying fail with the given probability (used to test partition state recovery)
--log-level=log-level	Log level, e.g. debug
--base-dir=base-dir	★ Base directory; defaults to the current path, where a directory named clickhouse-copier_<date>_<pid> is created
--help	Show help

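The options marked ★ are the important ones. Usually you only need to specify --daemon, --config and --task-path; the defaults are fine for everything else.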
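
As an illustration of the minimal invocation, here is a small Python sketch that assembles the argument list from the starred options in the help output. The helper name `copier_command` is hypothetical, and the file name and task path are just the values used elsewhere in this article.

```python
import shlex


def copier_command(config_file, task_path, daemon=True, base_dir=None):
    """Assemble a clickhouse-copier argument list from the starred options."""
    argv = ["clickhouse-copier", "--config-file", config_file, "--task-path", task_path]
    if base_dir is not None:
        # Optional: where the per-run working directory is created.
        argv += ["--base-dir", base_dir]
    if daemon:
        argv.append("--daemon")
    return argv


if __name__ == "__main__":
    argv = copier_command("zookeeper.xml", "/clickhouse/copytasks/task1")
    # Print the shell-quoted command; subprocess.run(argv, check=True) would launch it.
    print(shlex.join(argv))
```

Building the command as a list rather than a string avoids shell-quoting problems when paths contain spaces.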

Clickhouse-copier depends on ZooKeeper. To reduce network traffic, it is recommended to run clickhouse-copier on a server that holds the source data.

1. First, prepare a schema.xml configuration

It contains the shard layout of the source and target clusters, plus the tables to be synchronized.

<yandex>
    <!-- ClickHouse cluster nodes -->
    <remote_servers>
        <!-- Source cluster -->
        <source_cluster>
            <!-- Shard 01 -->
            <shard>
                <weight>1</weight>
                <replica>
                    <host>10.10.1.1</host>
                    <user>user</user>
                    <password>password</password>
                    <port>9000</port>
                </replica>
            </shard>
            <!-- Shard 02 -->
            <shard>
                <weight>1</weight>
                <replica>
                    <host>10.10.1.2</host>
                    <user>user</user>
                    <password>password</password>
                    <port>9000</port>
                </replica>
            </shard>
        </source_cluster>
        <!-- Target cluster -->
        <destination_cluster>
            <shard>
                <weight>1</weight>
                <replica>
                    <host>10.10.1.3</host>
                    <user>user</user>
                    <password>password</password>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <weight>1</weight>
                <replica>
                    <host>10.10.1.4</host>
                    <user>user</user>
                    <password>password</password>
                    <port>9000</port>
                </replica>
            </shard>
        </destination_cluster>
    </remote_servers>

    <!-- Maximum number of workers -->
    <max_workers>2</max_workers>
    <!-- Settings used when pulling data from the source nodes -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>
    <!-- Settings used when writing data to the target nodes -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>
    <!-- Settings shared by the pull and push sections above -->
    <settings>
        <connect_timeout>3</connect_timeout>
        <!-- Sync insert is set forcibly, leave it here just in case. -->
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>
    <!-- Tables to synchronize, one section per table -->
    <tables>
        <test1>
            <!-- Source -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>default</database_pull>
            <table_pull>test1</table_pull>
            <!-- Target -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>default</database_push>
            <table_push>test1</table_push>
            <engine>
                ENGINE=ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/default/test1', '{replica}')
                PARTITION BY toMonday(EventDate)
                ORDER BY (ID, EventDate)
            </engine>
            <sharding_key>rand()</sharding_key>
            <!--where_condition>ID != 0</where_condition-->
        </test1>
        <!-- Next table -->
    </tables>
</yandex>

For the schema.xml format, see: https://clickhouse.yandex/docs/en/operations/utils/clickhouse-copier/
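
Before uploading a task description, it can help to print what it will actually copy. The sketch below parses a trimmed-down task description with Python's standard `xml.etree.ElementTree` and reports the shard count per cluster and the source and target of each table. The function name `summarize_task` and the shortened `SAMPLE` document are assumptions made for illustration; the element names are the ones used in schema.xml above.

```python
import xml.etree.ElementTree as ET

# A shortened task description using the same element names as schema.xml.
SAMPLE = """
<yandex>
  <remote_servers>
    <source_cluster>
      <shard><replica><host>10.10.1.1</host><port>9000</port></replica></shard>
      <shard><replica><host>10.10.1.2</host><port>9000</port></replica></shard>
    </source_cluster>
    <destination_cluster>
      <shard><replica><host>10.10.1.3</host><port>9000</port></replica></shard>
    </destination_cluster>
  </remote_servers>
  <tables>
    <test1>
      <cluster_pull>source_cluster</cluster_pull>
      <database_pull>default</database_pull>
      <table_pull>test1</table_pull>
      <cluster_push>destination_cluster</cluster_push>
      <database_push>default</database_push>
      <table_push>test1</table_push>
    </test1>
  </tables>
</yandex>
"""


def summarize_task(xml_text):
    """Return (shard count per cluster, list of per-table copy descriptions)."""
    root = ET.fromstring(xml_text.strip())
    # Each child of <remote_servers> is a cluster; count its <shard> elements.
    shards = {cluster.tag: len(cluster.findall("shard")) for cluster in root.find("remote_servers")}
    tables = []
    for t in root.find("tables"):
        source = f"{t.findtext('cluster_pull')}:{t.findtext('database_pull')}.{t.findtext('table_pull')}"
        target = f"{t.findtext('cluster_push')}:{t.findtext('database_push')}.{t.findtext('table_push')}"
        tables.append({"name": t.tag, "source": source, "target": target})
    return shards, tables


if __name__ == "__main__":
    shards, tables = summarize_task(SAMPLE)
    print(shards)
    for entry in tables:
        print(f"{entry['name']}: {entry['source']} -> {entry['target']}")
```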

2. Once schema.xml is ready, upload it to a specific path in ZooKeeper (/<task-path>/description)

Multiple tasks can be created.

Run the following commands on any ZooKeeper node:

> ./zkCli.sh create /clickhouse/copytasks ""
> ./zkCli.sh create /clickhouse/copytasks/task1 ""
> ./zkCli.sh create /clickhouse/copytasks/task1/description "`cat schema.xml`"
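
A plain ZooKeeper create fails when the parent znode does not exist, which is why the commands above create each level in turn (and assume /clickhouse is already there). The following Python sketch lists the znodes needed for a given task path, parents first. The function name `znodes_to_create` is hypothetical, and the script only prints commands; it does not talk to ZooKeeper.

```python
def znodes_to_create(task_path):
    """List the znodes for a copier task, parents first, ending with .../description."""
    parts = [p for p in task_path.strip("/").split("/") if p]
    if not parts:
        raise ValueError("task_path must contain at least one path component")
    # Build every prefix: /a, /a/b, /a/b/c, ...
    paths = ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
    # The task description itself lives in a child node named "description".
    paths.append(paths[-1] + "/description")
    return paths


if __name__ == "__main__":
    for path in znodes_to_create("/clickhouse/copytasks/task1"):
        if path.endswith("/description"):
            print(f'./zkCli.sh create {path} "`cat schema.xml`"')
        else:
            # Nodes that already exist (for example /clickhouse) can be skipped.
            print(f'./zkCli.sh create {path} ""')
```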

3. Prepare the zookeeper.xml configuration file

<yandex>
    <zookeeper>
        <node index="1">
            <host>10.1.1.5</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <logger>
        <level>trace</level>
        <log>./log/log.log</log>
        <errorlog>./log/log.err.log</errorlog>
        <size>1000M</size>
        <count>10</count>
        <stderr>./log/stderr.log</stderr>
        <stdout>./log/stdout.log</stdout>
    </logger>
</yandex>

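4. Start the tool on a ClickHouse machine

Either the source or the target works. To reduce network traffic, it is recommended to run clickhouse-copier on a server that holds the source data.

> clickhouse-copier --config zookeeper.xml --task-path /clickhouse/copytasks/task1 --daemon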

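After the tool starts, it takes a while to finish, depending on the size of the tables being copied. If --base-dir is not specified, a directory named clickhouse-copier_<time>_<pid> is created in the current directory. It contains two log files that show copy errors and details.

log.err.log # errors; the error recorded here is that the target table does not exist, but the table is then created automatically

log.log # detailed execution log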
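
To check on a run without hunting for the directory by hand, a small script can locate the newest clickhouse-copier_* directory and print the non-empty lines of its error log. This is a sketch under assumptions: the helper names are hypothetical, and because the exact position of the log files inside the run directory depends on the logger paths in zookeeper.xml, the code searches the run directory recursively.

```python
from pathlib import Path


def latest_copier_run(base_dir="."):
    """Return the most recently modified clickhouse-copier_* directory, or None."""
    runs = [p for p in Path(base_dir).glob("clickhouse-copier_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def error_lines(run_dir, log_name="log.err.log"):
    """Collect non-empty lines from every log_name file found under run_dir."""
    lines = []
    for path in sorted(Path(run_dir).rglob(log_name)):
        text = path.read_text(errors="replace")
        lines.extend(line for line in text.splitlines() if line.strip())
    return lines


if __name__ == "__main__":
    run = latest_copier_run(".")
    if run is None:
        print("no clickhouse-copier_* directory found here")
    else:
        print(f"latest run: {run}")
        for line in error_lines(run):
            print(line)
```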

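5. Notes
(1) The source and target cluster names must not be the same; otherwise the source will be overwritten (important).
(2) If the target database does not exist, it is not created automatically; create it in advance.
(3) If the target table does not exist, it is created automatically.
(4) When copying a table, the table must be created manually on the replicas.
(5) The target may use a different table name, partitioning, sort order, etc.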
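
The first note can be checked mechanically before a task is uploaded. The sketch below reads it as "cluster_pull and cluster_push of a table must name different clusters", which is an interpretation rather than something the tool documents, and also flags cluster names that are missing from remote_servers. The function name `preflight_problems` and the tiny `TEMPLATE` document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal task description; {pull} and {push} are filled in per test case.
TEMPLATE = """<yandex>
  <remote_servers>
    <source_cluster><shard/></source_cluster>
    <destination_cluster><shard/></destination_cluster>
  </remote_servers>
  <tables>
    <test1>
      <cluster_pull>{pull}</cluster_pull>
      <cluster_push>{push}</cluster_push>
    </test1>
  </tables>
</yandex>"""


def preflight_problems(xml_text):
    """Return a list of human-readable problems found in a task description."""
    root = ET.fromstring(xml_text)
    clusters = {cluster.tag for cluster in root.find("remote_servers")}
    problems = []
    for t in root.find("tables"):
        pull, push = t.findtext("cluster_pull"), t.findtext("cluster_push")
        if pull == push:
            problems.append(f"{t.tag}: cluster_pull and cluster_push are both '{pull}'")
        # dict.fromkeys removes duplicates while keeping the pull/push order.
        for name in dict.fromkeys((pull, push)):
            if name not in clusters:
                problems.append(f"{t.tag}: cluster '{name}' is not defined in remote_servers")
    return problems


if __name__ == "__main__":
    bad = TEMPLATE.format(pull="source_cluster", push="source_cluster")
    for problem in preflight_problems(bad):
        print(problem)
```

The remaining notes (an existing target database, tables on replicas) depend on the live clusters and cannot be verified from the XML alone.
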
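6. Performance

Source	Target	Rows	Data size	Time
1 shard	1 shard	6 billion	80 GB	180 min
1 shard	2 shards	6 billion	80 GB	180 min
2 shards	2 shards	6 billion	80 GB	60 min
6 shards	6 shards	6 billion	80 GB	40 min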
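
The timings are easier to compare as throughput. The short Python calculation below converts each row of the table into rows per second and MiB per second, assuming "80G" means 80 GiB (1024-based), which the source does not state. The names `ROWS`, `SIZE_GIB`, `RUNS` and `throughput` are introduced here for illustration.

```python
ROWS = 6_000_000_000   # 6 billion rows, from the table
SIZE_GIB = 80          # "80G" in the table, assumed to be GiB

# (source shards, target shards, minutes), taken from the table above.
RUNS = [(1, 1, 180), (1, 2, 180), (2, 2, 60), (6, 6, 40)]


def throughput(rows, size_gib, minutes):
    """Return (rows per second, MiB per second) for one run."""
    seconds = minutes * 60
    return rows / seconds, size_gib * 1024 / seconds


if __name__ == "__main__":
    baseline_minutes = RUNS[0][2]
    for src, dst, minutes in RUNS:
        rows_per_s, mib_per_s = throughput(ROWS, SIZE_GIB, minutes)
        speedup = baseline_minutes / minutes
        print(f"{src} -> {dst} shards: {rows_per_s:,.0f} rows/s, "
              f"{mib_per_s:.1f} MiB/s, {speedup:.1f}x vs 1 -> 1")
```

Seen this way, adding target shards alone (1 to 2) changes nothing in these measurements, while adding source shards does.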

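7. Conclusion

Clickhouse-copier can migrate data between clusters, and it can also be used to reshard data or to change a table's name and primary key. In a one-to-one setup its performance is the same as insert ... select, but on a large ClickHouse cluster its copy speed improves considerably, and it avoids many of the problems that parallel copy tasks bring.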
