
Flume: collecting MySQL incremental data into Kafka in real time (notes)

1: Install Kafka

----------------

         0. Pick three hosts on which to install Kafka

         1. Set up ZooKeeper (zk)

         2. Install the JDK

         3. Extract kafka_2.11-2.2.0.tgz with tar

         4. Environment variables

                   Add the following to /etc/profile:

                   export KAFKA_HOME=/opt/kafka

                   export PATH=$PATH:$KAFKA_HOME/bin

 

         5. Configure Kafka

                   [kafka/config/server.properties]

                   ...

                   broker.id=0

                   ...

                   listeners=PLAINTEXT://:9092

                   ...

                   log.dirs=/opt/kafka/kafka-logs

                   ...

                   zookeeper.connect=server2:2181,server3:2181,server4:2181

        

         6. Distribute server.properties to the other hosts and set broker.id in each copy to 0, 1, and 2 respectively

        

         7. Start the Kafka servers

                   a) Start ZooKeeper first

                   b) Start Kafka

                            [server1  server3 server4]

                             $>bin/kafka-server-start.sh config/server.properties

 

                   c) Verify that the Kafka server has started

                             $>netstat -anop | grep 9092

        

         8. Create a topic

                    $>bin/kafka-topics.sh --create --zookeeper server3:2181 --replication-factor 3 --partitions 3 --topic test

 

         9. List the topics

                    $>bin/kafka-topics.sh --list --zookeeper server1:2181

 

         10. Start a console producer

                    $>bin/kafka-console-producer.sh --broker-list server3:9092 --topic test

 

         11. Start a console consumer

                    $>bin/kafka-console-consumer.sh --bootstrap-server server3:9092 --topic test --from-beginning

 

         12. Type helloworld in the producer console; the consumer receives the message (a programmatic Java producer is sketched after this list)

        

         13. Delete a topic

                    $>bin/kafka-topics.sh --delete --topic test --zookeeper server3:2181
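
         For reference, the helloworld message from step 12 can also be sent programmatically with the Kafka Java producer client. The following is a minimal sketch, assuming the broker server3:9092 and topic test from the commands above and that the kafka-clients jar is on the classpath (the class name HelloProducer is illustrative):

                    import java.util.Properties;
                    import org.apache.kafka.clients.producer.KafkaProducer;
                    import org.apache.kafka.clients.producer.ProducerRecord;

                    public class HelloProducer {
                        public static void main(String[] args) {
                            Properties props = new Properties();
                            // Broker address taken from the console-producer command above
                            props.put("bootstrap.servers", "server3:9092");
                            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

                            // Send one record to the "test" topic; the console consumer should print it
                            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                                producer.send(new ProducerRecord<>("test", "helloworld"));
                                producer.flush();
                            }
                        }
                    }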

 

 

 

 

2: Install Flume

-------------------------------------

 

         1. Download apache-flume-1.7.0-bin.tar.gz

         2. Extract it to the installation directory

         3. Configure environment variables

                   Add the following to /etc/profile:

                   export FLUME_HOME=/opt/flume

                   export PATH=$PATH:$FLUME_HOME/bin

         4. Verify that Flume was installed correctly

                    $>flume-ng version

 

3: Configure Flume to fetch MySQL incremental data and sink it to Kafka (note: start the ZooKeeper and Kafka clusters first)

------------------------------------

         1. Download the two jars mysql-connector-java-5.1.48.jar and flume-ng-sql-source-json-1.0.jar and upload them to the FLUME_HOME/lib/ directory

        

         2. Copy the template configuration file flume-conf.properties.template to mysql_kafka_test.conf; the file contents are as follows:

        

                    # Data source

                   sync.sources= s-1

                    # Data channel

                   sync.channels= c-1

                    # Data sinks. Failover is configured here: based on the priority settings below, the higher-priority sink is used first and the other one takes over when it fails

                   sync.sinks= k-1 k-2

 

                    # The key to configuring failover: a sink group is required

                   sync.sinkgroups= g-1

                   sync.sinkgroups.g-1.sinks= k-1 k-2

                    # The processor type is failover

                   sync.sinkgroups.g-1.processor.type= failover

                    # Priorities: the larger the number, the higher the priority; each sink must have a different priority

                   sync.sinkgroups.g-1.processor.priority.k-1= 5

                   sync.sinkgroups.g-1.processor.priority.k-2= 10

                    # Set to 10 seconds; adjust it faster or slower to match your situation

                   sync.sinkgroups.g-1.processor.maxpenalty= 10000

 

                    ########## Channel definition

                    # The data volume is small, so the channel is kept in memory; JDBC, Kafka, or file-based channels are also possible

                   sync.channels.c-1.type= memory

                    # Maximum number of events held in the channel queue

                   sync.channels.c-1.capacity= 100000

                    # Maximum length of the putList/takeList queues: the sink grabs batchSize events at a time from the channel's capacity and puts them in this queue, so this value should be smaller than capacity but larger than the sink's batchSize.

                    # Official definition: The maximum number of events the channel will take from a source or give to a sink per transaction.

                   sync.channels.c-1.transactionCapacity= 1000

                   sync.channels.c-1.byteCapacityBufferPercentage= 20

                    ### The default is 80% of the maximum memory available to the JVM, so this can be left unset

                   #sync.channels.c-1.byteCapacity = 800000

 

                   #########sqlsource#################

                    # Channel used by source s-1; it must match the sinks' channel, otherwise nothing gets through

                   sync.sources.s-1.channels=c-1

                   #########For each one of the sources, the type is defined

                   sync.sources.s-1.type= org.keedio.flume.source.SQLSource

                   sync.sources.s-1.hibernate.connection.url= jdbc:mysql://10.22.20.70:3306/avatar

                   #########Hibernate Database connection properties

                   sync.sources.s-1.hibernate.connection.user= root

                   sync.sources.s-1.hibernate.connection.password= root

                   sync.sources.s-1.hibernate.connection.autocommit= true

                   sync.sources.s-1.hibernate.dialect= org.hibernate.dialect.MySQL5Dialect

                   sync.sources.s-1.hibernate.connection.driver_class= com.mysql.jdbc.Driver

                   sync.sources.s-1.run.query.delay=10000

                   sync.sources.s-1.status.file.path= /opt/flume/record-datas/avatar/test_status

                    # For how this is used, see the startup notes above

                   sync.sources.s-1.status.file.name= sqlSource.status

                   ########Custom query

                   sync.sources.s-1.start.from= 0

                    # The SQL statement is customizable, but note: the incremental column can only be the id field, i.e. the primary key (this is the default behaviour as tested), and the primary key must be included in the SELECT,

                    # because without it Flume cannot record where the previous query stopped. $@$ stands for the last value of the incremental column from the previous query, recorded in the status file under test_status.

                    sync.sources.s-1.custom.query= select a.id, a.cusId, b.flag, b.num from a_test a inner join a_test1 b on a.id=b.id where a.id > $@$ order by a.id asc

                   sync.sources.s-1.batch.size= 100

                   sync.sources.s-1.max.rows= 100

                   sync.sources.s-1.hibernate.connection.provider_class= org.hibernate.connection.C3P0ConnectionProvider

                   sync.sources.s-1.hibernate.c3p0.min_size=5

                   sync.sources.s-1.hibernate.c3p0.max_size=20

 

                   #########sinks 1

                    # Channel used by sink k-1; it must match the source's channel, otherwise no data can be fetched

                   sync.sinks.k-1.channel= c-1

                   sync.sinks.k-1.type= org.apache.flume.sink.kafka.KafkaSink

                   sync.sinks.k-1.kafka.topic= avatar_test

                   sync.sinks.k-1.kafka.bootstrap.servers= server3:9092

                   sync.sinks.k-1.kafka.producer.acks= 1

                    # Number of events processed per batch

                   sync.sinks.k-1.kafka.flumeBatchSize  = 100

 

                   #########sinks 2

                    # Channel used by sink k-2; it must match the source's channel, otherwise no data can be fetched

                   sync.sinks.k-2.channel= c-1

                   sync.sinks.k-2.type= org.apache.flume.sink.kafka.KafkaSink

                   sync.sinks.k-2.kafka.topic= avatar_test

                   sync.sinks.k-2.kafka.bootstrap.servers= server4:9092

                   sync.sinks.k-2.kafka.producer.acks= 1

                   sync.sinks.k-2.kafka.flumeBatchSize  = 100

        

         3. Make sure to create the directory that holds the SQL query status records: /opt/flume/record-datas/avatar/test_status

 

         4. Start Flume by running the following command from FLUME_HOME:

                    flume-ng agent -c conf -f conf/mysql_kafka_test.conf -n sync -Dflume.root.logger=INFO,console

 

         5. Start a Kafka consumer to watch the MySQL incremental data:

                    kafka-console-consumer.sh --bootstrap-server server3:9092 --topic avatar_test --from-beginning

                  

         6. You can also write Java code that uses the Kafka client to create a consumer that receives the data and uses the Hadoop client to write it to HDFS in real time, as sketched below.
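
         A minimal sketch of such a program follows. The class name, consumer group id, NameNode address hdfs://server1:8020 and output path are illustrative assumptions; it assumes the kafka-clients and hadoop-client jars are on the classpath and that avatar_test is the topic produced by the Flume sink above:

                    import java.net.URI;
                    import java.time.Duration;
                    import java.util.Collections;
                    import java.util.Properties;
                    import org.apache.hadoop.conf.Configuration;
                    import org.apache.hadoop.fs.FSDataOutputStream;
                    import org.apache.hadoop.fs.FileSystem;
                    import org.apache.hadoop.fs.Path;
                    import org.apache.kafka.clients.consumer.ConsumerRecord;
                    import org.apache.kafka.clients.consumer.ConsumerRecords;
                    import org.apache.kafka.clients.consumer.KafkaConsumer;

                    public class AvatarTestToHdfs {
                        public static void main(String[] args) throws Exception {
                            // Kafka consumer settings; broker and topic follow the Flume sink configuration above
                            Properties props = new Properties();
                            props.put("bootstrap.servers", "server3:9092");
                            props.put("group.id", "avatar_hdfs_writer");   // illustrative group id
                            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

                            // HDFS output file; the NameNode URI and path are assumptions, adjust to your cluster
                            Configuration conf = new Configuration();
                            FileSystem fs = FileSystem.get(URI.create("hdfs://server1:8020"), conf);
                            Path out = new Path("/user/test/avatar_test.data");
                            FSDataOutputStream os = fs.exists(out) ? fs.append(out) : fs.create(out);

                            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                                consumer.subscribe(Collections.singletonList("avatar_test"));
                                while (true) {
                                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                                    for (ConsumerRecord<String, String> r : records) {
                                        // Write one line per Kafka record
                                        os.write((r.value() + "\n").getBytes("UTF-8"));
                                    }
                                    os.hflush();   // flush so the data becomes visible to HDFS readers
                                }
                            } finally {
                                os.close();
                                fs.close();
                            }
                        }
                    }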

          


          

Configuration file: Flume reads data from Kafka and sinks it to HDFS ----  kafka_hdfs.conf

 

 

                   a1.sources= r1

                   a1.sinks= k1

                   a1.channels= c1

 

                   a1.sources.r1.type= org.apache.flume.source.kafka.KafkaSource

                   a1.sources.r1.batchSize= 5000

                   a1.sources.r1.batchDurationMillis= 2000

                   a1.sources.r1.kafka.bootstrap.servers= server3:9092

                   a1.sources.r1.kafka.topics= test

                   a1.sources.r1.kafka.consumer.group.id= g1

 

                   a1.sinks.k1.type= hdfs

                   a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S

                   a1.sinks.k1.hdfs.filePrefix= events-

                   a1.sinks.k1.hdfs.round= true

                   a1.sinks.k1.hdfs.roundValue= 20

                   a1.sinks.k1.hdfs.roundUnit= second

                   a1.sinks.k1.hdfs.useLocalTimeStamp=true

                   a1.sinks.k1.hdfs.rollInterval=10

                   a1.sinks.k1.hdfs.rollSize=10

                   a1.sinks.k1.hdfs.rollCount=3

 

                   a1.channels.c1.type=memory

 

                   a1.sources.r1.channels= c1

                   a1.sinks.k1.channel= c1

                  

Configuration file: Flume tails a log file in real time and sinks it to HDFS

 

                   a1.sources= r1

                   a1.sinks= k1

                   a1.channels= c1

 

                   a1.sources.r1.type=exec

                    a1.sources.r1.command=tail -F /home/server/test.log

 

                   a1.sinks.k1.type= hdfs

                   a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S

                   a1.sinks.k1.hdfs.filePrefix= events-

                   a1.sinks.k1.hdfs.round= true

                   a1.sinks.k1.hdfs.roundValue= 20

                   a1.sinks.k1.hdfs.roundUnit= second

                   a1.sinks.k1.hdfs.useLocalTimeStamp=true

                   a1.sinks.k1.hdfs.rollInterval=10

                   a1.sinks.k1.hdfs.rollSize=10

                   a1.sinks.k1.hdfs.rollCount=3

 

                   a1.channels.c1.type=memory

 

                   a1.sources.r1.channels= c1

                   a1.sinks.k1.channel= c1

                  

Configuration file: Flume watches a log directory (spooling) in real time and sinks it to HDFS

 

                   a1.sources= r1

                   a1.sinks= k1

                   a1.channels= c1

 

                   a1.sources.r1.type=spooldir

                   a1.sources.r1.spoolDir=/home/server/spool

                   a1.sources.r1.fileHeader=true

 

                   a1.sinks.k1.type= hdfs

                   a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S

                   a1.sinks.k1.hdfs.filePrefix= events-

                   a1.sinks.k1.hdfs.round= true

                   a1.sinks.k1.hdfs.roundValue= 20

                   a1.sinks.k1.hdfs.roundUnit= second

                   a1.sinks.k1.hdfs.useLocalTimeStamp=true

                   a1.sinks.k1.hdfs.rollInterval=10

                   a1.sinks.k1.hdfs.rollSize=10

                   a1.sinks.k1.hdfs.rollCount=3

 

                   a1.channels.c1.type=memory

 

                   a1.sources.r1.channels= c1

                   a1.sinks.k1.channel= c1