Server software installation:
Part 1: Install Kafka
----------------
0. Choose three hosts on which to install Kafka
1. Prepare ZooKeeper (zk)
2. Install the JDK
3. Untar kafka_2.11-2.2.0.tgz
4. Configure the environment variables
Contents of /etc/profile:
export KAFKA_HOME=/opt/kafka
export PATH=$PATH:$KAFKA_HOME/bin
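After editing, reload /etc/profile so the variables take effect in the current shell:
$>source /etc/profile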
5. Configure Kafka
[kafka/config/server.properties]
...
broker.id=0
...
listeners=PLAINTEXT://:9092
...
log.dirs=/opt/kafka/kafka-logs
...
zookeeper.connect=server2:2181,server3:2181,server4:2181
6. Distribute server.properties to the other hosts and set broker.id in each copy to 0, 1, and 2 respectively (a scripted sketch follows)
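A minimal sketch of step 6, assuming passwordless ssh and that server1 keeps broker.id=0 while server3 and server4 (the other two brokers used in step 7) get 1 and 2; host names follow this document, adjust to yours:
$>scp /opt/kafka/config/server.properties server3:/opt/kafka/config/
$>scp /opt/kafka/config/server.properties server4:/opt/kafka/config/
$>ssh server3 "sed -i 's/^broker.id=.*/broker.id=1/' /opt/kafka/config/server.properties"
$>ssh server4 "sed -i 's/^broker.id=.*/broker.id=2/' /opt/kafka/config/server.properties"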
7. Start the Kafka servers
a) Start zk first
b) Start Kafka
[server1 server3 server4]
$>bin/kafka-server-start.sh config/server.properties
c) Verify that the Kafka server is up
$>netstat -anop | grep 9092
8. Create a topic
$>bin/kafka-topics.sh --create --zookeeper server3:2181 --replication-factor 3 --partitions 3 --topic test
9. List topics (the zk quorum is server2/3/4 per zookeeper.connect above, so point the tool at one of those)
$>bin/kafka-topics.sh --list --zookeeper server3:2181
10. Start a console producer
$>bin/kafka-console-producer.sh --broker-list server3:9092 --topic test
11. Start a console consumer (in Kafka 2.2 the console consumer no longer accepts --zookeeper; use --bootstrap-server only)
$>bin/kafka-console-consumer.sh --bootstrap-server server3:9092 --topic test --from-beginning
12. Type helloworld in the producer console; the consumer should receive the message
13. Delete the topic
$>bin/kafka-topics.sh --delete --topic test --zookeeper server3:2181
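Note: deletion only takes effect if delete.topic.enable=true in server.properties; it defaults to true in Kafka 2.2, but on older versions the topic is merely marked for deletion.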
Part 2: Install Flume
-------------------------------------
1. Download apache-flume-1.7.0-bin.tar.gz
2. Untar it into the installation directory
3. Configure the environment variables
Contents of /etc/profile:
export FLUME_HOME=/opt/flume
export PATH=$PATH:$FLUME_HOME/bin
4. Verify that Flume was installed successfully
$>flume-ng version
Part 3: Configure Flume to pull MySQL incremental data and sink it to Kafka. Note: start the zk cluster and the Kafka cluster first.
------------------------------------
1. Download the two jars mysql-connector-java-5.1.48.jar and flume-ng-sql-source-json-1.0.jar and upload them to FLUME_HOME/lib/
2. Copy the template config file flume-conf.properties.template to mysql_kafka_test.conf with the following contents:
# Data source
sync.sources= s-1
# Data channel
sync.channels= c-1
# Data destinations. Failover is configured here: per the priorities below, k-2 (priority 10) is active first, and k-1 takes over after k-2 dies
sync.sinks= k-1 k-2
# The key to configuring failover: a sink group is required
sync.sinkgroups= g-1
sync.sinkgroups.g-1.sinks= k-1 k-2
# The processor type is failover
sync.sinkgroups.g-1.processor.type= failover
# Priorities: the larger the number, the higher the priority; every sink must have a distinct priority
sync.sinkgroups.g-1.processor.priority.k-1= 5
sync.sinkgroups.g-1.processor.priority.k-2= 10
# Back-off ceiling for a failed sink, in milliseconds; set to 10 seconds here, adjust faster or slower to suit your situation
sync.sinkgroups.g-1.processor.maxpenalty= 10000
########## Channel definition
# The volume is modest, so events go straight into memory; JDBC, Kafka, or file-backed channels are also options
sync.channels.c-1.type= memory
# Maximum number of events the channel queue can hold
sync.channels.c-1.capacity= 100000
# Maximum size of the putList/takeList queues. Per transaction a sink grabs batchSize events from the channel into this queue, so this value is best kept smaller than capacity and larger than the sink's batch size.
# Official definition: The maximum number of events the channel will take from a source or give to a sink per transaction.
sync.channels.c-1.transactionCapacity= 1000
sync.channels.c-1.byteCapacityBufferPercentage= 20
### byteCapacity defaults to 80% of the JVM's maximum available memory and can be left unset
#sync.channels.c-1.byteCapacity = 800000
######### sql source #################
# Channel used by source s-1; it must match the sinks' channel, otherwise nothing flows
sync.sources.s-1.channels=c-1
#########For each one of the sources, the type is defined
sync.sources.s-1.type= org.keedio.flume.source.SQLSource
######### Hibernate database connection properties
sync.sources.s-1.hibernate.connection.url= jdbc:mysql://10.22.20.70:3306/avatar
sync.sources.s-1.hibernate.connection.user= root
sync.sources.s-1.hibernate.connection.password= root
sync.sources.s-1.hibernate.connection.autocommit= true
sync.sources.s-1.hibernate.dialect= org.hibernate.dialect.MySQL5Dialect
sync.sources.s-1.hibernate.connection.driver_class= com.mysql.jdbc.Driver
sync.sources.s-1.run.query.delay=10000
sync.sources.s-1.status.file.path= /opt/flume/record-datas/avatar/test_status
# For how the status file is used, see the startup notes above
sync.sources.s-1.status.file.name= sqlSource.status
########Custom query
sync.sources.s-1.start.from= 0
# The SQL is customizable, but note: incremental loading only works against the id field, i.e. the primary-key column (this is the tested default), and the primary key must be included in the SELECT,
# because without it Flume cannot record where the last query stopped. $@$ stands for the last queried value of the incremental column, recorded in the status file under test_status
sync.sources.s-1.custom.query= select a.id, a.cusId, b.flag, b.num from a_test a inner join a_test1 b on a.id=b.id where a.id > $@$ order by a.id asc
sync.sources.s-1.batch.size= 100
sync.sources.s-1.max.rows= 100
sync.sources.s-1.hibernate.connection.provider_class= org.hibernate.connection.C3P0ConnectionProvider
sync.sources.s-1.hibernate.c3p0.min_size=5
sync.sources.s-1.hibernate.c3p0.max_size=20
######### sinks 1
# Channel used by sink k-1; it must match the source's channel, otherwise no data can be fetched
sync.sinks.k-1.channel= c-1
sync.sinks.k-1.type= org.apache.flume.sink.kafka.KafkaSink
sync.sinks.k-1.kafka.topic= avatar_test
sync.sinks.k-1.kafka.bootstrap.servers= server3:9092
sync.sinks.k-1.kafka.producer.acks= 1
# Number of events handled per batch
sync.sinks.k-1.kafka.flumeBatchSize = 100
######### sinks 2
# Channel used by sink k-2; it must match the source's channel, otherwise no data can be fetched
sync.sinks.k-2.channel= c-1
sync.sinks.k-2.type= org.apache.flume.sink.kafka.KafkaSink
sync.sinks.k-2.kafka.topic= avatar_test
sync.sinks.k-2.kafka.bootstrap.servers= server4:9092
sync.sinks.k-2.kafka.producer.acks= 1
sync.sinks.k-2.kafka.flumeBatchSize = 100
3. Note: create the SQL query status directory /opt/flume/record-datas/avatar/test_status beforehand
4. Start Flume; under FLUME_HOME run the following command:
flume-ng agent -c conf -f conf/mysql_kafka_test.conf -n sync -Dflume.root.logger=INFO,console
5. Start a Kafka console consumer to watch the MySQL incremental data (again with --bootstrap-server only):
kafka-console-consumer.sh --bootstrap-server server3:9092 --topic avatar_test --from-beginning
6. You can also do this in code: use the Kafka client to create a Java consumer that receives the data, and the Hadoop client to write it to HDFS in real time, as sketched below
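A minimal sketch of step 6, assuming the kafka-clients (2.x) and hadoop-client jars are on the classpath. The broker and topic come from the config above; the NameNode address, consumer group id, and output path are illustrative placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHdfs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "server3:9092");          // broker from the config above
        props.put("group.id", "avatar_hdfs_writer");             // hypothetical group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://server1:8020");         // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        // Append to a single file for simplicity; a production job would roll
        // files by time or size, like Flume's HDFS sink does.
        Path out = new Path("/user/test/avatar_test.data");      // hypothetical output path
        FSDataOutputStream stream = fs.exists(out) ? fs.append(out) : fs.create(out);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("avatar_test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    stream.write((record.value() + "\n").getBytes("UTF-8"));
                }
                if (!records.isEmpty()) {
                    stream.hflush();   // make the appended data visible to HDFS readers
                }
            }
        } finally {
            stream.close();
        }
    }
}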
Part 4: Config file: Flume reads from Kafka and sinks to HDFS ---- kafka_hdfs.conf
a1.sources= r1
a1.sinks= k1
a1.channels= c1
a1.sources.r1.type= org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize= 5000
a1.sources.r1.batchDurationMillis= 2000
a1.sources.r1.kafka.bootstrap.servers= server3:9092
a1.sources.r1.kafka.topics= test
a1.sources.r1.kafka.consumer.group.id= g1
a1.sinks.k1.type= hdfs
a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S
a1.sinks.k1.hdfs.filePrefix= events-
a1.sinks.k1.hdfs.round= true
a1.sinks.k1.hdfs.roundValue= 20
a1.sinks.k1.hdfs.roundUnit= second
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval=10
a1.sinks.k1.hdfs.rollSize=10
a1.sinks.k1.hdfs.rollCount=3
a1.channels.c1.type=memory
a1.sources.r1.channels= c1
a1.sinks.k1.channel= c1
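To run the Part 4 agent, start it the same way as in Part 3, pointing -n at agent a1:
$>flume-ng agent -c conf -f conf/kafka_hdfs.conf -n a1 -Dflume.root.logger=INFO,console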
Part 5: Config file: Flume tails a log file in real time and sinks to HDFS
a1.sources= r1
a1.sinks= k1
a1.channels= c1
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /home/server/test.log
a1.sinks.k1.type= hdfs
a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S
a1.sinks.k1.hdfs.filePrefix= events-
a1.sinks.k1.hdfs.round= true
a1.sinks.k1.hdfs.roundValue= 20
a1.sinks.k1.hdfs.roundUnit= second
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval=10
a1.sinks.k1.hdfs.rollSize=10
a1.sinks.k1.hdfs.rollCount=3
a1.channels.c1.type=memory
a1.sources.r1.channels= c1
a1.sinks.k1.channel= c1
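Started the same way, assuming the file is saved as conf/exec_hdfs.conf (the name is illustrative; this section does not fix one):
$>flume-ng agent -c conf -f conf/exec_hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Note that the exec source offers no delivery guarantee: if the agent dies, lines tail has read but not yet put on the channel are lost.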
Part 6: Config file: Flume watches a spooling directory and sinks to HDFS
a1.sources= r1
a1.sinks= k1
a1.channels= c1
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/server/spool
a1.sources.r1.fileHeader=true
a1.sinks.k1.type= hdfs
a1.sinks.k1.hdfs.path= /user/test/flume/%y/%m/%d/%H/%M/%S
a1.sinks.k1.hdfs.filePrefix= events-
a1.sinks.k1.hdfs.round= true
a1.sinks.k1.hdfs.roundValue= 20
a1.sinks.k1.hdfs.roundUnit= second
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval=10
a1.sinks.k1.hdfs.rollSize=10
a1.sinks.k1.hdfs.rollCount=3
a1.channels.c1.type=memory
a1.sources.r1.channels= c1
a1.sinks.k1.channel= c1
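Started the same way, again with an illustrative file name:
$>flume-ng agent -c conf -f conf/spooldir_hdfs.conf -n a1 -Dflume.root.logger=INFO,console
The spooling-directory source requires that files dropped into /home/server/spool be complete and never modified afterwards; Flume renames each processed file with a .COMPLETED suffix.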