Flume data collection: Flume reads log data with a delay

Reposted

flyingsmiling 2024-03-24 12:01:41


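Problem:

Data from database A needs to be synchronized to database B. This is done by collecting the SQL operation log of database A and executing the statements on database B. While collecting from database A, Flume reads log entries later than they are written, and the delay keeps growing.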

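Solution:

Use a custom regular expression filter for interceptor i3 to filter the data.

Custom regular expression filter: CustomRegexFilteringInterceptor

It matches with the matches() method: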

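// Compile the pattern once and reuse it, rather than compiling it for every event
Pattern pattern = Pattern.compile(regex);
// matches() succeeds only if the whole content matches the pattern
if (pattern.matcher(content).matches()) {
    return event;
}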
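The full interceptor class is not shown in the source. As a minimal sketch of the matching logic such an interceptor might wrap, the following uses only the Java standard library. The class name `FullMatchFilter`, its constructor parameters and the `keep` method are hypothetical names chosen for illustration, and the `excludeEvents` flag is assumed to behave like the setting of the same name in the configuration. In a real interceptor this check would be called on the event body from Flume's `Interceptor.intercept(Event)`, which returns the event to keep it or null to drop it.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical helper: full-match regex filter with excludeEvents-style semantics. */
class FullMatchFilter {
    // Compiled once in the constructor and reused for every event
    private final Pattern pattern;
    // true: drop events that match; false: keep only events that match
    private final boolean excludeEvents;

    FullMatchFilter(String regex, boolean excludeEvents) {
        this.pattern = Pattern.compile(regex);
        this.excludeEvents = excludeEvents;
    }

    /** Returns true if an event with this body should be kept. */
    boolean keep(byte[] body) {
        String content = new String(body, StandardCharsets.UTF_8);
        // matches() requires the whole body to match, unlike find()
        boolean matched = pattern.matcher(content).matches();
        return excludeEvents ? !matched : matched;
    }
}
```

Because matches() is anchored at both ends, a pattern used this way needs `.*` on both sides if it is meant to pick out a table name in the middle of a log line.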

In flume-kafka-conf.properties, change the i3 setting to:

agentkafka.sources.sr1.interceptors.i3.type = com.ai.csc.boss.flume.interceptor.CustomRegexFilteringInterceptor$Builder

The original flume-kafka-conf.properties was configured as follows:

agentkafka.sources = sr1
agentkafka.channels = c1
agentkafka.sinks = sk1

# For each one of the sources, the type is defined
agentkafka.sources.sr1.type = TAILDIR
agentkafka.sources.sr1.channels = c1
agentkafka.sources.sr1.positionFile = ../bin/agentkafka_taildir_position.json
agentkafka.sources.sr1.filegroups = f1
# Location of postgresql log files
agentkafka.sources.sr1.filegroups.f1 = /app/flume/postgresql-20.*csv
agentkafka.sources.sr1.fileHeader = true
agentkafka.sources.sr1.inputCharset = utf-8

# Interceptor definitions: keep only insert, update and delete log entries
agentkafka.sources.sr1.interceptors = i1 i2 i3
agentkafka.sources.sr1.interceptors.i1.type = regex_filter
# Database name of source DB
agentkafka.sources.sr1.interceptors.i1.regex = .*?,\"abcd\",.*
agentkafka.sources.sr1.interceptors.i1.excludeEvents = false
agentkafka.sources.sr1.interceptors.i2.type = regex_filter
agentkafka.sources.sr1.interceptors.i2.regex = .*?(execute.*: |statement: )(insert|update|delete|INSERT|UPDATE|DELETE).*
agentkafka.sources.sr1.interceptors.i2.excludeEvents = false
agentkafka.sources.sr1.interceptors.i3.type = regex_filter
# schema_name and table name which should be ignored
agentkafka.sources.sr1.interceptors.i3.regex = <a long regular expression listing the tables to exclude (upper and lower case)>
agentkafka.sources.sr1.interceptors.i3.excludeEvents = true

agentkafka.channels.c1.type = memory
agentkafka.channels.c1.keep-alive = 10
agentkafka.channels.c1.capacity = 100000
agentkafka.channels.c1.transactionCapacity = 10000

agentkafka.sinks.sk1.channel = c1
agentkafka.sinks.sk1.type = com.flume.sink.kafka.KafkaSink
# target topic in kafka cluster
agentkafka.sinks.sk1.kafka.topic = TOPIC_ABC
# target ip:port list of kafka cluster
agentkafka.sinks.sk1.kafka.bootstrap.servers = ip:port
agentkafka.sinks.sk1.kafka.producer.key.serializer = org.apache.kafka.common.serialization.LongSerializer
agentkafka.sinks.sk1.kafka.flumeBatchSize = 2000
agentkafka.sinks.sk1.kafka.producer.acks = 1
agentkafka.sinks.sk1.kafka.producer.linger.ms = 1
agentkafka.sinks.sk1.kafka.producer.compression.type = snappy

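Cause:

Of the regex matching methods, find() is a partial match: it searches the input string for a substring that matches the pattern. matches() is a full match: it matches the entire input string against the pattern. When filtering this large volume of data, matches() was faster than find(). A likely reason is that find() can retry the pattern from each start position of a line that does not match, whereas matches() makes a single attempt anchored at the start of the line.

Flume's RegexFilteringInterceptor uses the find() method to match strings.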
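The difference between the two methods can be illustrated with a small sketch. The class name `FindVsMatches`, its method names and the sample log lines are made up for illustration and are not taken from the original setup.

```java
import java.util.regex.Pattern;

/** Hypothetical demo of partial (find) versus full (matches) regex matching. */
class FindVsMatches {
    // No leading or trailing ".*": find() only needs a matching substring
    static final Pattern STATEMENT =
            Pattern.compile("statement: (insert|update|delete)");

    // Same pattern wrapped in ".*" so that it can succeed under matches()
    static final Pattern WRAPPED =
            Pattern.compile(".*statement: (insert|update|delete).*");

    /** Partial match: true if any substring of the line matches. */
    static boolean partial(String line) {
        return STATEMENT.matcher(line).find();
    }

    /** Full match: true only if the entire line matches the bare pattern. */
    static boolean full(String line) {
        return STATEMENT.matcher(line).matches();
    }

    /** Full match against the wrapped pattern. */
    static boolean fullWrapped(String line) {
        return WRAPPED.matcher(line).matches();
    }
}
```

For a made-up line such as `LOG:  statement: insert into t values (1)`, `partial` and `fullWrapped` return true while `full` returns false, because the bare pattern does not cover the whole line. The i1 and i2 patterns in the configuration above already start and end with `.*`, so they behave the same way under both methods.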

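The text below can be skipped.

I admit that this was a long and unsuccessful troubleshooting experience, even though the problem was solved in the end.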

To find the cause, I first enabled Flume's HTTP monitoring by adding the following to the startup command: -Dflume.monitoring.type=http -Dflume.monitoring.port=1234
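Once the agent is started with these options, the metrics are served as JSON at `/metrics` on the monitoring port. As a rough sketch of checking this by hand, assuming Java 11 or later and an agent reachable on the given host, the following fetches the raw JSON. The class name `FlumeMetrics` and its method names are hypothetical, and the 2-second timeouts are arbitrary choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper for reading Flume's HTTP monitoring endpoint. */
class FlumeMetrics {
    /** Builds the metrics URL for the given host and monitoring port. */
    static URI metricsUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/metrics");
    }

    /** Fetches the metrics as a raw JSON string. */
    static String fetch(String host, int port)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)) // arbitrary timeout
                .build();
        HttpRequest request = HttpRequest.newBuilder(metricsUri(host, port))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the port used above, `FlumeMetrics.fetch("localhost", 1234)` would return the JSON document that telegraf later scrapes.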

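Flume monitoring is described at https://www.jianshu.com/p/09493efe0fb8, or you can search for it yourself. It is also worth searching for "http monitoring performance metrics".

Then I installed grafana + influxdb + telegraf to collect and display the data.

The telegraf.conf in the linked article uses [[inputs.httpjson]], but according to the official site it is no longer used in version 1.6 and above (see the official documentation for details). This configuration therefore uses [[inputs.http]]: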

[[inputs.http]]
	urls = [
		"http://ip:port/metrics"
	]
	method = "GET"
	timeout = "1s"
	json_name_key = "SOURCE.sr1_Type"
	json_string_fields = ["SOURCE.sr1_EventReceivedCount","SOURCE.sr1_EventAcceptedCount","SOURCE.sr1_AppendBatchReceivedCount","SOURCE.sr1_AppendReceivedCount"]
	data_format = "json"

References: https://kiswo.com/article/1023
http://blog.51cto.com/11512826/2056183