Reading logback log files with Flume: a Flume log collection example

Reposted

云端小悟空 2024-08-02 12:36:39

Extracting log files with Flume

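Flume's principles are easy to understand; what matters more is knowing how to use it in practice. Flume ships with a large number of built-in Source, Channel, and Sink types, and the different types can be combined freely. The combination is driven by a user-supplied configuration file, which makes it very flexible. For example, a Channel can buffer events in memory or persist them to local disk, and a Sink can write logs to HDFS, HBase, or even to another Source.
Using Flume therefore comes down mostly to writing the configuration file, which describes the concrete implementation of each source, channel, and sink. You then start an agent instance; the agent reads the configuration file when it runs and collects data accordingly.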

Principles for writing the configuration file:
1> Declare, at the top level, which sources, sinks, and channels the agent contains

# Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

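2> Describe the concrete implementation of each source, sink, and channel in the agent. For a source, specify its type, i.e. whether it receives data from a file, over HTTP, over Thrift, and so on. Likewise, for a sink, specify whether the output goes to HDFS, HBase, and so on. For a channel, specify whether events are buffered in memory, in a database, in a file, and so on.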

# Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Describe the sink
    a1.sinks.k1.type = logger

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

3> Connect the source and the sink through the channel

# Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
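
The three rules above can be checked mechanically before an agent is started. The following is a minimal sketch, not part of Flume itself: `parse_properties` and `check_wiring` are hypothetical helper names, and the parser only handles the simple `key = value` lines and `#` comments used in this article.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems in the agent's source/sink/channel wiring."""
    problems = []
    sources = props.get(f"{agent}.sources", "").split()
    sinks = props.get(f"{agent}.sinks", "").split()
    channels = set(props.get(f"{agent}.channels", "").split())

    for src in sources:
        # A source may feed several channels (space-separated list).
        bound = props.get(f"{agent}.sources.{src}.channels", "").split()
        if not bound:
            problems.append(f"source {src} has no channels")
        for ch in bound:
            if ch not in channels:
                problems.append(f"source {src} uses undeclared channel {ch}")

    for sink in sinks:
        # A sink drains exactly one channel.
        ch = props.get(f"{agent}.sinks.{sink}.channel", "")
        if not ch:
            problems.append(f"sink {sink} has no channel")
        elif ch not in channels:
            problems.append(f"sink {sink} uses undeclared channel {ch}")
    return problems


# The a1 example from this article, reduced to the lines that matter here.
EXAMPLE = """
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_wiring(parse_properties(EXAMPLE), "a1"))
```

Note the asymmetry the check relies on: a source is bound with the plural key `channels`, while a sink is bound with the singular key `channel`.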

Shell command to start the agent:

flume-ng agent -n a1 -c ../conf -f ../conf/example.file \
    -Dflume.root.logger=DEBUG,console

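Parameters: -n sets the agent name (it must match the agent name used in the configuration file)
-c sets the directory that holds Flume's configuration files
-f sets the configuration file to load
-Dflume.root.logger=DEBUG,console sets the log level to DEBUG and sends the log output to the console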
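
Because an agent only loads the properties prefixed with its own name, a `-n` value that does not match the configuration file is an easy mistake to make. The sketch below assembles the command line shown above and refuses to do so when the name is absent from the file. `agent_names` and `build_command` are hypothetical helpers written for this illustration; only the `flume-ng` options already listed in this article are used.

```python
import shlex


def agent_names(config_text):
    """Collect the agent names that prefix the property keys in a config."""
    names = set()
    for raw in config_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            key = line.split("=", 1)[0].strip()
            names.add(key.split(".", 1)[0])
    return names


def build_command(agent, conf_dir, conf_file, config_text,
                  logger="DEBUG,console"):
    """Build the flume-ng argument list, checking -n against the config."""
    if agent not in agent_names(config_text):
        raise ValueError(f"agent {agent!r} is not defined in {conf_file}")
    return ["flume-ng", "agent",
            "-n", agent,          # agent name, as used in the config file
            "-c", conf_dir,       # directory with Flume's own config files
            "-f", conf_file,      # agent configuration file
            f"-Dflume.root.logger={logger}"]


CONFIG = "a1.sources = r1\na1.sinks = k1\na1.channels = c1\n"
cmd = build_command("a1", "../conf", "../conf/example.file", CONFIG)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` on a machine where `flume-ng` is on the `PATH`.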

1. Configure a2.conf

===== Edit a2.conf =====

#a2:agent name

a2.sources = r2

a2.channels = c2

a2.sinks = k2

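 # define sources
 # exec source: actively pull log lines by running a command
 a2.sources.r2.type = exec
 # command that reads the log (the user running Flume needs read permission; give the full path of the web application's log)
 a2.sources.r2.command = tail -F /var/log/httpd/access_log
 # shell used to run the command above
 a2.sources.r2.shell = /bin/bash -c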


 # define channels
 a2.channels.c2.type = memory
 a2.channels.c2.capacity = 1000
 a2.channels.c2.transactionCapacity = 100


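 # define sinks
 # write the events to HDFS
 a2.sinks.k2.type = hdfs
 a2.sinks.k2.hdfs.path=hdfs://[hostname]:8020/flume/%Y%m%d/%H
 a2.sinks.k2.hdfs.filePrefix = accesslog
 # round the event timestamp down, so that directories are created per time window
 a2.sinks.k2.hdfs.round=true
 # rounding window: roundValue = 1, roundUnit = hour
 a2.sinks.k2.hdfs.roundValue=1
 a2.sinks.k2.hdfs.roundUnit=hour
 # use the local time instead of a timestamp header on the event (required here: the %Y%m%d/%H escapes need a timestamp, and without one the sink reports an error)
 a2.sinks.k2.hdfs.useLocalTimeStamp=true
 # number of events written to a file before it is flushed to HDFS
 a2.sinks.k2.hdfs.batchSize=1000
 # file format: the default is SequenceFile (key/value pairs); DataStream writes a plain, uncompressed stream
 a2.sinks.k2.hdfs.fileType=DataStream
 # serialization format: Text
 a2.sinks.k2.hdfs.writeFormat=Text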


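 # roll settings that avoid producing too many small files
 # roll to a new file every 60 seconds
 a2.sinks.k2.hdfs.rollInterval=60
 # roll to a new file once it reaches 128000000 bytes (about 122 MiB; 127*1024*1024 would be 133169152 bytes)
 # in production, if files should roll near a 128M block size, this is usually set to about 127M
 a2.sinks.k2.hdfs.rollSize=128000000
 # 0 means rolling does not depend on the number of events
 a2.sinks.k2.hdfs.rollCount=0
 # set to 1; otherwise block replication triggers a new file and the three roll settings above have no effect
 a2.sinks.k2.hdfs.minBlockReplicas=1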

 # connect the sources and sinks through the channels
 a2.sources.r2.channels = c2
 a2.sinks.k2.channel = c2


 ===================================
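
To make the HDFS sink settings in a2.conf more concrete, here is a small sketch of two things they control: which directory an event lands in, and when a file is rolled. It is a simplified model written for this article, not Flume's actual code. `bucket_path` and `should_roll` are hypothetical names, and the path function relies on the fact that the `%Y`, `%m`, `%d`, `%H`, `%M`, and `%S` escapes used here mean the same thing in Python's `strftime`; other Flume escapes do not.

```python
from datetime import datetime


def bucket_path(template, ts, round_value=1, round_unit="hour"):
    """Round ts down to a multiple of round_value units, then fill the escapes."""
    if round_unit == "hour":
        ts = ts.replace(hour=ts.hour - ts.hour % round_value,
                        minute=0, second=0, microsecond=0)
    elif round_unit == "minute":
        ts = ts.replace(minute=ts.minute - ts.minute % round_value,
                        second=0, microsecond=0)
    elif round_unit == "second":
        ts = ts.replace(second=ts.second - ts.second % round_value,
                        microsecond=0)
    else:
        raise ValueError(f"unsupported round unit: {round_unit}")
    return ts.strftime(template)


def should_roll(age_seconds, bytes_written, event_count,
                roll_interval=60, roll_size=128000000, roll_count=0):
    """A file rolls when any enabled threshold is reached; 0 disables a rule."""
    return ((roll_interval > 0 and age_seconds >= roll_interval)
            or (roll_size > 0 and bytes_written >= roll_size)
            or (roll_count > 0 and event_count >= roll_count))


ts = datetime(2024, 8, 2, 12, 36, 39)
print(bucket_path("/flume/%Y%m%d/%H", ts))   # /flume/20240802/12
print(127 * 1024 * 1024)                     # 133169152
print(should_roll(age_seconds=10, bytes_written=4096, event_count=5000))
```

With `rollCount=0`, the event count never triggers a roll, so in the configuration above a file closes after 60 seconds or at 128000000 bytes, whichever comes first.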

2. Install the Apache HTTP server to generate website log files

2.1 Install Apache HTTP

# yum -y install httpd

2.2 Start the httpd service

# service httpd start

2.3 Create a static HTML page

# vi /var/www/html/index.html

this is a test html

2.4 Enter the hostname in a browser to open the page

vampire04

2.5 Make the httpd log directory readable so the log can be monitored in real time

# chmod -R 777 /var/log/httpd