The following content is taken from the Flume User Guide on the official website:

http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sinks

Source

A source is the component responsible for receiving data into a Flume agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.

Avro Source

Listens on an Avro port and receives events from external Avro client streams. When paired with the built-in Avro sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be avro |
| **bind** | – | hostname or IP address to listen on |
| **port** | – | Port # to bind to |
| threads | – | Maximum number of worker threads to spawn |
| selector.type | | |
| selector.* | | |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |
| compression-type | none | This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink |
| ssl | false | Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see SSL/TLS support section). |
| keystore | – | This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error). |
| keystore-password | – | The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error). |
| keystore-type | JKS | The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS). |
| exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
| include-protocols | – | Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, every supported protocol is included. |
| exclude-cipher-suites | – | Space-separated list of cipher suites to exclude. |
| include-cipher-suites | – | Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, every supported cipher suite is included. |
| ipFilter | false | Set this to true to enable ipFiltering for netty |
| ipFilterRules | – | Define N netty ipFilter pattern rules with this config. |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
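
The table's SSL-related properties can be combined with the basic example above. Below is a minimal sketch; the keystore path and password are hypothetical placeholders that must be replaced with real values:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.ssl = true
# hypothetical placeholder values; replace with your actual keystore and password
a1.sources.r1.keystore = /path/to/keystore.jks
a1.sources.r1.keystore-password = changeit
a1.sources.r1.keystore-type = JKS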

Exec Source

The Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard output (stderr is simply discarded, unless the logStdErr property is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] produce the desired results, whereas date will not: the first two commands produce streams of data, while the latter produces a single event and exits.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be exec |
| **command** | – | The command to execute |
| shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc. |
| restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
| restart | false | Whether the executed cmd should be restarted if it dies |
| logStdErr | false | Whether the command's stderr should be logged |
| batchSize | 20 | The max number of lines to read and send to the channel at a time |
| batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
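
When the command relies on shell features such as pipes or wildcards, the shell property must be set. Below is a sketch (the log path and grep pattern are hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
# the pipe is a shell feature, so a shell invocation is required
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.command = tail -F /var/log/app.log | grep --line-buffered ERROR
# restart the command if it dies, waiting 10 seconds between attempts
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000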

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. The source watches the specified directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is by default indicated by renaming the file; alternatively the file can be deleted, or the trackerDir can be used to keep track of processed files.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files may be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

1: If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.

2: If a file name is reused at a later time, Flume will print an error to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be spooldir. |
| **spoolDir** | – | The directory from which to read files from. |
| fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
| deletePolicy | never | When to delete completed files: never or immediate |
| fileHeader | false | Whether to add a header storing the absolute path filename. |
| fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
| basenameHeader | false | Whether to add a header storing the basename of the file. |
| basenameHeaderKey | basename | Header key to use when appending basename of file to event header. |

Example

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
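
To delete files immediately after ingestion instead of renaming them, and to record each file's basename in a header, the example above can be extended as sketched below (the property values follow the table; the setup itself is illustrative):

# building on the example above: delete completed files and add a basename header
a1.sources.src-1.deletePolicy = immediate
a1.sources.src-1.basenameHeader = true
a1.sources.src-1.basenameHeaderKey = basename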

Taildir Source

Watches the specified files, and tails them in near real-time once new lines are detected being appended to each file. If a new line is still being written, this source retries reading it until the write is complete.

This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to a given position file in JSON format. If Flume is stopped or goes down for some reason, it can restart tailing from the position recorded in the existing position file.

In other use cases, this source can also start tailing from an arbitrary position in each file using the given position file. When there is no position file at the specified path, it starts tailing from the first line of each file by default.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be TAILDIR. |
| **filegroups** | – | Space-separated list of file groups. Each file group indicates a set of files to be tailed. |
| **filegroups.\<filegroupName\>** | – | Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only. |
| positionFile | ~/.flume/taildir_position.json | File in JSON format to record the inode, the absolute path and the last position of each tailing file. |
| headers.\<filegroupName\>.\<headerKey\> | – | Header value which is the set with header key. Multiple headers can be specified for one file group. |
| byteOffsetHeader | false | Whether to add the byte offset of a tailed line to a header called 'byteoffset'. |
| skipToEnd | false | Whether to skip the position to EOF in the case of files not written on the position file. |
| idleTimeout | 120000 | Time (ms) to close inactive files. If the closed file is appended new lines to, this source will automatically re-open it. |
| writePosInterval | 3000 | Interval time (ms) to write the last position of each file on the position file. |
| batchSize | 100 | Max number of lines to read and send to the channel at a time. Using the default is usually fine. |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
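
The position file records the inode, absolute path, and last read position of each tailed file in JSON. Its contents look roughly like the sketch below (the values are hypothetical; consult the file written by your actual Flume version for the exact format):

[{"inode": 16797094, "pos": 6144, "file": "/var/log/test1/example.log"}]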

NetCat TCP Source

A netcat-like source that listens on a given port and turns each line of text into an event. It acts like nc -k -l [host] [port]; in other words, it opens the specified port and listens for data. The expectation is that the supplied data is newline-separated text. Each line of text is turned into a Flume event and sent via the connected channel.

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be netcat |
| **bind** | – | Host name or IP address to bind to |
| **port** | – | Port # to bind to |
| max-line-length | 512 | Max line length per event body (in bytes) |
| ack-every-event | true | Respond with an "OK" for every event received |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
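
A variant that raises the maximum line length and disables the per-event acknowledgement might look like the following sketch (the values are illustrative):

a1.sources.r1.max-line-length = 1024
a1.sources.r1.ack-every-event = false

Once the agent is running, the source can be exercised with a plain netcat client, e.g. nc 127.0.0.1 6666, typing one line per event; with the default settings each received event is acknowledged with "OK".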

NetCat UDP Source

As per the original NetCat (TCP) source, this source listens on a given port, turns each line of text into an event, and sends it via the connected channel. It acts like nc -u -k -l [host] [port].

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be netcatudp |
| **bind** | – | Host name or IP address to bind to |
| **port** | – | Port # to bind to |
| remoteAddressHeader | – | |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

Channel

A channel is the buffer that sits between a source and a sink, so it allows them to operate at different rates. Channels are thread-safe and can handle writes from several sources and reads from several sinks simultaneously.

Flume comes with two channels: the Memory Channel and the File Channel.

The Memory Channel is an in-memory queue. It is suitable in scenarios where data loss is not a concern. If data loss does matter, the Memory Channel should not be used, because process death, machine crashes, or restarts will all lose data.

The File Channel writes all events to disk, so no data is lost when the process shuts down or the machine goes down.

Memory Channel

The events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need higher throughput and are prepared to lose the staged data in the event of an agent failure. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **type** | – | The component type name, needs to be memory |
| capacity | 100 | The maximum number of events stored in the channel |
| transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction |
| keep-alive | 3 | Timeout in seconds for adding or removing an event |
| byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
| byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |

Example

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
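
A rough back-of-the-envelope check of the configuration above: with byteCapacity = 800000 and byteCapacityBufferPercentage = 20, about 20% of the byte capacity is reserved as a buffer for event headers, so event bodies may occupy at most roughly 800000 × 80% = 640000 bytes.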

File Channel

| Property Name | Default | Description |
| --- | --- | --- |
| **type** | – | The component type name, needs to be file. |
| checkpointDir | ~/.flume/file-channel/checkpoint | The directory where checkpoint file will be stored |
| useDualCheckpoints | false | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set |
| backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory |
| dataDirs | ~/.flume/file-channel/data | Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance |
| transactionCapacity | 10000 | The maximum size of transaction supported by the channel |
| checkpointInterval | 30000 | Amount of time (in millis) between checkpoints |
| maxFileSize | 2146435071 | Max size (in bytes) of a single log file |
| minimumRequiredSpace | 524288000 | Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value |
| capacity | 1000000 | Maximum capacity of the channel |
| keep-alive | 3 | Amount of time (in sec) to wait for a put operation |
| use-log-replay-v1 | false | Expert: Use old replay logic |
| use-fast-replay | false | Expert: Replay without using queue |
| checkpointOnClose | true | Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay. |
| encryption.activeKey | – | Key name used to encrypt new data |
| encryption.cipherProvider | – | Cipher provider type, supported types: AESCTRNOPADDING |
| encryption.keyProvider | – | Key provider type, supported types: JCEKSFILE |
| encryption.keyProvider.keyStoreFile | – | Path to the keystore file |
| encryption.keyProvider.keyStorePasswordFile | – | Path to the keystore password file |
| encryption.keyProvider.keys | – | List of all keys (e.g. history of the activeKey setting) |
| encryption.keyProvider.keys.*.passwordFile | – | Path to the optional key password file |

Note: By default, the File Channel uses paths for the checkpoint and data directories under the user's home directory, as shown above. As a result, if you have more than one File Channel instance active within the agent, only one will be able to lock the directories and the other channel initializations will fail. It is therefore necessary to provide explicit paths for all the configured channels, preferably on different disks. Furthermore, as the File Channel syncs to disk after every commit, coupling it with a sink/source that batches events may be necessary to provide good performance where multiple disks are not available for the checkpoint and data directories.

Example

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
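
Per the note above, when an agent hosts more than one File Channel, each needs its own explicit directories, preferably on separate disks. A sketch with two channels (the mount points are hypothetical):

a1.channels = c1 c2
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/data
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /mnt/disk2/flume/checkpoint
a1.channels.c2.dataDirs = /mnt/disk2/flume/data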

Sink

A sink continuously polls the channel for events and removes them in batches, writing them in bulk to a storage or indexing system, or sending them to another Flume agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression for both file types. The files can be rolled (close the current file and create a new one) periodically based on the elapsed time, the size of the data, or the number of events. It also buckets/partitions data by attributes like the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that are replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

The file in use will have its name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows partially complete files to be excluded in the directory. Required properties are in bold.

Note: For all of the time related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.

| Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be hdfs |
| **hdfs.path** | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
| hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
| hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
| hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into |
| hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
| hdfs.emptyInUseSuffix | false | If false an hdfs.inUseSuffix is used while writing the output. After closing the output hdfs.inUseSuffix is removed from the output file name. If true the hdfs.inUseSuffix parameter is ignored and an empty string is used instead. |
| hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
| hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
| hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
| hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
| hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS |
| hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
| hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
| hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
| hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
| hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

The above configuration rounds down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.
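
As noted above, the time-based escape sequences require a "timestamp" header on each event. One way to add it automatically is to configure the TimestampInterceptor on the upstream source; a sketch (the source name r1 is assumed):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp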

Logger Sink

Logs events at the INFO level, typically useful for testing/debugging purposes. Required properties are in bold. This sink is the only exception which does not require the extra configuration explained in the "Logging raw data" section.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be logger |
| maxBytesToLog | 16 | Maximum number of bytes of the Event body to log |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Avro Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured channel in batches of the configured batch size. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be avro. |
| **hostname** | – | The hostname or IP address to connect to. |
| **port** | – | The port # to connect to. |
| batch-size | 100 | Number of events to batch together for send. |
| connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
| request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
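
Combined with the Avro source described earlier, two agents can be chained into a tiered topology: the upstream agent's Avro sink points at the host/port the downstream agent's Avro source listens on. A sketch (the agent names and hostname are hypothetical; in practice each agent loads its own section of the configuration):

# upstream agent a1: Avro sink pointing at the collector host
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4545
# downstream agent a2: Avro source listening on the same port
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545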

File Roll Sink

Stores events on the local filesystem. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be file_roll. |
| **sink.directory** | – | The directory where files will be stored |
| sink.pathManager | DEFAULT | The PathManager implementation to use. |
| sink.pathManager.extension | – | The file extension if the default PathManager is used. |
| sink.pathManager.prefix | – | A character string to add to the beginning of the file name if the default PathManager is used |
| sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
| sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface. |
| sink.batchSize | 100 | |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
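
To write all events to a single file with no rolling, sink.rollInterval can be set to 0; with the default PathManager a file prefix and extension can also be specified. A sketch extending the example above (the prefix and extension values are illustrative):

a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.sink.pathManager.prefix = events-
a1.sinks.k1.sink.pathManager.extension = log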

HBaseSink

This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, specified by the configuration, is used to convert the events into HBase puts and/or increments, which are then written to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event that HBase fails to write certain events, the sink will replay all events in that transaction.

The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user the agent is running as must have write permissions to the table the sink is configured to write to. The principal and keytab to use to authenticate against the KDC can be specified in the configuration. The hbase-site.xml on the Flume agent's classpath must have authentication set to kerberos (for details on how to do this, please refer to the HBase documentation).

For convenience, two serializers are provided with Flume. SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer) writes the event body as-is to HBase, and optionally increments a column in HBase. This is primarily an example implementation. RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body based on the given regex and writes each part into different columns.

The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink.

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be hbase |
| **table** | – | The name of the table in Hbase to write to. |
| **columnFamily** | – | The column family in Hbase to write to. |
| zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
| znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
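
RegexHbaseEventSerializer additionally accepts serializer.* options for the splitting regex and the target column names; the exact option names should be checked against the serializer's documentation. A hypothetical sketch splitting a comma-separated body into two columns:

# hypothetical regex and column names, for illustration only
a1.sinks.k1.serializer.regex = ([^,]*),([^,]*)
a1.sinks.k1.serializer.colNames = col1,col2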