The following content is taken from the Flume user guide on the official website:
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sinks
Source
A Source is the component responsible for receiving data into a Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.
Avro Source
Listens on an Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be avro |
bind | – | hostname or IP address to listen on |
port | – | Port # to bind to |
threads | – | Maximum number of worker threads to spawn |
selector.type | ||
selector.* | ||
interceptors | – | Space-separated list of interceptors |
interceptors.* | ||
compression-type | none | This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource |
ssl | false | Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a “keystore” and a “keystore-password”, either through component level parameters (see below) or as global SSL parameters (see SSL/TLS support section). |
keystore | – | This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error). |
keystore-password | – | The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error). |
keystore-type | JKS | The type of the Java keystore. This can be “JKS” or “PKCS12”. If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS). |
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
include-protocols | – | Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, it includes every supported protocol. |
exclude-cipher-suites | – | Space-separated list of cipher suites to exclude. |
include-cipher-suites | – | Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, it includes every supported cipher suite. |
ipFilter | false | Set this to true to enable ipFiltering for netty |
ipFilterRules | – | Define N netty ipFilter pattern rules with this config. |
Example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
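As a sketch of the tiered topology mentioned above, an upstream agent (named a2 here purely for illustration) could point an Avro Sink at the Avro Source configured above; the hostname is an assumed address for the machine running agent a1:
# hypothetical upstream agent a2 forwarding events to a1's Avro Source
a2.channels = c1
a2.sinks = k1
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
# host running agent a1 (assumed address); port matches a1.sources.r1.port
a2.sinks.k1.hostname = 10.10.10.10
a2.sinks.k1.port = 4141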
Exec Source
Exec Source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date probably will not: the former two commands produce streams of data, whereas the latter produces a single event and exits.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be exec |
command | – | The command to execute |
shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc. |
restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
restart | false | Whether the executed cmd should be restarted if it dies |
logStdErr | false | Whether the command’s stderr should be logged |
batchSize | 20 | The max number of lines to read and send to the channel at a time |
batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
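If the command relies on shell features such as pipes, wildcards, or backticks, the shell property from the table above is required. A minimal sketch, assuming a pipe through grep (the filter expression is only illustrative):
a1.sources.r1.type = exec
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.command = tail -F /var/log/secure | grep --line-buffered "Failed"
a1.sources.r1.channels = c1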
Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. The source watches the specified directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is by default indicated by renaming the file, or the file can be deleted, or the trackerDir can be used to keep track of processed files.
Unlike Exec Source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files may be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:
1: If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
2: If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.
Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be spooldir |
spoolDir | – | The directory from which to read files from. |
fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
deletePolicy | never | When to delete completed files: never or immediate |
fileHeader | false | Whether to add a header storing the absolute path filename. |
fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
basenameHeader | false | Whether to add a header storing the basename of the file. |
basenameHeaderKey | basename | Header Key to use when appending basename of file to event header. |
Example
a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
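A variant sketch of the source above that deletes fully ingested files instead of renaming them with the .COMPLETED suffix (deletePolicy values per the table above):
# delete files once they have been fully read into the channel
a1.sources.src-1.deletePolicy = immediate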
Taildir Source
Watches the specified files, and tails them in near real time once new lines appended to each file are detected. If new lines are being written, this source retries reading them while waiting for the write to complete.
This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file in the given position file in JSON format. If Flume is stopped or goes down for some reason, it can restart tailing from the positions recorded in the existing position file.
In another use case, this source can also start tailing from arbitrary positions for each file using the given position file. When there is no position file on the specified path, it starts tailing from the first line of each file by default.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be TAILDIR |
filegroups | – | Space-separated list of file groups. Each file group indicates a set of files to be tailed. |
filegroups.<filegroupName> | – | Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only. |
positionFile | ~/.flume/taildir_position.json | File in JSON format to record the inode, the absolute path and the last position of each tailing file. |
headers.<filegroupName>.<headerKey> | – | Header value which is set with the header key. Multiple headers can be specified for one file group. |
byteOffsetHeader | false | Whether to add the byte offset of a tailed line to a header called ‘byteoffset’. |
skipToEnd | false | Whether to skip the position to EOF in the case of files not written on the position file. |
idleTimeout | 120000 | Time (ms) to close inactive files. If the closed file is appended new lines to, this source will automatically re-open it. |
writePosInterval | 3000 | Interval time (ms) to write the last position of each file on the position file. |
batchSize | 100 | Max number of lines to read and send to the channel at a time. Using the default is usually fine. |
Example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
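For illustration, the position file written by this source is a JSON array that records the inode, last read position, and absolute path of each tailed file; the field names and values below are an assumed sketch of its shape, not taken from the source:
[{"inode": 2496272, "pos": 666, "file": "/var/log/test1/example.log"},
 {"inode": 2496275, "pos": 0, "file": "/var/log/test2/app.log"}]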
NetCat TCP Source
A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline-separated text. Each line of text is turned into a Flume event and sent via the connected channel.
Required properties are in bold.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be netcat |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
max-line-length | 512 | Max line length per event body (in bytes) |
ack-every-event | true | Respond with an “OK” for every event received |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
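With the agent above running, the source can be exercised from another terminal by piping newline-separated text into the configured port; a usage sketch, assuming nc is available on the same host:
echo "hello flume" | nc localhost 6666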
NetCat UDP Source
As per the original Netcat (TCP) source, this source listens on a given port and turns each line of text into an event that is sent via the connected channel. Acts like nc -u -k -l [host] [port].
Required properties are in bold.
Property Name | Default | Description |
channels | – | |
type | – | The component type name, needs to be netcatudp |
bind | – | Host name or IP address to bind to |
port | – | Port # to bind to |
remoteAddressHeader | – | |
selector.type | replicating | replicating or multiplexing |
selector.* | Depends on the selector.type value | |
interceptors | – | Space-separated list of interceptors |
interceptors.* |
Example
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
Channel
A Channel is a buffer that sits between a Source and a Sink, which allows Sources and Sinks to operate at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with two commonly used Channels: the Memory Channel and the File Channel.
The Memory Channel is an in-memory queue. It is suitable when data loss is not a concern; if data loss does matter, the Memory Channel should not be used, because process death, machine crashes, and restarts will all lose data.
The File Channel writes all events to disk, so no data is lost if the process shuts down or the machine goes down.
Memory Channel
Events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need higher throughput and are prepared to lose staged data in the event of agent failure. Required properties are in bold.
Property Name | Default | Description |
type | – | The component type name, needs to be memory |
capacity | 100 | The maximum number of events stored in the channel |
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction |
keep-alive | 3 | Timeout in seconds for adding or removing an event |
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body. |
Example
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
File Channel
Property Name | Default | Description |
type | – | The component type name, needs to be file |
checkpointDir | ~/.flume/file-channel/checkpoint | The directory where checkpoint file will be stored |
useDualCheckpoints | false | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set |
backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory |
dataDirs | ~/.flume/file-channel/data | Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance |
transactionCapacity | 10000 | The maximum size of transaction supported by the channel |
checkpointInterval | 30000 | Amount of time (in millis) between checkpoints |
maxFileSize | 2146435071 | Max size (in bytes) of a single log file |
minimumRequiredSpace | 524288000 | Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value |
capacity | 1000000 | Maximum capacity of the channel |
keep-alive | 3 | Amount of time (in sec) to wait for a put operation |
use-log-replay-v1 | false | Expert: Use old replay logic |
use-fast-replay | false | Expert: Replay without using queue |
checkpointOnClose | true | Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay. |
encryption.activeKey | – | Key name used to encrypt new data |
encryption.cipherProvider | – | Cipher provider type, supported types: AESCTRNOPADDING |
encryption.keyProvider | – | Key provider type, supported types: JCEKSFILE |
encryption.keyProvider.keyStoreFile | – | Path to the keystore file |
encryption.keyProvider.keyStorePasswordFile | – | Path to the keystore password file |
encryption.keyProvider.keys | – | List of all keys (e.g. history of the activeKey setting) |
encryption.keyProvider.keys.*.passwordFile | – | Path to the optional key password file |
Note: By default the File Channel uses paths for the checkpoint and data directories that are inside the user's home directory, as noted above. As a result, if you have more than one File Channel instance active within the agent, only one will be able to lock the directories and the other channel's initialization will fail. It is therefore necessary to provide explicit paths for all of the configured channels, preferably on different disks. Furthermore, because the File Channel syncs to disk after every commit, coupling it with a sink/source that batches events may be necessary to provide good performance where multiple disks are not available for the checkpoint and data directories.
Example
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
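To illustrate the note above, a sketch of two File Channels in the same agent with explicit, non-overlapping directories (the second channel and the paths are assumptions for illustration):
a1.channels = c1 c2
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/data
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /mnt/disk2/flume/checkpoint
a1.channels.c2.dataDirs = /mnt/disk2/flume/data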
Sink
A Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.
HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, and it supports compression for both file types. Files can be rolled (close the current file and create a new one) periodically based on the elapsed time, the size of the data, or the number of events. It also buckets/partitions data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.
The file name in use will be modified to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows partially complete files in the directory to be excluded. Required properties are in bold.
Note: for all of the time-related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.
Name | Default | Description |
channel | – | |
type | – | The component type name, needs to be hdfs |
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into |
hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
hdfs.emptyInUseSuffix | false | If true, the hdfs.inUseSuffix is not used while writing the output |
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
hdfs.batchSize | 100 | number of events written to file before it is flushed to HDFS |
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time |
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour |
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
Example
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
The above configuration will round down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.
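Since the time-based escape sequences in hdfs.path above require a timestamp header (see the note before the table), a minimal sketch of attaching a TimestampInterceptor to the source feeding this sink, assuming that source is named r1:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp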
Logger Sink
Logs events at INFO level. Typically useful for testing/debugging purposes. Required properties are in bold. This sink is the only exception that does not require the extra configuration explained in the "Logging raw data" section.
Property Name | Default | Description |
channel | – | |
type | – | The component type name, needs to be logger |
maxBytesToLog | 16 | Maximum number of bytes of the Event body to log |
Example
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
Avro Sink
This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size. Required properties are in bold.
Property Name | Default | Description |
channel | – | |
type | – | The component type name, needs to be avro |
hostname | – | The hostname or IP address to send events to. |
port | – | The port # to connect to. |
batch-size | 100 | number of events to batch together for send. |
connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |
Example
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
File Roll Sink
Stores events on the local filesystem. Required properties are in bold.
Property Name | Default | Description |
channel | – | |
type | – | The component type name, needs to be file_roll |
sink.directory | – | The directory where files will be stored |
sink.pathManager | DEFAULT | The PathManager implementation to use. |
sink.pathManager.extension | – | The file extension if the default PathManager is used. |
sink.pathManager.prefix | – | A character string to add to the beginning of the file name if the default PathManager is used |
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface. |
sink.batchSize | 100 |
Example
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
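A variant sketch that disables rolling, so all events are written to a single file, per the sink.rollInterval description above:
a1.sinks.k1.sink.rollInterval = 0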
HBaseSink
This sink writes data to HBase. The Hbase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. These puts and increments are then written to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. If Hbase fails to write certain events, the sink will replay all events in that transaction.
The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user the agent is running as must have write permissions to the table the sink is configured to write to. The principal and keytab to use to authenticate against the KDC can be specified in the configuration. The hbase-site.xml in the Flume agent's classpath must have authentication set to kerberos (for details on how to do this, please refer to the HBase documentation).
For convenience, two serializers are provided with Flume. The SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer) writes the event body as-is to HBase and optionally increments a column in Hbase. This is primarily an example implementation. The RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body based on the given regex and writes each part into a different column.
The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink.
Required properties are in bold.
Property Name | Default | Description |
channel | – | |
type | – | The component type name, needs to be hbase |
table | – | The name of the table in Hbase to write to. |
columnFamily | – | The column family in Hbase to write to. |
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |
Example
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
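As a sketch of configuring the RegexHbaseEventSerializer mentioned above, the serializer is usually given a regex and a list of column names so that each capture group is written to its own column; the property names follow common usage of this serializer, and the pattern and column names are illustrative assumptions, not values from the source:
# split an event body of the form "host level message..." into three columns (illustrative)
a1.sinks.k1.serializer.regex = ^([^ ]+) ([^ ]+) (.*)$
a1.sinks.k1.serializer.colNames = host,level,message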