The following content is taken from the Flume User Guide on the official website:

http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-sinks

Source

A source is the component responsible for receiving data into a Flume agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.

Avro Source

Listens on an Avro port and receives events from external Avro client streams. When paired with the built-in Avro sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be avro |
| **bind** | – | hostname or IP address to listen on |
| **port** | – | Port # to bind to |
| threads | – | Maximum number of worker threads to spawn |
| selector.type | | |
| selector.* | | |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |
| compression-type | none | This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink |
| ssl | false | Set this to true to enable SSL encryption. If SSL is enabled, you must also specify a "keystore" and a "keystore-password", either through component level parameters (see below) or as global SSL parameters (see SSL/TLS support section). |
| keystore | – | This is the path to a Java keystore file. If not specified here, then the global keystore will be used (if defined, otherwise configuration error). |
| keystore-password | – | The password for the Java keystore. If not specified here, then the global keystore password will be used (if defined, otherwise configuration error). |
| keystore-type | JKS | The type of the Java keystore. This can be "JKS" or "PKCS12". If not specified here, then the global keystore type will be used (if defined, otherwise the default is JKS). |
| exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified. |
| include-protocols | – | Space-separated list of SSL/TLS protocols to include. The enabled protocols will be the included protocols without the excluded protocols. If included-protocols is empty, every supported protocol is included. |
| exclude-cipher-suites | – | Space-separated list of cipher suites to exclude. |
| include-cipher-suites | – | Space-separated list of cipher suites to include. The enabled cipher suites will be the included cipher suites without the excluded cipher suites. If included-cipher-suites is empty, every supported cipher suite is included. |
| ipFilter | false | Set this to true to enable ipFiltering for netty |
| ipFilterRules | – | Define N netty ipFilter pattern rules with this config. |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
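
The table's SSL-related properties can be combined with the basic example above. Below is a minimal sketch; the keystore path and password are hypothetical placeholders that must be replaced with real values:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.ssl = true
# hypothetical placeholder values; replace with your actual keystore and password
a1.sources.r1.keystore = /path/to/keystore.jks
a1.sources.r1.keystore-password = changeit
a1.sources.r1.keystore-type = JKS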

Exec Source

The Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard output (stderr is simply discarded, unless the logStdErr property is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] produce the desired results, whereas date will not: the first two commands produce streams of data, while the latter produces a single event and exits.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be exec |
| **command** | – | The command to execute |
| shell | – | A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc. |
| restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
| restart | false | Whether the executed cmd should be restarted if it dies |
| logStdErr | false | Whether the command's stderr should be logged |
| batchSize | 20 | The max number of lines to read and send to the channel at a time |
| batchTimeout | 3000 | Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
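
When the command relies on shell features such as pipes or wildcards, the shell property must be set. Below is a sketch (the log path and grep pattern are hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
# the pipe is a shell feature, so a shell invocation is required
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.command = tail -F /var/log/app.log | grep --line-buffered ERROR
# restart the command if it dies, waiting 10 seconds between attempts
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000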

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. The source watches the specified directory for new files and parses events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, completion is by default indicated by renaming the file; alternatively the file can be deleted, or the trackerDir can be used to keep track of processed files.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files may be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

1: If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.

2: If a file name is reused at a later time, Flume will print an error to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be spooldir. |
| **spoolDir** | – | The directory from which to read files from. |
| fileSuffix | .COMPLETED | Suffix to append to completely ingested files |
| deletePolicy | never | When to delete completed files: never or immediate |
| fileHeader | false | Whether to add a header storing the absolute path filename. |
| fileHeaderKey | file | Header key to use when appending absolute path filename to event header. |
| basenameHeader | false | Whether to add a header storing the basename of the file. |
| basenameHeaderKey | basename | Header key to use when appending basename of file to event header. |

Example

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
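
To delete files immediately after ingestion instead of renaming them, and to record each file's basename in a header, the example above can be extended as sketched below (the property values follow the table; the setup itself is illustrative):

# building on the example above: delete completed files and add a basename header
a1.sources.src-1.deletePolicy = immediate
a1.sources.src-1.basenameHeader = true
a1.sources.src-1.basenameHeaderKey = basename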

Taildir Source

Watches the specified files, and tails them in near real-time once new lines are detected being appended to each file. If a new line is still being written, this source retries reading it until the write is complete.

This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to a given position file in JSON format. If Flume is stopped or goes down for some reason, it can restart tailing from the position recorded in the existing position file.

In other use cases, this source can also start tailing from an arbitrary position in each file using the given position file. When there is no position file at the specified path, it starts tailing from the first line of each file by default.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be TAILDIR. |
| **filegroups** | – | Space-separated list of file groups. Each file group indicates a set of files to be tailed. |
| **filegroups.\<filegroupName\>** | – | Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only. |
| positionFile | ~/.flume/taildir_position.json | File in JSON format to record the inode, the absolute path and the last position of each tailing file. |
| headers.\<filegroupName\>.\<headerKey\> | – | Header value which is the set with header key. Multiple headers can be specified for one file group. |
| byteOffsetHeader | false | Whether to add the byte offset of a tailed line to a header called 'byteoffset'. |
| skipToEnd | false | Whether to skip the position to EOF in the case of files not written on the position file. |
| idleTimeout | 120000 | Time (ms) to close inactive files. If the closed file is appended new lines to, this source will automatically re-open it. |
| writePosInterval | 3000 | Interval time (ms) to write the last position of each file on the position file. |
| batchSize | 100 | Max number of lines to read and send to the channel at a time. Using the default is usually fine. |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
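
The position file records the inode, absolute path, and last read position of each tailed file in JSON. Its contents look roughly like the sketch below (the values are hypothetical; consult the file written by your actual Flume version for the exact format):

[{"inode": 16797094, "pos": 6144, "file": "/var/log/test1/example.log"}]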

NetCat TCP Source

A netcat-like source that listens on a given port and turns each line of text into an event. It acts like nc -k -l [host] [port]; in other words, it opens the specified port and listens for data. The expectation is that the supplied data is newline-separated text. Each line of text is turned into a Flume event and sent via the connected channel.

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be netcat |
| **bind** | – | Host name or IP address to bind to |
| **port** | – | Port # to bind to |
| max-line-length | 512 | Max line length per event body (in bytes) |
| ack-every-event | true | Respond with an "OK" for every event received |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1
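
A variant that raises the maximum line length and disables the per-event acknowledgement might look like the following sketch (the values are illustrative):

a1.sources.r1.max-line-length = 1024
a1.sources.r1.ack-every-event = false

Once the agent is running, the source can be exercised with a plain netcat client, e.g. nc 127.0.0.1 6666, typing one line per event; with the default settings each received event is acknowledged with "OK".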

NetCat UDP Source

As per the original NetCat (TCP) source, this source listens on a given port, turns each line of text into an event, and sends it via the connected channel. It acts like nc -u -k -l [host] [port].

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channels** | – | |
| **type** | – | The component type name, needs to be netcatudp |
| **bind** | – | Host name or IP address to bind to |
| **port** | – | Port # to bind to |
| remoteAddressHeader | – | |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |

Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcatudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

Channel

A channel is the buffer that sits between a source and a sink, so it allows them to operate at different rates. Channels are thread-safe and can handle writes from several sources and reads from several sinks simultaneously.

Flume comes with two channels: the Memory Channel and the File Channel.

The Memory Channel is an in-memory queue. It is suitable in scenarios where data loss is not a concern. If data loss does matter, the Memory Channel should not be used, because process death, machine crashes, or restarts will all lose data.

The File Channel writes all events to disk, so no data is lost when the process shuts down or the machine goes down.

Memory Channel

The events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need higher throughput and are prepared to lose the staged data in the event of an agent failure. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **type** | – | The component type name, needs to be memory |
| capacity | 100 | The maximum number of events stored in the channel |
| transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction |
| keep-alive | 3 | Timeout in seconds for adding or removing an event |
| byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
| byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |

Example

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
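
A rough back-of-the-envelope check of the configuration above: with byteCapacity = 800000 and byteCapacityBufferPercentage = 20, about 20% of the byte capacity is reserved as a buffer for event headers, so event bodies may occupy at most roughly 800000 × 80% = 640000 bytes.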

File Channel

| Property Name | Default | Description |
| --- | --- | --- |
| **type** | – | The component type name, needs to be file. |
| checkpointDir | ~/.flume/file-channel/checkpoint | The directory where checkpoint file will be stored |
| useDualCheckpoints | false | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set |
| backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory |
| dataDirs | ~/.flume/file-channel/data | Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance |
| transactionCapacity | 10000 | The maximum size of transaction supported by the channel |
| checkpointInterval | 30000 | Amount of time (in millis) between checkpoints |
| maxFileSize | 2146435071 | Max size (in bytes) of a single log file |
| minimumRequiredSpace | 524288000 | Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value |
| capacity | 1000000 | Maximum capacity of the channel |
| keep-alive | 3 | Amount of time (in sec) to wait for a put operation |
| use-log-replay-v1 | false | Expert: Use old replay logic |
| use-fast-replay | false | Expert: Replay without using queue |
| checkpointOnClose | true | Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay. |
| encryption.activeKey | – | Key name used to encrypt new data |
| encryption.cipherProvider | – | Cipher provider type, supported types: AESCTRNOPADDING |
| encryption.keyProvider | – | Key provider type, supported types: JCEKSFILE |
| encryption.keyProvider.keyStoreFile | – | Path to the keystore file |
| encryption.keyProvider.keyStorePasswordFile | – | Path to the keystore password file |
| encryption.keyProvider.keys | – | List of all keys (e.g. history of the activeKey setting) |
| encryption.keyProvider.keys.*.passwordFile | – | Path to the optional key password file |

Note: By default, the File Channel uses paths for the checkpoint and data directories under the user's home directory, as shown above. As a result, if you have more than one File Channel instance active within the agent, only one will be able to lock the directories and the other channel initializations will fail. It is therefore necessary to provide explicit paths for all the configured channels, preferably on different disks. Furthermore, as the File Channel syncs to disk after every commit, coupling it with a sink/source that batches events may be necessary to provide good performance where multiple disks are not available for the checkpoint and data directories.

Example

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data
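
Per the note above, when an agent hosts more than one File Channel, each needs its own explicit directories, preferably on separate disks. A sketch with two channels (the mount points are hypothetical):

a1.channels = c1 c2
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/data
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /mnt/disk2/flume/checkpoint
a1.channels.c2.dataDirs = /mnt/disk2/flume/data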

Sink

A sink continuously polls the channel for events and removes them in batches, writing them in bulk to a storage or indexing system, or sending them to another Flume agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression for both file types. The files can be rolled (close the current file and create a new one) periodically based on the elapsed time, the size of the data, or the number of events. It also buckets/partitions data by attributes like the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that are replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

The file in use will have its name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows partially complete files to be excluded in the directory. Required properties are in bold.

Note: For all of the time related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.

| Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be hdfs |
| **hdfs.path** | – | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
| hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
| hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
| hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into |
| hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
| hdfs.emptyInUseSuffix | false | If false an hdfs.inUseSuffix is used while writing the output. After closing the output hdfs.inUseSuffix is removed from the output file name. If true the hdfs.inUseSuffix parameter is ignored and an empty string is used instead. |
| hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
| hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0: never roll based on file size) |
| hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
| hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
| hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS |
| hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
| hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
| hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
| hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
| hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

The above configuration rounds down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.
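
As noted above, the time-based escape sequences require a "timestamp" header on each event. One way to add it automatically is to configure the TimestampInterceptor on the upstream source; a sketch (the source name r1 is assumed):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp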

Logger Sink

Logs events at the INFO level, typically useful for testing/debugging purposes. Required properties are in bold. This sink is the only exception which does not require the extra configuration explained in the "Logging raw data" section.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be logger |
| maxBytesToLog | 16 | Maximum number of bytes of the Event body to log |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Avro Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured channel in batches of the configured batch size. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be avro. |
| **hostname** | – | The hostname or IP address to connect to. |
| **port** | – | The port # to connect to. |
| batch-size | 100 | Number of events to batch together for send. |
| connect-timeout | 20000 | Amount of time (ms) to allow for the first (handshake) request. |
| request-timeout | 20000 | Amount of time (ms) to allow for requests after the first. |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
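
Combined with the Avro source described earlier, two agents can be chained into a tiered topology: the upstream agent's Avro sink points at the host/port the downstream agent's Avro source listens on. A sketch (the agent names and hostname are hypothetical; in practice each agent loads its own section of the configuration):

# upstream agent a1: Avro sink pointing at the collector host
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4545
# downstream agent a2: Avro source listening on the same port
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545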

File Roll Sink

Stores events on the local filesystem. Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be file_roll. |
| **sink.directory** | – | The directory where files will be stored |
| sink.pathManager | DEFAULT | The PathManager implementation to use. |
| sink.pathManager.extension | – | The file extension if the default PathManager is used. |
| sink.pathManager.prefix | – | A character string to add to the beginning of the file name if the default PathManager is used |
| sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
| sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface. |
| sink.batchSize | 100 | |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
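
To write all events to a single file with no rolling, sink.rollInterval can be set to 0; with the default PathManager a file prefix and extension can also be specified. A sketch extending the example above (the prefix and extension values are illustrative):

a1.sinks.k1.sink.rollInterval = 0
a1.sinks.k1.sink.pathManager.prefix = events-
a1.sinks.k1.sink.pathManager.extension = log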

HBaseSink

This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, specified by the configuration, is used to convert the events into HBase puts and/or increments, which are then written to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event that HBase fails to write certain events, the sink will replay all events in that transaction.

The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user the agent is running as must have write permissions to the table the sink is configured to write to. The principal and keytab to use to authenticate against the KDC can be specified in the configuration. The hbase-site.xml on the Flume agent's classpath must have authentication set to kerberos (for details on how to do this, please refer to the HBase documentation).

For convenience, two serializers are provided with Flume. SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer) writes the event body as-is to HBase, and optionally increments a column in HBase. This is primarily an example implementation. RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body based on the given regex and writes each part into different columns.

The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink.

Required properties are in bold.

| Property Name | Default | Description |
| --- | --- | --- |
| **channel** | – | |
| **type** | – | The component type name, needs to be hbase |
| **table** | – | The name of the table in Hbase to write to. |
| **columnFamily** | – | The column family in Hbase to write to. |
| zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml |
| znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml |

Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
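
RegexHbaseEventSerializer additionally accepts serializer.* options for the splitting regex and the target column names; the exact option names should be checked against the serializer's documentation. A hypothetical sketch splitting a comma-separated body into two columns:

# hypothetical regex and column names, for illustration only
a1.sinks.k1.serializer.regex = ([^,]*),([^,]*)
a1.sinks.k1.serializer.colNames = col1,col2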