EFK (Elasticsearch + Fluentd + Kibana) Logging System, Continued (Part 2)


Author: 山西空管技术支持 (Shanxi ATC Technical Support), 2022-01-18 16:26:41. Blog category: AirNet-Linux-B-RedHat7.5


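© Copyright belongs to the author. This is an original work by 山西空管技术支持, published on the 51CTO blog. Please contact the author for permission to repost; unauthorized reposting may lead to legal action.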

Fluentd neither produces nor stores data; it only moves data from one place to another. It also supports log file rotation.

1. Fluentd stopped sending data to ES (error="buffer space has too many data")

https://github.com/uken/fluent-plugin-elasticsearch#declined-logs-are-resubmitted-forever-why

output.conf: |
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      type_name _doc
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      logstash_format true
      logstash_prefix "#{ENV['LOGSTASH_PREFIX']}"
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      slow_flush_log_threshold 25.0
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        flush_interval 5s
        flush_thread_count 4
        chunk_full_threshold 0.9
        # retry_forever
        # If you use disable_retry_limit (v0.12) or retry_forever (v0.14 or later), be careful: memory consumption can grow without bound.
        retry_type exponential_backoff
        retry_timeout 1m
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action drop_oldest_chunk
      </buffer>
    </match>

The Fluentd daemonset failed to flush the buffer. fluent-plugin-elasticsearch reloads its connections after 10000 requests. (This does not correspond to the event count, because the ES plugin uses the bulk API.) This behavior, which originates from the elasticsearch-ruby gem, is enabled by default. Sometimes the reloading interferes with sending events through the ES plugin. After changing the match and buffer sections as follows, the problem went away.

reload_connections false      # defaults to true
reconnect_on_error true       # defaults to false
reload_on_failure true        # defaults to false

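Each event is usually small, so writing every event to the output the moment it has been processed would hurt transfer efficiency and stability. Fluentd therefore uses a buffering model, which has two main concepts:

buffer_chunk: a buffer chunk that holds events which have been processed locally and are waiting to be sent to the destination. The size of each chunk is configurable.
buffer_queue: the queue that holds chunks. Its length is configurable.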

"chunks":

A buffer is essentially a set of "chunks".
A chunk is a collection of events concatenated into a single blob. A blob (Binary Large Object) is a large binary object: an immutable, file-like object of raw data.
Each chunk is managed one by one in the form of files (buf_file) or continuous memory blocks (buf_memory).


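Chunk lifecycle:

You can think of a chunk as a cargo box. A buffer plugin uses a chunk as a lightweight container and fills it with events coming from the input sources. When a chunk is full, it is "shipped" to the destination.

Internally, a buffer plugin has two separate places to store its chunks:

"stage", where chunks get filled with events (every newly created chunk starts in the stage),

and "queue", where chunks wait before transportation (a chunk moves into the queue in due course and is then transferred to the destination).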
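
The stage and queue described above can be sketched as a toy model. This is only an illustration of the idea, not Fluentd's implementation: the class BufferSketch and its methods are hypothetical names, and sizes are counted in bytes of UTF-8 text.

```python
from collections import deque


class BufferSketch:
    """Toy model of a buffer: a chunk fills in the stage, then waits in the queue."""

    def __init__(self, chunk_limit_size):
        self.chunk_limit_size = chunk_limit_size  # max bytes per chunk
        self.stage = []        # events of the chunk currently being filled
        self.stage_bytes = 0   # size of the staged chunk so far
        self.queue = deque()   # full chunks waiting to be shipped

    def emit(self, event):
        """Add one event (a str); enqueue the staged chunk first if the event would not fit."""
        size = len(event.encode("utf-8"))
        if self.stage and self.stage_bytes + size > self.chunk_limit_size:
            self.enqueue_stage()
        self.stage.append(event)
        self.stage_bytes += size

    def enqueue_stage(self):
        """Move the staged chunk to the tail of the queue (a flush timer would call this too)."""
        if self.stage:
            self.queue.append(self.stage)
            self.stage, self.stage_bytes = [], 0

    def ship(self):
        """Pop the oldest queued chunk, as a flush thread would before writing it out."""
        return self.queue.popleft() if self.queue else None


buf = BufferSketch(chunk_limit_size=10)
for line in ["aaaa", "bbbb", "cccc", "dddd"]:
    buf.emit(line)
print(len(buf.queue), buf.stage)
```

With a 10-byte limit, the first two events fill one chunk that moves to the queue, while the last two are still in the stage.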


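buffer_type: the buffer type, either file or memory.
buffer_chunk_limit: the size of each chunk; the default is 8MB. The value of buffer_chunk_limit should not exceed http.max_content_length in your Elasticsearch setup (100MB by default).
buffer_queue_limit: the maximum length of the chunk queue; the default is 256.
flush_interval: the interval at which a chunk is flushed.
retry_limit: the number of times a failed chunk is retried; the default is 17, after which the chunk's data is discarded. In the newer syntax: retry_max_times 17 # Maximum retry count before giving up
retry_wait: the interval before a failed chunk is resent; the default is 1s. After a second failure the interval is 2s, then 4s, and so on.
retry_type: exponential_backoff or periodic.
retry_max_interval: when retry_type is exponential_backoff, the wait interval is capped at retry_max_interval.
Note: buffer_type, buffer_chunk_limit, buffer_queue_limit and retry_limit are the v0.12 parameter names. In the v1 <buffer> section used in the configuration above they correspond to @type, chunk_limit_size, queue_limit_length and retry_max_times.
Fluentd gives up on sending failed chunks in the following two cases: 1. the number of retries exceeds retry_max_times (default: none); 2. the seconds elapsed since the first retry exceed retry_timeout (default: 72h). In these events, all chunks in the queue are discarded. To avoid this, enable retry_forever so that Fluentd retries indefinitely.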
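
To see how retry_wait, retry_max_interval, retry_max_times and retry_timeout interact, the retry waits can be tabulated with a small script. This is a simplified model under stated assumptions: it doubles the wait each time, ignores the randomization Fluentd applies, ignores the time each attempt takes, and simply stops when the next wait would pass retry_timeout. The function name retry_schedule is hypothetical.

```python
def retry_schedule(retry_wait=1.0, base=2.0, retry_max_interval=None,
                   retry_max_times=None, retry_timeout=72 * 3600):
    """Return (wait, elapsed) pairs in seconds for each retry under exponential backoff."""
    schedule = []
    elapsed = 0.0
    attempt = 0
    while retry_max_times is None or attempt < retry_max_times:
        wait = retry_wait * base ** attempt          # 1s, 2s, 4s, ...
        if retry_max_interval is not None:
            wait = min(wait, retry_max_interval)     # cap the wait
        if elapsed + wait > retry_timeout:           # would exceed the retry window
            break
        elapsed += wait
        attempt += 1
        schedule.append((wait, elapsed))
    return schedule


# Values from the <buffer> section above: retry_max_interval 30, retry_timeout 1m
for wait, elapsed in retry_schedule(retry_max_interval=30, retry_timeout=60):
    print(f"wait {wait:>4.0f}s, {elapsed:>4.0f}s since first failure")
```

Under this model, a 1-minute retry_timeout leaves room for only a handful of retries before the chunk is given up, which is worth keeping in mind when Elasticsearch outages can last longer than that.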

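As Fluentd keeps generating events and writing them into a chunk, the chunk keeps growing. Once the chunk reaches buffer_chunk_limit, or flush_interval has elapsed since the chunk was created, it is pushed onto the tail of the queue, whose size is bounded by buffer_queue_limit. Each time a new chunk is enqueued, the chunk at the head of the queue is immediately written to the configured backend; if the backend is Kafka, for example, the data is pushed to Kafka right away. Ideally, every chunk is written to the backend as soon as it enters the queue, and new chunks never arrive faster than old ones leave, so the queue stays essentially empty and holds at most one chunk.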

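In practice, because of the network and other factors, writing a chunk to the backend is often delayed or fails. When a write fails, the chunk stays in the queue and is resent after retry_wait; once the number of retries reaches retry_limit, the chunk is destroyed and its data is lost. Meanwhile new chunks keep entering the queue. If many chunks have not been written to the backend in time and the queue length reaches buffer_queue_limit, new events are rejected and Fluentd reports error="buffer space has too many data".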

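Another case is a slow network. Suppose a new chunk is produced every 3 seconds, writing one chunk to the backend takes 30 seconds, and the queue length is 100. During the time it takes to dequeue one chunk, 10 new chunks arrive, so the queue fills up quickly and the error occurs.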
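
The arithmetic in this example can be checked with a rough rate model. This is a sketch under simplifying assumptions: constant arrival and flush rates, no retries, and flush threads that work fully in parallel. The function name and its parameters are hypothetical; flush_threads plays the role of flush_thread_count from the configuration above.

```python
def seconds_until_queue_full(chunk_interval, flush_duration, queue_limit, flush_threads=1):
    """Estimate the seconds until the queue holds queue_limit chunks; None if it never fills."""
    arrival_rate = 1.0 / chunk_interval           # chunks enqueued per second
    drain_rate = flush_threads / flush_duration   # chunks shipped per second
    growth = arrival_rate - drain_rate            # net chunks added per second
    if growth <= 0:
        return None
    return queue_limit / growth


# One chunk every 3s, 30s per write, queue length 100
print(seconds_until_queue_full(3, 30, 100))                    # roughly 333 seconds
print(seconds_until_queue_full(3, 30, 100, flush_threads=10))  # None: the queue keeps up
```

With a single flush thread the queue gains about 9 chunks every 30 seconds and is full in under six minutes; in this model, ten parallel flush threads would be just enough to keep up.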

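fluent-plugin-elasticsearch extends Fluentd's built-in Output plugin and uses the compat_parameters plugin helper. Buffer options: https://github.com/uken/fluent-plugin-elasticsearch#buffer-options

The request_timeout parameter is useful when Elasticsearch cannot return a response to a bulk request within the default 5 seconds.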

request_timeout 15s # defaults to 5s

2. Error reported at http://192.168.31.10:16608/kibana:

https://www.cnblogs.com/quqibinggan/p/15709454.html

[elasticsearch] failed to write data into buffer by buffer overflow action=:throw_exception

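Set the buffer parameter chunk_limit_size to 100M, and set the overflow action to "block" so that an overflow blocks the input instead of raising an exception: overflow_action block.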
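
The difference between the overflow_action values throw_exception, block and drop_oldest_chunk can be illustrated with a small sketch. It is not Fluentd's code: the enqueue function is hypothetical, BufferOverflowError here is only a stand-in for the error Fluentd raises, and "block" is modeled by returning a status rather than actually waiting.

```python
from collections import deque


class BufferOverflowError(Exception):
    """Stand-in for the error raised when the buffer is full."""


def enqueue(queue, chunk, queue_limit, overflow_action="throw_exception"):
    """Add chunk to queue and return a short status string describing what happened."""
    if len(queue) < queue_limit:
        queue.append(chunk)
        return "queued"
    if overflow_action == "throw_exception":
        raise BufferOverflowError("buffer space has too many data")
    if overflow_action == "drop_oldest_chunk":
        queue.popleft()          # discard the oldest chunk to make room
        queue.append(chunk)
        return "dropped_oldest"
    if overflow_action == "block":
        return "blocked"         # the caller would wait and try again later
    raise ValueError(f"unknown overflow_action: {overflow_action}")


q = deque(["c1", "c2"])
print(enqueue(q, "c3", queue_limit=2, overflow_action="drop_oldest_chunk"), list(q))
print(enqueue(q, "c4", queue_limit=2, overflow_action="block"), list(q))
```

drop_oldest_chunk loses old data but keeps accepting new events, block loses nothing but stalls the input, and throw_exception surfaces the problem as the error shown above.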

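The flush_thread_count parameter applies to all output plugins. If the destination of Fluentd's output is a remote server or service, consider setting flush_thread_count in the configuration file; its default value is 1. Using multiple flush threads hides network latency and increases output concurrency.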

3. One of the key features of the log collector Fluentd is event routing. Fluentd relies on tags to route events. There are three ways to re-route Fluentd events:

1) by tag using the fluent-plugin-route plugin,
2) by label with the out_relabel plugin, or
3) by record content with the fluent-plugin-rewrite-tag-filter plugin

# In order to have more than one sort of input, add another <source> and @type with a specific tag:
<source>
  @type tail
  tag system.logs
</source>
Note: the "label" directive reduces the complexity of tag routing; it can be used to organize the internal routing of filter and match sections.
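
Tag-based routing can be sketched in a few lines. The matcher below is a simplified illustration, not Fluentd's implementation: it handles only the "*" wildcard (exactly one tag part) and the "**" wildcard (zero or more tag parts), and the names tag_matches, route and routes are hypothetical. As in a Fluentd configuration, the first pattern that matches wins.

```python
def tag_matches(pattern, tag):
    """Return True if tag matches pattern, where '*' is one part and '**' is zero or more parts."""
    p = pattern.split(".")
    t = tag.split(".")

    def rec(i, j):
        if i == len(p):
            return j == len(t)            # pattern used up: tag must be used up too
        if p[i] == "**":
            # let '**' absorb zero or more of the remaining tag parts
            return any(rec(i + 1, k) for k in range(j, len(t) + 1))
        if j == len(t):
            return False                  # pattern still expects a part
        if p[i] == "*" or p[i] == t[j]:
            return rec(i + 1, j + 1)
        return False

    return rec(0, 0)


# Ordered like <match> sections in a config file (hypothetical destinations)
routes = [("kubernetes.**", "k8s"), ("system.*", "system"), ("**", "default")]


def route(tag):
    """Return the destination of the first pattern that matches tag, or None."""
    for pattern, name in routes:
        if tag_matches(pattern, tag):
            return name
    return None


print(route("kubernetes.var.log.containers"))
print(route("system.logs"))
print(route("system.logs.extra"))
```

Because matching stops at the first hit, a catch-all pattern such as ** belongs at the end of the list, just as a catch-all match section belongs at the end of the configuration.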