Early fault detection and alerting in ELK with ElastAlert


脚下的路 2018-05-28 13:37:07 Category: ELK

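© Copyright belongs to the author. This is an original work by 脚下的路, published on the 51CTO blog. Contact the author for permission before reposting; unauthorized reposting may result in legal action.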


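Introduction:

A monitoring system matters a great deal to any business system. Much of the time it lets us deal with production problems promptly and keeps them from growing into bigger ones. Most monitoring today, however, only reacts after a problem has already occurred. Performance metrics can give some advance warning, but they are not a very reliable predictor. So how do we get better at sensing trouble early, and how do we nip problems in the bud?

Allow me a short digression into traditional Chinese medicine, where the most admired doctor is the one who treats "illness that has not yet appeared" (未病). Baidu Baike explains the idea as follows: "The superior doctor treats what is not yet ill" comes from the Huangdi Neijing (《黄帝内经》): "上工治未病，不治已病，此之谓也" ("the superior practitioner treats what is not yet ill rather than what is already ill"). Here "治" means to manage or regulate, so "治未病" means taking measures to prevent a disease from arising and developing. In Chinese medicine the core ideas are prevention before illness, and preventing deterioration once ill.

Can we apply the same thinking to operations monitoring? Of course we can. ELK does more than collect, store, analyze and visualize logs. We can also mine them for useful data: network device logs, server iDRAC logs, OS logs, application logs and so on, and match them against predefined error keywords to "treat what is not yet ill". Whether at the hardware, system or application layer, the first place an error shows up is the log. By mining logs thoroughly we can notice errors as soon as they appear and handle them early, before they become real incidents. That makes work more efficient and spares us the scramble when something breaks. Later this could be combined with a good AI algorithm to improve accuracy, which would be even better!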

# The project's official GitHub repository and documentation, for reference: https://github.com/Yelp/elastalert https://elastalert.readthedocs.io/en/latest/

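Let's get started

System: CentOS 7

1. ElastAlert needs a fixed directory, because later the program has to be run from inside it. It is usually placed under /usr/local/, but you can choose your own location.

2. Download ElastAlert with git. This requires git on your system; if it is missing, install it with yum.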

cd /usr/local/ 
git clone https://github.com/Yelp/elastalert.git

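3. Install the packages ElastAlert depends on. This step is mandatory; skipping it leads to errors.

yum install gcc libffi-devel python-devel openssl-devel

4. Use pip to install the specified version of setuptools, which is the minimum version ElastAlert requires. If pip is missing, install it with easy_install pip.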

pip install setuptools==1.1.6

5. Now enter the elastalert directory and get to work.

cd elastalert

6. ElastAlert lists its dependencies in requirements.txt, so running the command below installs all of them.

pip install -r requirements.txt

7. Everything is ready except ElastAlert itself. Once the command below has run, today's job is... well, there is still a long way to go.

pip install elastalert

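8. This is what it looks like once the installation finishes.

9. After ElastAlert is installed, three new commands are available on the system:

  elastalert-create-index       creates the index ElastAlert uses in Elasticsearch (elastalert_status by default)
  elastalert-test-rule          tests the rule settings in a custom configuration
  elastalert-rule-from-kibana   exports filters directly from Kibana 3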

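10. Run the following command to create ElastAlert's own index in Elasticsearch

elastalert-create-index

Enter the details of your Elasticsearch as appropriate. For the elastalert_status prompts, just press Enter to accept the defaults. The first and second prompts are the ones that matter; fill in the rest according to your setup.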

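11. Create the configuration file. The elastalert directory contains a config.yaml.example file; copy it and edit the copy to suit your setup.

cp config.yaml.example config.yaml
vi config.yaml

# Folder that holds the ElastAlert rules; adjust the path to wherever you put ElastAlert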
# This is the folder that contains the rule yaml files
# Any .yaml file will be loaded as a rule
rules_folder: /usr/local/elastalert/example_rules

# How often ElastAlert queries Elasticsearch for documents matching the defined rules. A match triggers an alert; otherwise it waits for the next run. The unit can be anything from weeks to seconds, as shown below.
# How often ElastAlert will query Elasticsearch
# The unit can be anything from weeks to seconds
run_every:
  #seconds: 1
  minutes: 1
  #hours: 1
  #days: 1
  #weeks: 1

# The maximum span of time buffered for each query, from its start to its end.
# ElastAlert will buffer results from the most recent
# period of time, in case some log sources are not in real time
buffer_time:
  minutes: 15

# IP address of your Elasticsearch host
# The Elasticsearch hostname for metadata writeback
# Note that every rule can have its own Elasticsearch host
es_host: 192.168.115.65

# Elasticsearch port
# The Elasticsearch port
es_port: 9200

# Whether to use TLS encryption
# Connect with TLS to Elasticsearch
#use_ssl: True

# Whether to enable TLS certificate verification
# Verify TLS certificates
#verify_certs: True

# Fill these in if Elasticsearch requires authentication
# Option basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# Location of the certificate files
# Use SSL authentication with client certificates client_cert must be
# a pem file containing both cert and key for client
#verify_certs: True
#ca_certs: /path/to/cacert.pem
#client_cert: /path/to/client_cert.pem
#client_key: /path/to/client_key.key

# The index that ElastAlert writes to in Elasticsearch
# The index on es_host which is used for metadata storage
# This can be a unmapped index, but it is recommended that you run
# elastalert-create-index to set a mapping
writeback_index: elastalert_status

# How long to keep retrying an alert that failed to send before giving up
# If an alert fails for some reason, ElastAlert will retry
# sending the alert until this time period has elapsed
alert_time_limit:
  days: 2
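
To make the relationship between the three time settings above easier to picture, here is a minimal Python sketch. It is not ElastAlert's own code: the `CONFIG` dict simply mirrors the values in this config.yaml, and `as_delta`, `query_window`, `next_run` and `should_retry` are hypothetical helper names. It assumes that each query looks back `buffer_time` from the moment it runs, and that a failed alert is retried until `alert_time_limit` has elapsed.

```python
from datetime import datetime, timedelta

# Values copied from the config.yaml above; the dict itself is illustrative.
CONFIG = {
    "run_every": {"minutes": 1},
    "buffer_time": {"minutes": 15},
    "alert_time_limit": {"days": 2},
}


def as_delta(setting):
    """Turn a YAML mapping such as {'minutes': 15} into a timedelta."""
    return timedelta(**setting)


def query_window(now, config=CONFIG):
    """Assumed window for one query: from (now - buffer_time) up to now."""
    return now - as_delta(config["buffer_time"]), now


def next_run(now, config=CONFIG):
    """Time of the following query, run_every later."""
    return now + as_delta(config["run_every"])


def should_retry(first_failure, now, config=CONFIG):
    """Keep retrying a failed alert until alert_time_limit has elapsed."""
    return now - first_failure < as_delta(config["alert_time_limit"])


if __name__ == "__main__":
    now = datetime(2018, 5, 28, 13, 37)
    start, end = query_window(now)
    print("query window:", start, "->", end)
    print("next run at:", next_run(now))
    print("still retrying after 1 day:", should_retry(now - timedelta(days=1), now))
    print("still retrying after 3 days:", should_retry(now - timedelta(days=3), now))
```

With `run_every` at one minute and `buffer_time` at fifteen, consecutive windows overlap heavily, which is what lets late-arriving log lines still be picked up.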

## Defining rules
cd example_rules/
cp example_frequency.yaml my_rule.yaml
vi my_rule.yaml


# Alert when the rate of events exceeds a threshold
# Elasticsearch host
# (Optional)
# Elasticsearch host
es_host: 192.168.115.65

# Elasticsearch port
# (Optional)
# Elasticsearch port
es_port: 9200

# Whether to connect over SSL
# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True

# If Elasticsearch requires authentication, put the username and password here
# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# The rule name must be unique, otherwise an error is raised. Once defined, it is used as the subject of the alert email.
# (Required)
# Rule name, must be unique
name: xx-xx-alert

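# Choose how the data is evaluated. The available types are any, blacklist, whitelist, change, frequency, spike, flatline, new_term and cardinality:
# any: alert on every match;
# blacklist: the value of the compare_key field matches any entry in the blacklist array;
# whitelist: the value of the compare_key field matches none of the entries in the whitelist array;
# change: for the same query_key, the value of the compare_key field changes within timeframe;
# frequency: for the same query_key, num_events filtered events occur within timeframe;
# spike: for the same query_key, the event counts of two consecutive timeframe windows differ by a ratio greater than spike_height. spike_type sets the direction (up, down or both). threshold_ref sets a minimum event count for the previous window and threshold_cur a minimum for the current window; below those minimums nothing is triggered;
# flatline: the event count within timeframe is below threshold;
# new_term: a value appears in fields that is not among the terms_size (default 50) most common values seen during the preceding terms_window_size (default 30 days);
# cardinality: for the same query_key, the number of distinct values of cardinality_field within timeframe rises above max_cardinality or falls below min_cardinality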
# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur with timeframe time
# I use frequency, which needs two conditions to hold together: for the same query_key, num_events filtered events occur within timeframe
type: frequency 
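
Of the rule types listed above, spike has the most moving parts, so here is a minimal Python sketch of the comparison it describes. This is an illustration written from the description in this post, not ElastAlert's implementation. `spike_detected` is a hypothetical function, `ref_count` and `cur_count` stand for the event counts of the previous and current timeframe windows, and treating "exceeds" as "at least spike_height times" as well as ignoring empty windows are assumptions.

```python
def spike_detected(ref_count, cur_count, spike_height,
                   spike_type="both", threshold_ref=0, threshold_cur=0):
    """Compare the previous window (ref) with the current one (cur)."""
    # Too little data in either window: do not trigger.
    if ref_count < threshold_ref or cur_count < threshold_cur:
        return False
    # Assumption: an empty window gives no meaningful ratio, so skip it.
    if ref_count == 0 or cur_count == 0:
        return False
    went_up = cur_count >= ref_count * spike_height
    went_down = cur_count * spike_height <= ref_count
    if spike_type == "up":
        return went_up
    if spike_type == "down":
        return went_down
    return went_up or went_down


if __name__ == "__main__":
    # 10 events in the previous window, 30 in the current one.
    print(spike_detected(10, 30, spike_height=3, spike_type="up"))
    # Same counts, but the previous window is below threshold_ref.
    print(spike_detected(10, 30, spike_height=3, threshold_ref=20))
```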

# This is the index as it appears in Kibana. Wildcards are supported, several indices can be listed, and if that is too much trouble a plain * works too.
# (Required)
# Index to search, wildcard supported
index: es-nginx*,winlogbeat*

# Number of events needed to trigger an alert
# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 5

# Works together with the parameter above: 5 events within 4 minutes trigger an alert
# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 4
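
The interplay of num_events and timeframe can be sketched as a sliding window. The following Python is only an illustration of that idea, not ElastAlert's code. `FrequencyWindow` is a hypothetical class, events are assumed to arrive in time order, and clearing the window after an alert is an assumption made to keep the example short.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyWindow:
    """Count matching events inside a sliding time window."""

    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe
        self.events = deque()

    def add(self, timestamp):
        """Record one matching event; return True when the threshold is reached."""
        self.events.append(timestamp)
        # Drop events that are older than timeframe relative to this one.
        while timestamp - self.events[0] > self.timeframe:
            self.events.popleft()
        if len(self.events) >= self.num_events:
            self.events.clear()  # assumption: start counting afresh after an alert
            return True
        return False


if __name__ == "__main__":
    # Same numbers as the rule above: 5 events within 4 minutes.
    window = FrequencyWindow(num_events=5, timeframe=timedelta(minutes=4))
    start = datetime(2018, 5, 28, 13, 0)
    for i in range(5):
        fired = window.add(start + timedelta(seconds=30 * i))
        print("event", i + 1, "alert:", fired)
```

To honor query_key, one such window would be kept per key value, for example in a dict.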

# This part is crucial: it defines which keywords in the message field should trigger an alert. It is simply an Elasticsearch query string and supports AND, OR and so on.
# (Required)
# A list of Elasticsearch filters used for find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- query:
    query_string: 
      query: "message: 错误  OR Error"
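
# To see how a filter like this ends up in a search request, here is a rough Python sketch that combines the rule's filter entries with a time range and builds the request body as a plain dict. Treat it as an approximation: the exact body ElastAlert sends depends on the Elasticsearch version, `build_query` is a hypothetical helper, and the `@timestamp` field name assumes the default timestamp_field.

```python
import json
from datetime import datetime

# The filter list from the rule above, as it looks after YAML parsing.
RULE_FILTER = [{"query": {"query_string": {"query": "message: 错误  OR Error"}}}]


def build_query(rule_filters, start, end, timestamp_field="@timestamp"):
    """Join a time range and the rule filters with AND in a bool query."""
    clauses = [{
        "range": {
            timestamp_field: {"gt": start.isoformat(), "lte": end.isoformat()}
        }
    }]
    for entry in rule_filters:
        # Assumption: a leading "query" wrapper is dropped for bool queries.
        clauses.append(entry.get("query", entry))
    return {"query": {"bool": {"filter": {"bool": {"must": clauses}}}}}


if __name__ == "__main__":
    body = build_query(
        RULE_FILTER,
        start=datetime(2018, 5, 28, 13, 22),
        end=datetime(2018, 5, 28, 13, 37),
    )
    print(json.dumps(body, indent=2, ensure_ascii=False))
```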
# Which method to use once an alert fires. The alerters listed below are supported, and the official docs also describe how to write your own. I use plain email.
# Custom rules and alerts:
# https://elastalert.readthedocs.io/en/latest/recipes/adding_rules.html#writingrules
# https://elastalert.readthedocs.io/en/latest/recipes/adding_alerts.html#writingalerts
#
# Command
# Email
# JIRA
# OpsGenie
# SNS
# HipChat
# Slack
# Telegram
# Debug
# Stomp
# (Required)
# The alert to use when a match is found
alert:
- "email"
# The alert_text you define is shown in the email body
alert_text: "Ref Log http://192.168.115.65"
# SMTP server of the alerting mailbox
smtp_host: smtp.126.com
# SMTP port of the alerting mailbox
smtp_port: 25
# Credentials go in a separate file, which needs two keys: user and password
smtp_auth_file: /usr/local/elastalert/example_rules/smtp_auth_file.yaml
email_reply_to: test@126.com
from_addr: test@126.com


# Addresses that receive the alerts. Several can be listed, though a mailing group is the better option in the long run.
# (required, email specific)
# a list of email addresses to send alerts to
email:
- "test@126.com"
- "test1@126.com"

Next, set up the SMTP auth file. Create the file:
vi smtp_auth_file.yaml
# with the following contents:

user: "test"
password: "test@12345"
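
For orientation, here is a small Python sketch of how the email settings in the rule and the credentials in this auth file could come together in an SMTP session. It uses only the standard library and is not ElastAlert's own alerter. `RULE`, `build_alert_email` and `send_alert` are hypothetical names, and the "ElastAlert: <rule name>" subject format is an assumption.

```python
import smtplib
from email.message import EmailMessage

# Email-related values from my_rule.yaml above; the dict is illustrative.
RULE = {
    "name": "xx-xx-alert",
    "alert_text": "Ref Log http://192.168.115.65",
    "from_addr": "test@126.com",
    "email_reply_to": "test@126.com",
    "email": ["test@126.com", "test1@126.com"],
}


def build_alert_email(rule):
    """Map the rule's email settings onto message headers and body."""
    msg = EmailMessage()
    msg["Subject"] = "ElastAlert: " + rule["name"]  # assumed subject format
    msg["From"] = rule["from_addr"]
    msg["Reply-To"] = rule["email_reply_to"]
    msg["To"] = ", ".join(rule["email"])
    msg.set_content(rule["alert_text"])
    return msg


def send_alert(msg, host, port, user, password):
    """Deliver the message; user and password come from smtp_auth_file."""
    with smtplib.SMTP(host, port) as server:
        server.login(user, password)
        server.send_message(msg)


if __name__ == "__main__":
    # Only build and print the message here; sending needs a reachable server.
    print(build_alert_email(RULE))
```

Calling `send_alert(msg, "smtp.126.com", 25, "test", "test@12345")` would then attempt delivery with the values from this post.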

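12. Use elastalert-test-rule to check whether the rule we wrote has any problems

elastalert-test-rule my_rule.yaml

The test output reports any problem it finds; if there is none, it tells you the run was successful.

# After I changed the Elasticsearch port in the configuration file to 9800, the following error appeared: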

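13. Once the configuration check passes, we can start the program. Run it in the foreground with all logging printed to the terminal, which makes verification easy.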

python -m elastalert.elastalert --verbose --rule my_rule.yaml

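## Generate an error yourself to check that an alert fires under the configured conditions. My test result is shown in the figure below; at the second arrow the output indicates that the alert was triggered and the email was sent.

# Time to check the result: is there an alert in the mailbox? Oh my... yes, there it is, as the figure below shows! ElastAlert queries Elasticsearch once a minute. num_hits is the number of documents the rule's query returned from our index for the past minute, and num_matches is how many times the rule's condition was met. With that, this great project is basically done, and what remains is wrapping up. As it happens, it caught a record of someone trying to log in to my machine, which shows just how valuable log-based alerting is!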

14. Next, there are two ways to keep ElastAlert running reliably in the background: as a system service, or under supervisor. Let's start with the first.

# Create the working directory under /etc
mkdir -p /etc/elastalert/rules
# Enter the working directory and copy the configuration file created earlier
cd /etc/elastalert/
cp /usr/local/elastalert/config.yaml config.yaml
# Enter the rules directory and copy the rule file and the SMTP auth file
cd rules/
cp /usr/local/elastalert/example_rules/my_rule.yaml my_rule.yaml
cp /usr/local/elastalert/example_rules/smtp_auth_file.yaml smtp_auth_file.yaml

# Next, update the paths in the configuration files
# In config.yaml:
rules_folder: /etc/elastalert/rules
# In my_rule.yaml:
smtp_auth_file: /etc/elastalert/rules/smtp_auth_file.yaml

# Next, create the systemd service
cd /etc/systemd/system/
vi elastalertd.service

[Unit]
Description=elastalertd
After=elasticsearch.service

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
WorkingDirectory=/usr/local/elastalert
ExecStart=/usr/bin/elastalert --config /etc/elastalert/config.yaml --rule /etc/elastalert/rules/my_rule.yaml
[Install]
WantedBy=multi-user.target
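
Since the files were just copied to new locations, it can be worth double-checking that ExecStart points at them. The Python sketch below parses the unit text with the standard library and pulls out the --config and --rule paths. `exec_start_args` is a hypothetical helper, and using configparser this way is an assumption that only holds for a simple unit file like this one with no repeated keys.

```python
import configparser
import shlex

# The unit file from above, pasted in as a string for the check.
UNIT_TEXT = """\
[Unit]
Description=elastalertd
After=elasticsearch.service

[Service]
Type=simple
User=root
Group=root
Restart=on-failure
WorkingDirectory=/usr/local/elastalert
ExecStart=/usr/bin/elastalert --config /etc/elastalert/config.yaml --rule /etc/elastalert/rules/my_rule.yaml

[Install]
WantedBy=multi-user.target
"""


def exec_start_args(unit_text):
    """Return the executable and a {flag: value} dict from ExecStart."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep the key case used by systemd
    parser.read_string(unit_text)
    argv = shlex.split(parser["Service"]["ExecStart"])
    # Pair each --flag with the value that follows it.
    flags = {argv[i]: argv[i + 1] for i in range(1, len(argv) - 1, 2)}
    return argv[0], flags


if __name__ == "__main__":
    executable, flags = exec_start_args(UNIT_TEXT)
    print("executable:", executable)
    for flag, value in flags.items():
        print(flag, "->", value)
```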

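# The moment of truth: enable the service at boot
systemctl enable elastalertd
# Start the service and check its status
systemctl start elastalertd
systemctl status elastalertd

15. With the first method done, on to the second: running the program under supervisor. An introduction to supervisor is available separately.

# Go to the supervisor configuration directory
cd /etc/supervisor/conf.d/
# Create the program's configuration file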

vi elastalert.conf
[program:elastic-alert]
command = /usr/bin/python -m elastalert.elastalert --rule /etc/elastalert/rules/my_rule.yaml --verbose 
directory= /usr/local/elastalert
autostart = true
autorestart = true
startsecs = 5
startretries = 3
user = root
redirect_stderr = true
stdout_logfile=/data/logs/elk/elastic-std.log
stderr_logfile=/data/logs/elk/elastic-error.log

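# Load the new configuration file with supervisorctl
supervisorctl
# This enters interactive mode; then run the update command
update

# All services can now be managed from the web admin interface, as shown below: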