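1. Introduction

We already monitor the company's critical internal services with zabbix, as described in an earlier post, but a few recent incidents pushed us to improve how we monitor.

One night during the National Day holiday we received more than 300 alert text messages, all reporting that the network had disconnected and then reconnected. Judging from the alerts, our first suspicion was an unstable network, but the staff on duty that night (from the business department) reported that all services and applications were working normally, which ruled out a network failure. We then suspected that the monitoring system itself was short on performance because of its MySQL database. The idea at the time was that restarting the database service would quickly show whether performance was the cause, but none of the specialist maintainers could do remote maintenance (they had only their phones with them over the holiday), and the problem was not worth cutting the holiday short to fix right away. That was when we thought how nice it would be to restart and inspect our own services directly from a web page.

A few days ago we also got a fault report saying the business system was unreachable, yet the maintainers' phones had received no alert. Investigation showed that the monitoring service had been stopped by someone. So monitoring the monitoring service itself has also become essential.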

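1.1 About Frigga

Frigga is a process monitoring tool open-sourced by Xiaomi (https://github.com/xiaomi-sa/frigga). It is an easy-to-use, highly extensible process monitoring framework. It is based on the open-source god and adds a web interface and an RPC interface to meet the needs of managing services across large clusters.

In Norse mythology, Frigga is the queen of the gods and the wife of Odin; she presides over marriage and the home and is responsible for weaving the clouds.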


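1.2 Frigga Features

Integrates god as the supervisor for programs

C/S (client/server) architecture with several built-in authentication methods, to support operations on large clusters

All basic functions expose API interfaces, for easy extension

Provides a web front end for god on a single machine, for easy viewing and management

Supports log viewing

Supports adding custom xmlrpc interfaces, for secondary development

1.3 Frigga Dependencies

Ruby 1.9.3 and bundle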

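2. Installing and Configuring Frigga

2.1 Configuring the YUM Repositories

  First configure the yum repositories the system needs, as shown below: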

wget http://mirrors.163.com/.help/CentOS6-Base-163.repo
mv CentOS6-Base-163.repo /etc/yum.repos.d/
wget http://mirrors.ustc.edu.cn/fedora/epel/6/x86_64/epel-release-6-8.noarch.rpm
wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm
rpm -ivh epel-release-6-8.noarch.rpm
rpm -ivh rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm
rpm -ivh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-remi


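2.2 Installing Frigga

First install the packages Frigga depends on, as shown below:

yum install git   # install git
yum -y install ruby rubygems  # install ruby and rubygems
curl -L https://get.rvm.io | bash -s stable  # install RVM, used to upgrade ruby
rvm install ruby-1.9.3  # you may need to log in again before this step
gem install bundle  # install bundle
cd /opt/  # install Frigga
git clone https://github.com/xiaomi-sa/frigga.git
cd /opt/frigga
./script/run.rb start  # start Frigga
god status  # check the startup status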


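2.3 Configuring Frigga

  By default, Frigga serves its web interface on port 9001 once started. The user name and password used for authentication are stored in /opt/frigga/conf/frigga.yml, as shown below: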

cat /opt/frigga/conf/frigga.yml
---
port: 9001
http_auth: ["admin", "password"]
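Since one goal of this setup is to keep an eye on the monitoring service itself, the configuration above can also be read by a small script. The following is only a sketch: it parses an inline copy of the frigga.yml shown above and prepares an authenticated request. It assumes that http_auth means HTTP basic authentication and that the web interface answers on the path "/"; neither is confirmed here, and load_frigga_config and build_status_request are hypothetical helper names.

```ruby
require 'yaml'
require 'net/http'

# Inline copy of the frigga.yml shown above, so the sketch runs anywhere.
SAMPLE_CONFIG = <<-EOS
---
port: 9001
http_auth: ["admin", "password"]
EOS

# Hypothetical helper: parse the YAML text into a plain Ruby hash.
def load_frigga_config(text)
  YAML.load(text)
end

# Hypothetical helper: build a GET request carrying the configured credentials.
# Assumption: http_auth is HTTP basic auth and "/" is a valid path.
def build_status_request(config, path = '/')
  user, password = config.fetch('http_auth')
  request = Net::HTTP::Get.new(path)
  request.basic_auth(user, password)
  request
end

config = load_frigga_config(SAMPLE_CONFIG)
request = build_status_request(config)
puts "would GET http://127.0.0.1:#{config['port']}#{request.path} as #{config['http_auth'].first}"

# To actually send it against a running Frigga instance:
#   response = Net::HTTP.start('127.0.0.1', config['port']) { |http| http.request(request) }
#   puts response.code
```

In practice the text would come from File.read('/opt/frigga/conf/frigga.yml') rather than an inline string, and the default password should be changed before the port is exposed.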


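2.4 Configuring SSH Monitoring

Configuration files end with the .god suffix and can be placed in the gods directory. The example below uses the ssh service on CentOS 6.4. If the sshd process enters the start or restart state 5 times within 5 minutes, it is treated as failing to start and its state is changed to unmonitored. Another start attempt is made 10 minutes later, and if 5 attempts within 2 hours all fail, god gives up completely.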

cd /opt/frigga/gods
vim sshd.god
God.watch do |w|
  w.name = 'sshd'
  w.start = "/etc/init.d/sshd start"
  w.stop = "/etc/init.d/sshd stop"
  w.restart = "/etc/init.d/sshd restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/sshd.pid'
  # remove a stale pid file left behind by a dead process
  w.behavior(:clean_pid_file)
  # start the service whenever the process is not running
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle: stop monitoring a flapping service, then retry later
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end

Take care to get the path to the sshd command right. Once the settings above are complete, load the sshd watch with the following command:

god load sshd.god
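To make the flapping rule in this section more concrete, here is a rough model of the "5 times within 5 minutes" check. It is an illustration only and not god's actual implementation: FlappingDetector and record are hypothetical names, timestamps are plain seconds, and the retry_in / retry_times / retry_within handling is left out.

```ruby
# Illustrative model of a sliding-window flapping check.
# Not god's real code; names and structure are assumptions.
class FlappingDetector
  # times:  how many start/restart transitions count as flapping
  # within: length of the sliding window in seconds
  def initialize(times, within)
    @times = times
    @within = within
    @events = []
  end

  # Record one start/restart transition at time `now` (in seconds).
  # Returns true when at least `times` transitions fall inside the window.
  def record(now)
    @events << now
    # drop transitions that are older than the window
    @events.reject! { |t| t <= now - @within }
    @events.size >= @times
  end
end

# Five restarts one minute apart: the fifth call is the first to report flapping.
detector = FlappingDetector.new(5, 5 * 60)
[0, 60, 120, 180, 240].each do |t|
  puts "t=#{t}s flapping=#{detector.record(t)}"
end
```

With the values used in sshd.god, a service that restarts once a minute trips the check on its fifth restart, while one that restarts every 100 seconds never has five transitions inside a single 5-minute window.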


2.5 Configuring Apache Monitoring

God.watch do |w|
  w.name = 'apache'
  w.start = "/etc/init.d/httpd start"
  w.stop = "/etc/init.d/httpd stop"
  w.restart = "/etc/init.d/httpd restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/httpd/httpd.pid'
  # remove a stale pid file left behind by a dead process
  w.behavior(:clean_pid_file)
  # start the service whenever the process is not running
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle: stop monitoring a flapping service, then retry later
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
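Each watch depends on its pid_file pointing at the right place, so it can help to check the path by hand before loading a configuration. The sketch below shows one way to test whether the process named in a pid file is alive. It is a standalone approximation of the idea behind the :process_running condition, not god's own code, and process_running? is a hypothetical helper name.

```ruby
# Hypothetical helper: true if the pid stored in pid_file belongs to a live process.
def process_running?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  # Signal 0 performs the existence and permission check without delivering a signal.
  Process.kill(0, pid)
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  # missing pid file, no such process, or a pid file without a valid number
  false
rescue Errno::EPERM
  # the process exists but belongs to another user
  true
end

# The pid file paths used by the watches in this article.
%w[/var/run/sshd.pid /var/run/httpd/httpd.pid /var/run/mysqld/mysqld.pid].each do |file|
  puts "#{file}: #{process_running?(file) ? 'running' : 'not running'}"
end
```

If a service is known to be up but the script reports "not running", the pid_file path in the .god file is the first thing to check.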


2.6 Configuring MySQL Monitoring

God.watch do |w|
  w.name = 'mysql'
  w.start = "/etc/init.d/mysqld start"
  w.stop = "/etc/init.d/mysqld stop"
  w.restart = "/etc/init.d/mysqld restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '/var/run/mysqld/mysqld.pid'
  # remove a stale pid file left behind by a dead process
  w.behavior(:clean_pid_file)
  # start the service whenever the process is not running
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  # lifecycle: stop monitoring a flapping service, then retry later
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
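The sshd, apache and mysql watches differ only in the name, the init script and the pid file, so copy-and-paste slips are easy to make. One possible way to avoid them is to render the .god text from a single template. This is a sketch under assumptions, not part of Frigga or god: WATCH_TEMPLATE, SERVICES and render_watch are hypothetical names, and the script only prints the text instead of writing files.

```ruby
# Template for one watch; %{...} placeholders are filled in by render_watch.
WATCH_TEMPLATE = <<-'GOD'
God.watch do |w|
  w.name = '%{name}'
  w.start = "%{init} start"
  w.stop = "%{init} stop"
  w.restart = "%{init} restart"
  w.interval = 30.seconds
  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = '%{pid}'
  w.behavior(:clean_pid_file)
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
GOD

# Service name => [init script, pid file], taken from the examples in this article.
SERVICES = {
  'sshd'   => ['/etc/init.d/sshd',   '/var/run/sshd.pid'],
  'apache' => ['/etc/init.d/httpd',  '/var/run/httpd/httpd.pid'],
  'mysql'  => ['/etc/init.d/mysqld', '/var/run/mysqld/mysqld.pid']
}

# Hypothetical helper: fill the template for one service and return the text.
def render_watch(name, init_script, pid_file)
  format(WATCH_TEMPLATE, name: name, init: init_script, pid: pid_file)
end

SERVICES.each do |name, (init_script, pid_file)|
  puts "# ---- #{name}.god ----"
  puts render_watch(name, init_script, pid_file)
end
```

Each printed block could be saved as a file under /opt/frigga/gods and loaded with god load, in the same way as sshd.god.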
