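Notes on compiling, installing, and configuring a complete Nagios monitoring system

Original article

qubaoquan, 2010-04-08 17:29:52. Blog category: Monitoring

Copyright belongs to the author: this is an original work by qubaoquan on the 51CTO blog. Contact the author for permission before reposting.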

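Test environment:
Nagios monitoring server: 192.168.1.240, hostname nagios
Monitored server: 192.168.1.208, hostname apache
Required packages:
(zlib-1.2.3.tar.gz, libpng-1.2.10.tar.bz2, jpegsrc.v6b.tar.gz, freetype-2.1.10.tar.gz, gd-2.0.35.tar.gz), httpd-2.2.6.tar.gz, imagepak-base.tar.gz, mysql-5.0.75.tar.gz, nagios-3.1.0.tar.gz, nagios-plugins-1.4.13.tar.gz, nrpe-2.12.tar.gz
Perform the following steps on the monitoring server.
First make sure that the packages listed in parentheses above are already installed (a default installation is enough).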
1. Install the Nagios core
1) Build Nagios:
tar -zxvf nagios-3.1.0.tar.gz
cd nagios-3.1.0
./configure --prefix=/usr/local/nagios --with-gd-lib=/usr/local/lib --with-gd-inc=/usr/local/include
(The location of the gd library must be given; otherwise Nagios cannot open the status map.)
2) Create the user and set permissions:
groupadd nagios
useradd -g nagios nagios
mkdir /usr/local/nagios
chown -R nagios:nagios /usr/local/nagios
make all
make install               # installs the main program, CGIs, and HTML files
make install-init          # installs the init script in /etc/rc.d/init.d
make install-commandmode   # sets the permissions on the external command directory
make install-config        # installs the sample configuration files in /usr/local/nagios/etc
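3) Verify the installation:
Change to the installation prefix (here /usr/local/nagios) and check that the five directories etc, bin, sbin, share, and var exist. If they do, the program was installed correctly. The table below briefly describes each directory:
bin            Nagios executables; the file nagios is the main program
etc            Nagios configuration files; right after installation it holds only a few *.cfg-sample files
sbin           Nagios CGI files, that is, the files needed to run external commands
share          Nagios web page files
var            Nagios log files, pid file, and similar files
var/archives   Empty directory for the archived logs
var/rw         Empty directory for the external command file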
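The directory check described above can be scripted. This is only a minimal sketch: the function name check_nagios_layout and its default prefix argument are made up for illustration and are not part of Nagios.

```shell
#!/bin/sh
# Report which of the five expected directories are missing under a prefix.
# check_nagios_layout is a hypothetical helper, not something Nagios ships.
check_nagios_layout() {
    prefix="${1:-/usr/local/nagios}"
    missing=0
    for dir in bin etc sbin share var; do
        if [ ! -d "$prefix/$dir" ]; then
            echo "missing: $prefix/$dir"
            missing=1
        fi
    done
    # Return 0 when all five directories exist, 1 otherwise.
    return "$missing"
}

# Possible usage:
#   check_nagios_layout /usr/local/nagios && echo "layout looks complete"
```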
2. Install the plugins
1) Unpack and build:
tar -zxvf nagios-plugins-1.4.13.tar.gz
cd nagios-plugins-1.4.13
./configure --prefix=/usr/local/nagios/ --enable-redhat-pthread-workaround
(On Red Hat systems configure may hang at "checking for redhat spopen problem..."; passing --enable-redhat-pthread-workaround avoids this.)
make
make install
ls /usr/local/nagios/libexec/
This lists the installed plugin files; all plugins are installed in the libexec directory. (The plugins are the programs that the commands defined in commands.cfg call.)
Note: if this plugin directory does not exist, copy the plugins over with the following command:
cp -r /usr/local/nagios-plugins/libexec /usr/local/nagios/
2) Install imagepak-base.tar.gz:
tar -xvzf imagepak-base.tar.gz
mv base/ /usr/local/nagios/share/images/logos/
3) Add the Apache run user to the nagios group:
Find the current Apache run user in httpd.conf:
grep ^User /usr/local/apache2/conf/httpd.conf
Mine is daemon. Add this user to the nagios group (-a keeps the user's existing supplementary groups):
usermod -a -G nagios daemon
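The lookup of the Apache run user can be wrapped in a small helper so that the user name does not have to be copied by hand. This is a sketch; apache_run_user is a hypothetical name, and it assumes httpd.conf holds a single active User directive.

```shell
#!/bin/sh
# Print the account named by the first active User directive in an Apache
# config file. Commented-out lines are skipped because their first field
# is "#User" or "#", not "User".
apache_run_user() {
    awk '$1 == "User" { print $2; exit }' "$1"
}

# Possible usage (run as root):
#   usermod -a -G nagios "$(apache_run_user /usr/local/apache2/conf/httpd.conf)"
```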
4) Modify the Apache configuration:
Edit the Apache configuration file to add the Nagios directories, and require authentication to access them.
vi /usr/local/apache2/conf/httpd.conf
Append the following at the end:
ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin
<Directory "/usr/local/nagios/sbin">
    Options ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    # file used to authenticate access to this directory
    AuthUserFile /usr/local/nagios/etc/htpasswd
    Require valid-user
</Directory>
Alias /nagios /usr/local/nagios/share
<Directory "/usr/local/nagios/share">
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    # file used to authenticate access to this directory
    AuthUserFile /usr/local/nagios/etc/htpasswd
    Require valid-user
</Directory>
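5) Add an authentication user:
This is the user you must log in as when accessing Nagios through the web. Here we add the user test with password 123456:
/usr/local/apache2/bin/htpasswd -c /usr/local/nagios/etc/htpasswd test
New password: (enter 123456)
Re-type new password: (enter the password again)
Adding password for user test
6) View the contents of the authentication file:
[root@localhost conf]# less /usr/local/nagios/etc/htpasswd
test:OmWGEsBnoGpIc
The part before the colon is the user name test; the part after it is the encrypted password.
At this point the Nagios installation is basically complete, and you can access it through the web:
http://192.168.1.240/nagios
A dialog box asks for a user name and password. Enter test and 123456 to reach the Nagios main page.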
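However, nothing can be opened yet, because Nagios has not been started. The next task is to edit the configuration files and add the hosts and services to monitor.
3. Typical configuration
To put Nagios to use, you must edit the configuration files and add the hosts and services to monitor. Before doing that, you need to understand the following concepts.
1) Background:
Nagios defines a few basic objects. The commonly used ones are:
Contact (contact): who is notified when something goes wrong, usually the system administrator
Time period (timeperiod): 7x24 around the clock, Monday to Friday, or any other custom period
Monitored host (host): a server to be monitored, which can be the monitoring server itself
Check command (command): the command Nagios issues to perform a given check; these are also user-defined
Monitored service (service): for example whether a host is alive, whether port 80 is open, disk usage, or a custom service
Note: several monitored hosts can be defined as a host group, and several contacts can be defined as a contact group.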
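2) Copy the sample configuration files to their real names:
cd /usr/local/nagios/etc
Rename all the .cfg-sample template files here to .cfg.
3) Edit the configuration files:
Edit the main Nagios configuration file nagios.cfg:
vi nagios.cfg
Comment out the localhost.cfg entry by putting # in front of it, and list the object configuration files:
#cfg_file=/usr/local/nagios/etc/localhost.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/templates.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
These entries point to the contact, contact group, command, host, host group, template, time period, and service definitions, in that order.
Configure the remaining settings to suit your environment. (None of the .cfg files listed above exist in the etc directory yet; they must be created by hand, or Nagios will report errors at startup. Alternatively, point the paths at etc/objects, which already holds several default files. Choose whichever suits you.)
Change check_external_commands=0 to check_external_commands=1. This allows actions such as restarting Nagios or disabling host and service checks from the web interface.
Change command_check_interval from the default of 1 to command_check_interval=10s (choose an interval that fits your situation, neither too long nor too short).
Those are essentially all the changes needed in the main configuration file. After them, /usr/local/nagios/etc still lacks hosts.cfg and the other files; we will create them by hand shortly.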
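Since Nagios reports an error for every cfg_file entry whose file does not exist, it can help to list the missing files before creating them. The following is a minimal sketch; missing_cfg_files is a hypothetical helper, and it assumes each cfg_file= line holds a plain path with no trailing comment.

```shell
#!/bin/sh
# Print every path named by an active cfg_file= line in the main config
# whose file does not exist yet. Lines starting with # are ignored.
missing_cfg_files() {
    grep '^cfg_file=' "$1" | cut -d= -f2- | while IFS= read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}

# Possible usage:
#   missing_cfg_files /usr/local/nagios/etc/nagios.cfg
```

Each path it prints is a file that still has to be written in the steps that follow.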
4) Check the configuration for errors:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Output like the following means the configuration has no errors:
Total Warnings: 0
Total Errors:   0
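If the check is run from a script, the two summary lines can be tested instead of being read by eye. This is a sketch under the assumption that the summary lines look like the ones shown above; preflight_clean is a made-up helper name.

```shell
#!/bin/sh
# Read pre-flight check output on stdin and succeed only when it reports
# zero warnings and zero errors.
preflight_clean() {
    output=$(cat)
    printf '%s\n' "$output" | grep -Eq '^Total Warnings:[[:space:]]*0[[:space:]]*$' &&
    printf '%s\n' "$output" | grep -Eq '^Total Errors:[[:space:]]*0[[:space:]]*$'
}

# Possible usage:
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | preflight_clean \
#       && echo "configuration is clean"
```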
5) Edit the CGI control file cgi.cfg:
vi cgi.cfg
The second file to edit is cgi.cfg, which controls the CGI scripts. First make sure that use_authentication=1. Then set default_user_name=test (the user created earlier). The remaining changes are listed below; the values are user names:
authorized_for_system_information=nagiosadmin,mandahang,test
authorized_for_configuration_information=nagiosadmin,mandahang,test
authorized_for_system_commands=mandahang,test
authorized_for_all_services=nagiosadmin,mandahang,test
authorized_for_all_hosts=nagiosadmin,mandahang,test
authorized_for_all_service_commands=nagiosadmin,mandahang,test
authorized_for_all_host_commands=nagiosadmin,mandahang,test
Note: add the newly created user test to each of the lines above.
Where do these user names come from? They are created by running /usr/local/apache2/bin/htpasswd -c /usr/local/nagios/etc/htpasswd test (for the user test). Omit -c when adding further users, because -c recreates the file.
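A user listed in cgi.cfg can only log in if the htpasswd file has an entry for it. The sketch below lists the names that appear in authorized_for_* lines but not in the htpasswd file; unknown_cgi_users is a hypothetical helper written for this walkthrough.

```shell
#!/bin/sh
# Print, one per line and sorted, every user named in an authorized_for_*
# line of cgi.cfg that has no entry in the htpasswd file.
unknown_cgi_users() {
    cgi_cfg="$1"
    htpasswd_file="$2"
    grep '^authorized_for_' "$cgi_cfg" | cut -d= -f2- | tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | sort -u |
        while IFS= read -r user; do
            [ -n "$user" ] || continue
            # Compare against the user-name column of the htpasswd file.
            cut -d: -f1 "$htpasswd_file" | grep -Fxq -- "$user" || echo "$user"
        done
}

# Possible usage:
#   unknown_cgi_users /usr/local/nagios/etc/cgi.cfg /usr/local/nagios/etc/htpasswd
```

With the files from this walkthrough it would report nagiosadmin and mandahang, since only test has been created with htpasswd so far.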
6) Write the object configuration files
(1) Define the monitoring time period; create timeperiods.cfg:
vi timeperiods.cfg
define timeperiod{
    timeperiod_name 24x7    ; name of the time period, must not contain spaces
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
}
This defines a time period named 24x7 that covers all 24 hours of every day.
(2) Define a contact; create contacts.cfg:
vi contacts.cfg
define contact{
    contact_name                    test    ; name of the contact, must not contain spaces
    alias                           sysadmin
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,service-notify-by-sms
    host_notification_commands      host-notify-by-email,host-notify-by-sms
    email                           qubaoquan@ccpower.com.cn
    pager                           1391119xxxx
}
This creates a contact named test. The important options are explained below:
service_notification_period 24x7
The time period during which service problems are notified; this is the period defined in timeperiods.cfg above.
host_notification_period 24x7
The time period during which host problems are notified; again the period defined in timeperiods.cfg above.
service_notification_options w,u,c,r
Notify the contact in four cases: w (warning), u (unknown), c (critical), or r (recovery from a problem state).
host_notification_options d,u,r
Notify the contact in three cases: d (down), u (unreachable), or r (recovery from a problem state).
service_notification_commands notify-by-email,service-notify-by-sms
The commands used to notify about service problems. notify-by-email and service-notify-by-sms are defined in commands.cfg; they send the contact an email and a text message.
host_notification_commands host-notify-by-email,host-notify-by-sms
Likewise, host problems are notified by email and text message.
email qubaoquan@ccpower.com.cn
The contact's email address.
pager 1391119xxxx
The contact's phone number.
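(3) Several contacts can now be combined into a contact group; create contactgroups.cfg:
vi contactgroups.cfg
define contactgroup{
    contactgroup_name   sagroup                 ; name of the contact group, again without spaces
    alias               System Administrators   ; alias
    members             test                    ; group members from contacts.cfg above; separate several contacts with commas
}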
(4) Define the monitored host; create hosts.cfg:
vi hosts.cfg
define host{
    host_name               nagios              ; name of the monitored host, preferably without spaces
    alias                   nagios              ; alias
    address                 192.168.1.240       ; IP address of the monitored host
    check_command           check-host-alive    ; defined in commands.cfg, checks whether the host is alive
    max_check_attempts      5                   ; number of retries after a failed check
    check_period            24x7                ; check period, defined earlier in timeperiods.cfg
    contact_groups          sagroup             ; contact group, defined above in contactgroups.cfg
    notification_interval   10                  ; re-notify every 10 time units (minutes with the default interval_length of 60)
    notification_period     24x7                ; notification period, also from timeperiods.cfg
    notification_options    d,u,r               ; states that trigger a notification
}
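(5) Further hosts can be defined by simply copying and adjusting this definition. Here we add a second machine, hostname apache, IP 192.168.1.208. Just as contacts can be combined into a contact group, several hosts can be combined into a host group.
Create hostgroups.cfg:
vi hostgroups.cfg
define hostgroup{
    hostgroup_name  linux-servers   ; host group name
    alias           linux-servers   ; alias
    members         nagios,apache   ; member hosts, comma-separated; each must be defined in hosts.cfg above
}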
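The host group above names the host apache, which therefore also needs a definition in hosts.cfg. One way to produce it without copying by hand is a small generator. This is a sketch only: emit_host is a hypothetical helper, and it simply reuses the settings of the nagios host definition from step (4).

```shell
#!/bin/sh
# Print a host definition for the given name and address, using the same
# check and notification settings as the nagios host defined earlier.
emit_host() {
    name="$1"
    address="$2"
    cat <<EOF
define host{
    host_name               $name
    alias                   $name
    address                 $address
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    contact_groups          sagroup
    notification_interval   10
    notification_period     24x7
    notification_options    d,u,r
}
EOF
}

# Possible usage, appending the second host of this walkthrough:
#   emit_host apache 192.168.1.208 >> /usr/local/nagios/etc/hosts.cfg
```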
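(6) Now comes the most important part. Nagios is mainly used to monitor all kinds of information about a host, including local resources, externally offered services, and so on. Nagios defines each of these as an item (Nagios calls them services; to distinguish them from the services a host provides, I use the word item here), and each item is implemented through a command defined in commands.cfg.
For example, suppose we want an item that checks whether a machine's web service is working. What elements are needed? The three most important are: which machine to monitor, which command performs the check, and which contact to notify when something goes wrong.
Define the monitoring items, also called services; create services.cfg:
vi services.cfg
define service{
    host_name               nagios              ; monitored host, defined in hosts.cfg
    service_description     check-host-alive    ; description of the item, shown on the web page
    check_command           check-host-alive
    max_check_attempts      5                   ; number of retries
    normal_check_interval   3                   ; interval between regular checks
    retry_check_interval    2
    check_period            24x7                ; monitoring period, defined in timeperiods.cfg
    notification_interval   10
    notification_period     24x7                ; notification period
    notification_options    w,u,c,r             ; notify contacts when the result is w, u, c, or r
    contact_groups          sagroup             ; contact group, defined in contactgroups.cfg
}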
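(7) Define the commands referenced in the configuration files above:
vi commands.cfg
# host-notify-by-sms: send an SMS alert for a host
define command{
    command_name    host-notify-by-sms
    command_line    /usr/local/bin/sms_send "Host $HOSTSTATE$ alert for $HOSTNAME$! on '$DATETIME$' " $CONTACTPAGER$
}
# service-notify-by-sms: send an SMS alert for a service
define command{
    command_name    service-notify-by-sms
    command_line    /usr/local/bin/sms_send "'$HOSTADDRESS$' $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTPAGER$
}
The other commands are already in the file and need no changes; only the SMS alert commands have to be defined. Check that the notification command names used in contacts.cfg match the ones in your commands.cfg: the sample file shipped with Nagios 3.x names the email commands notify-host-by-email and notify-service-by-email.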
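(8) That completes the configuration. The functionality is simple, but it lays a good foundation for later extension.
Before running Nagios, test the configuration:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
If you see the following, everything is fine:
Total Warnings: 0
Total Errors:   0
Start Nagios in the background as a daemon:
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Log in at http://192.168.1.240/nagios/ to take a look, and click Host Detail on the left.
4. Monitor local resources on remote hosts with NRPE
The checks so far run from the monitoring server itself, so for local information on other machines Nagios is rather helpless: without appropriate access to the monitored host it cannot obtain that information. To solve this, Nagios has an add-on component, NRPE, which makes it possible to monitor the "local information" of Linux hosts.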
On the monitored host
1) Add the user:
useradd nagios
Set its password:
# passwd nagios
2) Install the Nagios plugins:
tar -zxvf nagios-plugins-1.4.9.tar.gz
cd nagios-plugins-1.4.9
Configure, build, and install:
./configure --enable-redhat-pthread-workaround
make
make install
When this step finishes, two directories, libexec and share, have been created under /usr/local/nagios/.
3) Verify:
ls /usr/local/nagios/
libexec share
4) Fix the directory ownership:
# chown nagios:nagios /usr/local/nagios
# chown -R nagios:nagios /usr/local/nagios/libexec
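5) Install NRPE
1. Unpack:
tar -zxvf nrpe-2.8.1.tar.gz
cd nrpe-2.8.1
2. Configure:
./configure
The NRPE port is 5666.
3. Build:
make all
Next, install the NRPE plugin, the daemon, and the sample configuration file.
4. Install the check_nrpe plugin:
make install-plugin
The monitoring server needs the check_nrpe plugin; the monitored host does not. We install it here only for testing.
5. Install the daemon:
make install-daemon
6. Install the configuration file:
make install-daemon-config
7. The nagios directory now contains four directories:
# ls /usr/local/nagios/
bin etc libexec share
The NRPE daemon runs as a service under xinetd, so xinetd must be installed first; most systems install it by default.
6) Install the xinetd script:
# make install-xinetd
The output is:
/usr/bin/install -c -m 644 sample-config/nrpe.xinetd /etc/xinetd.d/nrpe
This shows that the file /etc/xinetd.d/nrpe was created.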
7) Edit this script:
vi /etc/xinetd.d/nrpe
# default: on
# description: NRPE (Nagios Remote Plugin Executor)
service nrpe
{
    flags           = REUSE
    socket_type     = stream
    port            = 5666
    wait            = no
    user            = nagios
    group           = nagios
    server          = /usr/local/nagios/bin/nrpe
    server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
    log_on_failure  += USERID
    disable         = no
    only_from       = 127.0.0.1
}
On the only_from line, append the address of the monitoring server, 192.168.1.240, separated by a space. After the change:
only_from = 127.0.0.1 192.168.1.240
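8) Edit /etc/services and add the NRPE service:
vi /etc/services
Add the following:
# Local services
nrpe 5666/tcp # nrpe
9) Restart the xinetd service:
# service xinetd restart
10) Start nrpe (only needed when NRPE runs as a standalone daemon rather than under xinetd; the two cannot both listen on port 5666):
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
11) Check that NRPE is running:
# netstat -at|grep nrpe
tcp 0 0 *:nrpe *:* LISTEN
# netstat -an|grep 5666
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN
Port 5666 is now listening.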
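12) Test that NRPE works:
The check_nrpe plugin installed earlier for testing now comes into play. Run
/usr/local/nagios/libexec/check_nrpe -H localhost
and it returns the current NRPE version:
# /usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.8.1
This shows that check_nrpe can connect to the nrpe daemon locally.
Note: for the following steps to go smoothly, open port 5666 in the local firewall so that the external monitoring server can reach it.
Run /usr/local/nagios/libexec/check_nrpe -h to see the usage of this command.
The usage is: check_nrpe -H <monitored host> -c <check command to run>
Note: the command given after -c must be defined in nrpe.cfg; the NRPE daemon runs only the commands defined there.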
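13) View the NRPE check commands:
cd /usr/local/nagios/etc
vi nrpe.cfg
# The following examples use hardcoded command arguments...
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
Note: any other commands must be added by yourself.
The name in square brackets is what can follow the -c option of check_nrpe, and the part after the equals sign is the plugin program actually executed. (This closely resembles how commands are defined in commands.cfg, except that it is written on one line.) In other words, check_users is shorthand for /usr/local/nagios/libexec/check_users -w 5 -c 10.
The five commands defined above check the number of logged-in users, the CPU load, the capacity of hda1, zombie processes, and the total number of processes. For the exact meaning of each command, see the plugin's usage (run "<plugin name> -h").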
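To see at a glance which names check_nrpe -c will accept on a given host, the bracketed names can be pulled out of nrpe.cfg. This is a minimal sketch; nrpe_command_names is a hypothetical helper and assumes the command[name]=... layout shown above.

```shell
#!/bin/sh
# Print the name inside the square brackets of every active command[...]=
# line in an nrpe.cfg file. Commented-out lines do not match.
nrpe_command_names() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Possible usage:
#   nrpe_command_names /usr/local/nagios/etc/nrpe.cfg
```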
Because -c accepts only commands defined in nrpe.cfg, only the five commands above are available for now. We can try them on the local machine:
/usr/local/nagios/libexec/check_nrpe -H localhost -c check_users
/usr/local/nagios/libexec/check_nrpe -H localhost -c check_load
/usr/local/nagios/libexec/check_nrpe -H localhost -c check_hda1
/usr/local/nagios/libexec/check_nrpe -H localhost -c check_zombie_procs
/usr/local/nagios/libexec/check_nrpe -H localhost -c check_total_procs
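Besides the line of text each of these commands prints, a plugin reports its result through its exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. When testing by hand it can help to translate the code into its name. The helper below is a sketch; plugin_state is a made-up name, and the OUT-OF-RANGE label for other codes is my own wording.

```shell
#!/bin/sh
# Translate a plugin exit code into the state name it stands for.
plugin_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OUT-OF-RANGE ;;
    esac
}

# Possible usage:
#   /usr/local/nagios/libexec/check_nrpe -H localhost -c check_load
#   plugin_state $?
```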
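On the monitoring host running Nagios
Nagios is already running. What remains to be done:
- install the check_nrpe plugin
- create the check_nrpe command definition in commands.cfg, because only commands defined in commands.cfg can be used in services.cfg
- create the monitoring items for the monitored host
1) Install the check_nrpe plugin:
# tar -zxvf nrpe-2.8.1.tar.gz
# cd nrpe-2.8.1
# ./configure
# make all
# make install-plugin
Only this install step is needed, because only the check_nrpe plugin is required.
2) NRPE has just been installed on apache. Now test the communication between check_nrpe on the monitoring server and the nrpe daemon running on the monitored host:
# /usr/local/nagios/libexec/check_nrpe -H 192.168.1.208
NRPE v2.8.1
The NRPE version is returned correctly, so everything works.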
3) Add the check_nrpe definition to commands.cfg:
vi /usr/local/nagios/etc/commands.cfg
Append the following at the end:
########################################################################
#
# 2007.9.5 add by yahoon
# NRPE COMMAND
#
########################################################################
# 'check_nrpe' command definition
define command{
    command_name check_nrpe
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
The meaning is as follows:
command_name check_nrpe
This defines the command name check_nrpe, which is the name to use in services.cfg.
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
This defines the plugin program that is actually run. The command line must follow the usage of the check_nrpe command exactly. If you do not know the usage, run check_nrpe -h.
The $ARG1$ argument after -c is the check command passed to the nrpe daemon for execution. As said before, it must be one of the five commands defined in nrpe.cfg. When using check_nrpe in services.cfg, pass this argument after a ! character.
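Now the disk capacity check for the apache host can be defined in services.cfg:
define service{
    host_name               apache
    service_description     check-disk
    check_command           check_nrpe!check_disk
    max_check_attempts      5
    normal_check_interval   3
    retry_check_interval    2
    check_period            24x7
    notification_interval   10
    notification_period     24x7
    notification_options    w,u,c,r
    contact_groups          sagroup
}
host_name is the monitored host; it must be a Linux machine running nrpe and must be defined in hosts.cfg. service_description is the name of the monitoring item. check_command uses check_nrpe, defined in commands.cfg, with the argument check_disk, which must be defined in nrpe.cfg on the monitored host. The sample nrpe.cfg shown earlier names its disk check check_hda1, so either add a command[check_disk] entry there or use check_hda1 here.
Add the remaining monitoring items in the same way.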
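Since the remaining items differ only in the NRPE command name, they can be generated rather than typed. This is a sketch: emit_nrpe_service is a hypothetical helper, it reuses the NRPE command name as the service description, and it assumes the check_nrpe command and the sagroup contact group defined earlier.

```shell
#!/bin/sh
# Print a service definition that runs one NRPE command on one host,
# with the same intervals and notification settings as the disk check.
emit_nrpe_service() {
    host="$1"
    nrpe_cmd="$2"
    cat <<EOF
define service{
    host_name               $host
    service_description     $nrpe_cmd
    check_command           check_nrpe!$nrpe_cmd
    max_check_attempts      5
    normal_check_interval   3
    retry_check_interval    2
    check_period            24x7
    notification_interval   10
    notification_period     24x7
    notification_options    w,u,c,r
    contact_groups          sagroup
}
EOF
}

# Possible usage, covering the other four commands from nrpe.cfg:
#   for cmd in check_users check_load check_zombie_procs check_total_procs; do
#       emit_nrpe_service apache "$cmd"
#   done >> /usr/local/nagios/etc/services.cfg
```

After appending the output, rerun the nagios -v check before restarting Nagios.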
Appendix: fixing a status map that does not display:
(1) Enter the Nagios source directory:
cd /srv/nagios/nagios-3.1.0/
(2) Run: make devclean
(3) Run configure again:
./configure --prefix=/usr/local/nagios --with-gd-lib=/usr/local/libgd/lib --with-gd-inc=/usr/local/libgd/include
(4) make all ; make install