A small problem encountered with Nagios: how to use timeperiod

sky_max, 2008-12-04 10:32:07. Category: Nagios

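© Copyright belongs to the author. This is an original work by sky_max, a 51CTO blog author. Please contact the author for permission to repost; unauthorized reposting may be pursued legally.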

1. Introduction: how the problem arose

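The problem arose as follows. My application server (App server) runs a business system on WebLogic middleware. For security reasons, this system has to be shut down after office hours and started again when work begins the next day. Implementing this is not complicated: write two shell scripts, one to start the WebLogic service and one to stop it (that is, to kill the corresponding java processes), then add both to crontab with the appropriate schedule so that they run at the set times.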
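A minimal sketch of what such a start/stop script and its crontab entries might look like. Everything specific here is an assumption rather than the original setup: the path in WLS_START_CMD, the process pattern in WLS_PATTERN, and the 08:00 and 17:00 schedule (taken from the working hours used later in this post).

```shell
#!/bin/sh
# Hypothetical sketch; every path and pattern below is an assumption.
WLS_START_CMD="${WLS_START_CMD:-/path/to/domain/startWebLogic.sh}"  # assumed location
WLS_PATTERN="${WLS_PATTERN:-weblogic.Server}"  # assumed text in the java command line

# Start WebLogic in the background so that cron does not wait for it.
wls_start() {
    nohup "$WLS_START_CMD" >/dev/null 2>&1 &
}

# Stop WebLogic by killing the java processes whose command line matches.
wls_stop() {
    pids=$(ps -ef | grep "$WLS_PATTERN" | grep -v grep | awk '{print $2}')
    if [ -n "$pids" ]; then
        kill $pids
    fi
}

# Print two crontab entries for the script given as $1:
# start at 08:00 and stop at 17:00, Monday to Friday.
wls_cron_entries() {
    printf '0 8 * * 1-5 %s start\n' "$1"
    printf '0 17 * * 1-5 %s stop\n' "$1"
}

# Dispatch on the first argument; do nothing when called without one.
case "${1:-}" in
    start) wls_start ;;
    stop)  wls_stop ;;
    cron)  wls_cron_entries "$0" ;;
esac
```

Running the script with the argument cron prints the two lines that can be pasted into crontab -e. The pattern only works if it is visible in the ps output on the platform in question, which is worth checking by hand first.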

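This takes care of stopping and starting the service on schedule, but it creates a new problem. I use Nagios to monitor the overall health of the system: not only host status, but also the status of the database, the middleware, and so on. Nagios monitors the whole system around the clock (very conscientiously). After office hours the WebLogic service has been stopped, which is the expected state, but Nagios still checks WebLogic, and the check predictably returns nothing useful (a CRITICAL state). Nagios then reports the CRITICAL state to me (I have configured email notifications), and my inbox fills up with junk mail that carries no useful information.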

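So how do we handle monitoring that is tied to a time window? Reading the official Nagios documentation carefully, we find an object definition called timeperiod, which controls the time ranges that are valid for checks and notifications. Below is a brief description of how I handled it.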

For how to monitor WebLogic with Nagios, see my other post, "Monitoring the WebLogic Service with Nagios" (http://skymax.blog.51cto.com/365901/101603).

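2. Solving the problem

2.1. Configuration details

Because there are many configuration files and each of them is long, I list only the settings relevant to this problem.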

· Service check configuration

# the check_wls_server_adminserver on the remote host.
define service{
    use                     generic-service
    host_name               HPUX_XX.XX.XX.XX
    service_description     Weblogic Server adminserver
    check_command           check_nrpe!check_wls_server_adminserver
}

· Definition of generic-service

###############################################################################
# SERVICE TEMPLATES
###############################################################################

# Generic service definition template - This is NOT a real service, just a template!
define service{
    name                            generic-service ; The 'name' of this service template
    active_checks_enabled           1       ; Active service checks are enabled
    passive_checks_enabled          1       ; Passive service checks are enabled/accepted
    parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1       ; We should obsess over this service (if necessary)
    check_freshness                 0       ; Default is to NOT check service 'freshness'
    notifications_enabled           1       ; Service notifications are enabled
    event_handler_enabled           1       ; Service event handler is enabled
    flap_detection_enabled          1       ; Flap detection is enabled
    failure_prediction_enabled      1       ; Failure prediction is enabled
    process_perf_data               1       ; Process performance data
    retain_status_information       1       ; Retain status information across program restarts
    retain_nonstatus_information    1       ; Retain non-status information across program restarts
    is_volatile                     0       ; The service is not volatile
    check_period                    24x7    ; The service can be checked at any time of the day
    max_check_attempts              3       ; Re-check the service up to 3 times in order to determine its final (hard) state
    normal_check_interval           10      ; Check the service every 10 minutes under normal conditions
    retry_check_interval            2       ; Re-check the service every two minutes until a hard state can be determined
    contact_groups                  admins  ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           60      ; Re-notify about service problems every hour
    notification_period             24x7    ; Notifications can be sent out at any time
}

· Definition of 24x7

# This defines a timeperiod where all times are valid for checks,
# notifications, etc. The classic "24x7" support nightmare. :-)
define timeperiod{
    timeperiod_name 24x7
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
}

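As the configuration above shows, the service I defined uses the generic-service template, and both check_period and notification_period in that template use the 24x7 timeperiod. The 24x7 timeperiod explicitly covers Monday through Sunday, 24 hours a day.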

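This is the crux of the problem: the timeperiod definition. If we change the service's check period (check_period) to the working hours we want (08:00 to 17:00, Monday to Friday), the problem is solved.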

2.2. Modifying the configuration files

· Define a new timeperiod.

# Some P.R.C. public holidays
define timeperiod{
    name            cn-holidays
    timeperiod_name cn-holidays
    alias           CN Holidays
    january 1       00:00-00:00     ; 1.1
    may 1           00:00-00:00     ; 5.1
    october 1       00:00-00:00     ; 10.1
}

# Work time
# Week: Monday to Friday
# Time: 8:00 to 17:00
define timeperiod{
    timeperiod_name cn_work_time_8x5
    alias           CN Work Time 8x5
    use             cn-holidays     ; use the cn-holidays template
    sunday          00:00-00:00
    monday          08:00-17:00
    tuesday         08:00-17:00
    wednesday       08:00-17:00
    thursday        08:00-17:00
    friday          08:00-17:00
    saturday        00:00-00:00
}
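
To make the intent of cn_work_time_8x5 concrete, here is a small sketch of the weekday rule as a shell function. It is an illustration only, not how Nagios evaluates timeperiods: the function name in_work_time is hypothetical, the holiday exceptions from cn-holidays are ignored, and treating 17:00 itself as outside the window is my assumption (Nagios may handle the boundary differently).

```shell
# in_work_time DOW HOUR MINUTE
# DOW is 0 (Sunday) to 6 (Saturday), as printed by `date +%w`.
# Returns 0 (true) inside Monday-Friday 08:00-17:00, 1 otherwise.
in_work_time() {
    dow=$1
    # Drop one leading zero so that values such as 08 are not read as octal.
    hour=${2#0};   hour=${hour:-0}
    minute=${3#0}; minute=${minute:-0}
    minutes=$(( $hour * 60 + $minute ))

    # Saturday and Sunday carry the empty range 00:00-00:00: never valid.
    if [ "$dow" -lt 1 ] || [ "$dow" -gt 5 ]; then
        return 1
    fi

    # Monday to Friday: 08:00 is minute 480, 17:00 is minute 1020.
    [ "$minutes" -ge 480 ] && [ "$minutes" -lt 1020 ]
}

# Example use against the current time:
#   if in_work_time "$(date +%w)" "$(date +%H)" "$(date +%M)"; then
#       echo "inside the 8x5 window"
#   fi
```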

· Create a new service template that uses the timeperiod just defined.

# 8x5 service definition template - This is NOT a real service, just a template!
define service{
    name                    generic-service-8x5 ; The name of this service template
    use                     generic-service     ; Inherit default values from the generic-service definition
    check_period            cn_work_time_8x5
    notification_period     cn_work_time_8x5
}

· Change the actual service definition to use the new template.

# the check_wls_server_adminserver on the remote host.
define service{
    use                     generic-service-8x5
    host_name               HPUX_XX.XX.XX.XX
    service_description     Weblogic Server adminserver
    check_command           check_nrpe!check_wls_server_adminserver
}

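The configuration changes are done; the next step is to verify them.

· First, verify that the configuration files are written correctly.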

bash-3.00$ ./nagios -v ../etc/nagios.cfg
Nagios 3.0.3
Last Modified: 06-25-2008
License: GPL
Reading configuration data...
Running pre-flight check on configuration data...
Checking services...
Checked 111 services.
Checking hosts...
Checked 7 hosts.
Checking host groups...
Checked 1 host groups.
Checking service groups...
Checked 1 service groups.
Checking contacts...
Checked 2 contacts.
Checking contact groups...
Checked 1 contact groups.
Checking service escalations...
Checked 0 service escalations.
Checking service dependencies...
Checked 0 service dependencies.
Checking host escalations...
Checked 0 host escalations.
Checking host dependencies...
Checked 0 host dependencies.
Checking commands...
Checked 25 commands.
Checking time periods...
Checked 7 time periods.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors: 0

Things look okay - No serious problems were detected during the pre-flight check

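Good, the configuration is fine. The next step is to restart the Nagios service. My operating system is Solaris 10, and I have set Nagios up as an SMF-managed service, so restarting it is easy.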

bash-3.00# svcadm restart nagios
bash-3.00# svcs nagios
STATE          STIME    FMRI
online          9:48:11 svc:/site/nagios:default
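
The check and the restart can also be chained so that Nagios is only restarted when the pre-flight check passes. The following is a sketch under assumptions: the helper name verify_then_restart is hypothetical, and it relies on nagios -v exiting with a non-zero status when the check fails.

```shell
# verify_then_restart NAGIOS_BIN CONFIG RESTART_COMMAND...
# Runs "NAGIOS_BIN -v CONFIG" and executes the restart command only
# if that check exits successfully.
verify_then_restart() {
    nagios_bin=$1
    cfg=$2
    shift 2
    if output=$("$nagios_bin" -v "$cfg" 2>&1); then
        "$@"
    else
        # Show the output of the failed check and leave the service alone.
        printf '%s\n' "$output" >&2
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Example, matching the commands used in this post:
#   verify_then_restart ./nagios ../etc/nagios.cfg svcadm restart nagios
```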

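2.3. Verification

Watch the monitoring results, mainly to see whether alerts are still sent outside working hours. No more useless junk mail has arrived in my inbox, so the problem is solved.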
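
Besides waiting for the inbox to stay quiet, one can look at which periods Nagios actually resolved for the service. The sketch below reads the object cache file that Nagios writes at startup. It assumes the file consists of "define service {" blocks with one key and value per line; the function name service_periods and the path in the usage example are assumptions, and the real location is whatever object_cache_file is set to in nagios.cfg.

```shell
# service_periods CACHE_FILE SERVICE_DESCRIPTION
# Prints the check_period and notification_period of every service
# block whose service_description matches exactly.
service_periods() {
    awk -v wanted="$2" '
        /^define service[ \t]*[{]/ {
            in_svc = 1; desc = ""; chk = ""; ntf = ""; next
        }
        in_svc && /^[ \t]*[}]/ {
            if (desc == wanted)
                print "check_period=" chk " notification_period=" ntf
            in_svc = 0; next
        }
        in_svc {
            line = $0
            sub(/^[ \t]+/, "", line)               # strip indentation
            key = line; sub(/[ \t].*$/, "", key)   # first word is the key
            val = line; sub(/^[^ \t]+[ \t]+/, "", val)  # the rest is the value
            if (key == "service_description") desc = val
            else if (key == "check_period") chk = val
            else if (key == "notification_period") ntf = val
        }
    ' "$1"
}

# Example (the path is an assumption):
#   service_periods /usr/local/nagios/var/objects.cache "Weblogic Server adminserver"
```

If the template change took effect, both values should read cn_work_time_8x5 for the WebLogic service.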

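3. Conclusion

The above is a specific problem I ran into while using Nagios for monitoring, together with the process and method of solving it. Because monitored environments are complex and changeable, all kinds of special problems and special requirements come up when using Nagios. Fortunately, the overall design of Nagios is quite powerful, and most problems can be solved. Of course, if you have time, read the official Nagios documentation carefully; you will benefit a great deal from it.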
