Wednesday, September 09, 2009

Nagios Event Handlers - Love them

What is Nagios? Nagios is, IMHO, the best open source monitoring system out there. It supports host checks, which determine whether a box is considered "up", and service checks, which determine whether a particular service such as MySQL is up. It can log all events to a flat file or to a DB, and it can notify you when a service is in a warning, critical, or unknown state.

For the purpose of this article, I am going to talk about handling events, such as clearing up swap.

First, let us look at some Nagios configuration. We are going to define a command, then a service acting on that command. Let us assume that Nagios is installed in /usr/local/nagios.

Under /usr/local/nagios/, two configuration files are key:
- /usr/local/nagios/etc/objects/commands.cfg - the command file, where the check commands are defined
- /usr/local/nagios/etc/hosts/*/hosts.cfg - the per-host file, where services are defined: which checks run, and when, based on the other directives in the file


A command:

# 'check_local_swap' command definition
define command{
command_name check_local_swap
command_line $USER1$/check_swap -w $ARG1$ -c $ARG2$
}


This says that check_local_swap executes check_swap with a warning threshold of $ARG1$ and a critical threshold of $ARG2$.
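
As a rough illustration of how these macros get filled in, the sketch below splits a check_command string on "!" and substitutes the pieces into the command_line. It is a simplified Python model, not Nagios code: the expand_command helper is hypothetical, the direct check_local_swap!80%!55% invocation is only for illustration, and the plugin directory used for $USER1$ is an assumption (it is normally set in resource.cfg).

```python
def expand_command(check_command, command_lines, user1):
    """Simplified model of Nagios macro expansion for $USER1$ and $ARGn$."""
    # The command name and its arguments are separated by "!".
    name, *args = check_command.split("!")
    line = command_lines[name].replace("$USER1$", user1)
    # $ARG1$ is the first argument after the command name, and so on.
    for i, value in enumerate(args, start=1):
        line = line.replace(f"$ARG{i}$", value)
    return line


# Command definition taken from the commands.cfg snippet above.
COMMANDS = {
    "check_local_swap": "$USER1$/check_swap -w $ARG1$ -c $ARG2$",
}

if __name__ == "__main__":
    # Assumed plugin directory; adjust to match your resource.cfg.
    print(expand_command("check_local_swap!80%!55%", COMMANDS,
                         "/usr/local/nagios/libexec"))
```

With check_swap, the -w and -c values are the percentage of swap that must remain free, so the warning threshold is the larger of the two numbers.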


Next, define a service for a host:

define service{
use generic-service ; name of the service template to use
host_name dbfacebook34b ; hostname
service_description SYS:Swap ; what shows up in alerts
is_volatile 0
check_period 24x7 ; time period during which checks run (all the time)
max_check_attempts 4 ; failed checks needed before the state becomes hard
event_handler handle-swap ; command to run on state events (defined below)
normal_check_interval 5 ; time units between checks (a unit is interval_length, 60 seconds by default)
retry_check_interval 1 ; time units between rechecks while a problem is being confirmed
contact_groups itops ; contact group to send notifications to
notification_options w,u,c,r ; notify on warning, unknown, critical, and recovery
notification_interval 600 ; time units between repeat notifications while the problem persists
notification_period 24x7 ; time period during which notifications may be sent
check_command check_nrpe!check_local_swap!80%!55% ; run the swap check through NRPE with thresholds 80% and 55%
}



Lots of goodies, as you can see. Let us look at the event handler:

define command{
command_name handle-swap
command_line /home/scripts/handle_swap.pl
}


This tells Nagios to execute the script whenever it registers a state event for the swap service: a problem being detected or confirmed, or a recovery. (I decided to keep this simple and not put a threshold on it.)
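
For comparison, a handler that did apply a threshold would usually be passed the service state as arguments, for example by appending the Nagios macros $SERVICESTATE$ and $SERVICESTATETYPE$ to the command_line. The Python sketch below shows one way such a filter might look. It is not the handler used here; the should_act function and its policy (act only on a HARD WARNING or CRITICAL) are assumptions for illustration.

```python
import sys


def should_act(state, state_type):
    """Return True only for a confirmed (HARD) WARNING or CRITICAL state."""
    if state not in ("WARNING", "CRITICAL"):
        return False  # OK and UNKNOWN: nothing to clean up
    return state_type == "HARD"


if __name__ == "__main__":
    # Expected arguments: <state> <state_type>, as expanded by Nagios.
    if len(sys.argv) == 3:
        print("act" if should_act(sys.argv[1], sys.argv[2]) else "skip")
```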


What does handle_swap.pl do? It is a Perl script that looks at free memory and, if only a few hundred KB of swap is in use, runs swapoff -a; swapon -a to push the swapped pages back into RAM.
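
The original Perl script is not shown, so here is a minimal Python sketch of the same idea, under stated assumptions: it reads /proc/meminfo, treats 512 kB as the "few hundred KB" cutoff (a made-up default), and requires MemFree to exceed the swap in use. The function names are hypothetical, and nothing is changed unless the script is run as root with --apply.

```python
import subprocess
import sys


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def should_reset_swap(info, max_swap_used_kb=512):
    """True when a little swap is in use and free RAM can absorb it."""
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return 0 < swap_used <= max_swap_used_kb and info["MemFree"] > swap_used


if __name__ == "__main__" and sys.argv[1:] == ["--apply"]:
    with open("/proc/meminfo") as f:
        meminfo = parse_meminfo(f.read())
    if should_reset_swap(meminfo):
        # Requires root: swapoff pulls swapped pages back into RAM,
        # then swapon re-enables the swap devices.
        subprocess.run(["swapoff", "-a"], check=True)
        subprocess.run(["swapon", "-a"], check=True)
```

Keeping the decision in a separate function makes it easy to test against a saved copy of /proc/meminfo before letting the handler touch a production box.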

In this case, it is reasonably safe to do this. Why do this? Why not just turn off swap? I have talked in depth about this subject before, but as a minor recap: Linux needs swap, or else kswapd will freak out. Swap on DBs is bad, so I clean it up automatically, since O_DIRECT on my SAN is not an option.

Why not just run a cron job? Nagios keeps a log, I like to review what is happening from a central location, and Nagios is freaking COOL.

4 comments:

Anonymous said...

Linux without swap works fine and always did. No idea where that "freak" came from.

Dathan Vance Pattishall said...

How did you configure Linux without swap? Every time I turn off swap, kswapd chews up a ton of CPU resources, so much so that it puts mysql in a run queue.

Anonymous said...

Nice write up !

Linux without swap can work, but it'll depend on your workload, amount of memory, kernel version...

Apart from the cool factor, and the obvious educational interest, this solution looks a bit complex for a production system. Indeed, using cron seems (to me) like a better (simpler, fewer dependencies) choice. If you want central "reviewability" (did I just make that word up?), use syslog (which in production should always be configured to log to a central node anyway, right?)!

Related also: the --memlock option and the /proc/sys/vm/swappiness tunable.

Kaiser Beto said...

/proc/sys/vm/swappiness is just a "suggestion" to the kernel, setting it to 0 doesn't prevent the OS from swapping... memlock produces very unpredictable results; it's a roll of the dice :) Large pages might be something interesting to try next ...