Wednesday, September 09, 2009

Nagios Event Handlers - Love them

What is Nagios? Nagios, IMHO, is the best open source monitoring system out there. It supports host checks, which determine whether a box is considered "up", and service checks, which determine whether a particular service such as MySQL is up. It can log all events to a flat file or to a DB, and it can notify you when a service is in a warning, critical, or unknown state.

For the purpose of this article, I am going to talk about handling events, such as clearing up swap.

First, let us look at some Nagios configuration. We are going to define a command, then a service acting on that command. Let us assume that Nagios is installed in /usr/local/nagios.

Therefore, in /usr/local/nagios/ a few configuration files are key:
- /usr/local/nagios/etc/objects/commands.cfg - the command file, where the check commands are defined
- /usr/local/nagios/etc/hosts/*/hosts.cfg - the services file, where the checks are set up for execution against a host, based on the other directives in this file

A command:

# 'check_local_swap' command definition
define command{
        command_name    check_local_swap
        command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
        }

This says that check_local_swap executes check_swap with a warning threshold of $ARG1$ and a critical threshold of $ARG2$.
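To make the two thresholds concrete, here is a minimal Python sketch of how percentage thresholds map to plugin exit codes. It is not the check_swap source; the function name swap_state is made up for illustration, and it assumes that percentage thresholds refer to free swap and that the alert fires when free swap drops below the value. The exact boundary handling in the real plugin may differ.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def swap_state(free_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map the percentage of free swap to a Nagios plugin exit code.

    Hypothetical helper: assumes thresholds are percentages of FREE swap
    and that a state is raised when free swap falls below the threshold.
    """
    if not 0 <= free_pct <= 100:
        return UNKNOWN   # nonsensical reading
    if free_pct < crit_pct:
        return CRITICAL  # checked first: critical outranks warning
    if free_pct < warn_pct:
        return WARNING
    return OK
```

With the 80% and 55% values used in the service definition, swap_state(70, 80, 55) would be a warning and swap_state(40, 80, 55) would be critical.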

Next, define a service for a host:

define service{
        use                     generic-service ; name of service template to use
        host_name               dbfacebook34b   ; hostname
        service_description     SYS:Swap        ; what shows up in alerts
        is_volatile             0
        check_period            24x7            ; when to check (all the time)
        max_check_attempts      4               ; number of checks before the state becomes hard
        event_handler           handle-swap     ; handle an event (another command)
        normal_check_interval   5               ; in interval_length units (60 s each by default, so minutes)
        retry_check_interval    1               ; recheck interval while the problem is not yet a hard state
        contact_groups          itops           ; contact group to send notifications to
        notification_options    w,u,c,r         ; warning, unknown, critical, recovery
        notification_interval   600             ; re-notify interval, in interval_length units
        notification_period     24x7            ; keep sending them
        check_command           check_nrpe!check_local_swap!80%!55% ; run the swap check over NRPE and warn like hell
        }

Lots of goodies, as you can see. Let us look at the event handler:

define command{
        command_name    handle-swap
        command_line    /home/scripts/
        }

This means: execute this script whenever any event for swap occurs (I decided to keep this simple and not put a threshold on it).
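If you did want a threshold on the handler, Nagios can pass the service state to it through macros on the command line. The sketch below is one possible policy, not what my handler does: the script path in the comment, the function names, and the "act from the second soft attempt" rule are all assumptions for illustration. The macros $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ are standard Nagios macros.

```python
import sys

# Hypothetical command definition that would feed this script:
#   command_line /path/to/handle_swap.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$


def should_act(state: str, state_type: str, attempt: int) -> bool:
    """Decide whether the handler should try to clear swap (assumed policy)."""
    if state not in ("WARNING", "CRITICAL"):
        return False      # OK or UNKNOWN: nothing to clean up
    if state_type == "HARD":
        return True       # problem confirmed after max_check_attempts
    return attempt >= 2   # SOFT: ignore the first blip, act on a retry


def main(argv: list) -> int:
    """Parse STATE STATETYPE ATTEMPT from argv and report the decision."""
    if len(argv) != 4:
        print("usage: handler STATE STATETYPE ATTEMPT", file=sys.stderr)
        return 2
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    print("act" if should_act(state, state_type, attempt) else "skip")
    return 0
```

To use it as a script, call sys.exit(main(sys.argv)) under the usual __main__ guard and replace the print with the real cleanup.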

What does the script do? Well, it’s a Perl script that looks at free memory and, if only a few hundred KB of swap is in use, runs swapoff -a; swapon -a;
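The Perl script itself is not shown here, so the following is only a Python sketch of the logic just described, under stated assumptions: the 512 KB cutoff stands in for "a few 100K", the rule that free RAM must exceed the swap in use is my reading of "looks at free memory", and all function names are made up. It reads the standard /proc/meminfo fields SwapTotal, SwapFree and MemFree.

```python
import subprocess

SMALL_SWAP_KB = 512  # assumed cutoff for "a few 100K" of swap in use


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def should_flush(info: dict, limit_kb: int = SMALL_SWAP_KB) -> bool:
    """True when a little swap is in use and free RAM can absorb it."""
    used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return 0 < used <= limit_kb and info.get("MemFree", 0) > used


def flush_swap(dry_run: bool = True) -> bool:
    """Cycle swap off and on if should_flush() agrees. Needs root to act."""
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    if not should_flush(info):
        return False
    if not dry_run:
        subprocess.run(["swapoff", "-a"], check=True)
        subprocess.run(["swapon", "-a"], check=True)
    return True
```

flush_swap() defaults to a dry run, so nothing is touched until you call flush_swap(dry_run=False) as root.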

In this case, it is fairly safe to do this. Why do this? Why not just turn off swap? I have talked in depth about this subject before, but here is a minor recap: Linux needs swap, or else kswapd will freak out. Swap on DBs is bad, so I clean it up automatically, since O_DIRECT on my SAN is not an option.

Why not just run a cron job? Nagios keeps a log, I like to review what is happening from a central location, and Nagios is freaking COOL.


Anonymous said...

Linux without swap works fine and always did. No idea where that "freak" came from.

Dathan Vance Pattishall said...

How did you configure Linux without swap? Every time I turn off swap, kswapd chews up a ton of CPU resources, so much so that it puts mysql in a run queue.

Anonymous said...

Nice write up !

Linux without swap can work, but it'll depend on your workload, amount of memory, kernel version...

Apart from the cool factor, and the obvious educational interest, this solution looks a bit complex for a production system - indeed using cron seems (to me) like a better (simpler, fewer dependencies) choice. If you want central "reviewability" (did I just make that word up?) use syslog (which in production should always be configured to log to a central node anyway, right?)!

Related also: the --memlock option and the /proc/sys/vm/swappiness tunable.

Kaiser Beto said...

/proc/sys/vm/swappiness is just a "suggestion" to the kernel, setting it to 0 doesn't prevent the OS from swapping... memlock produces very unpredictable results; it's a roll of the dice :) Large pages might be something interesting to try next ...