Friday, July 13, 2012

A Case for Automation of OpenNMS for XTRADB Backups with Python

If you read my blog from time to time, you know I am a huge advocate of automating tasks. If I have to do something more than once, I might as well automate it.
Currently at my new gig, we are using OpenNMS. OpenNMS is like a combination of Nagios and Ganglia, talking over SNMP, with lots of Java goodness. They recently released a REST API that is less than adequately documented, making integration rather difficult. But I am jumping ahead; let's talk about the problem.

Backing up an InnoDB slave is rather easy with Percona's xtrabackup program. It's free, open source, works well, and offers better features than its paid counterparts.

In my environment my algorithm is as follows:

Mount a huge disk over NFS which gets snapshotted
At the start of the day (00:00:00 UTC) do a full backup.
The command used: innobackupex --user=root --password='***' $backupDir --parallel=8 --slave-info
At the top of every hour do an incremental backup if a full backup does not exist
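As a sketch, the full-vs-incremental decision above could be written like this (backup_mode is a hypothetical helper, not part of my actual cron setup):

```python
from datetime import datetime

def backup_mode(now=None):
    # Full backup at the start of the UTC day, incremental every
    # other top of the hour -- the schedule described above.
    now = now or datetime.utcnow()
    if now.hour == 0:
        return 'full'
    return 'incremental'
```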

The problem is that to do an incremental backup I use the following command:

my $INNOBACKUP="innobackupex --user=root --password='***' $incrementalDir --parallel=8 --incremental --slave-info --safe-slave-backup --incremental-basedir=";

$INNOBACKUP .= $lastIncremental;

--incremental says do an incremental backup

--slave-info says dump the slave info

--safe-slave-backup says stop the slave and restart it when the backup finishes *THE PROBLEM*

--incremental-basedir is the last successful incremental directory
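For readers who prefer Python, the Perl command string above maps to something like this (incremental_command is a hypothetical helper; the flags are copied verbatim from the command above):

```python
def incremental_command(incremental_dir, last_incremental, password='***'):
    # Mirrors the Perl string: each flag maps one-to-one.
    # --safe-slave-backup stops the slave for the duration of the
    # backup, which is what triggers the monitoring alert.
    return ' '.join([
        'innobackupex',
        '--user=root',
        "--password='%s'" % password,
        incremental_dir,
        '--parallel=8',
        '--incremental',
        '--slave-info',
        '--safe-slave-backup',
        '--incremental-basedir=' + last_incremental,
    ])
```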

The problem is that I would get alerted every time the incremental runs, forcing me to acknowledge the alert.

This is annoying, so let's fix it with automation. (If the slave is off, it's okay; I am backing up the DR.)
So the new algorithm is:

Set downtime over the OpenNMS REST API
Remove downtime over the OpenNMS REST API
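The two-step algorithm can be sketched as a wrapper around the backup command. The -t (hours of downtime) and -d (delete the outage) flags match the OptionParser flags defined in the downtime script; the script path and function name here are assumptions:

```python
import subprocess

# Assumed install path for the downtime script
DOWNTIME_SCRIPT = '/opt/scripts/opennms_downtime.py'

def run_backup_with_downtime(backup_cmd, hours=1, run=subprocess.check_call):
    # 1) schedule downtime so the stopped slave does not page anyone
    run(['python', DOWNTIME_SCRIPT, '-t', str(hours)])
    try:
        # 2) run the actual innobackupex command
        run(backup_cmd, shell=True)
    finally:
        # 3) remove the outage whether or not the backup succeeded
        run(['python', DOWNTIME_SCRIPT, '-d'])
```

The injectable run parameter just makes the sketch easy to dry-run without touching OpenNMS.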

Setting downtime for OpenNMS is hard since the documentation is not helpful. The good news is that I was able to find some code online to see how things worked. OpenNMS is an open-source product with code viewable from FishEye. Any time you want to figure out how an API works, look at the test code for said API or trace the API's twists and turns in the source; I did just that.

But nothing I found told me how to authenticate to use the REST API. After searching around and looking at the code shipped with OpenNMS, I found a Perl script in /opt/opennms/bin: it uses COOKIES!

Good thing I am good with Perl, so I went ahead and tried some simple fetches, until I realized that getting LWP, HTTP, and various other Perl modules including ISBN::Data installed was near impossible. Making an RPM for each one is just a huge waste of time, and forcing CPAN installs across N boxes sucks, not to mention it's just plain WRONG.

So I looked at my options: Python is installed by CentOS by default and has built-in HTTP libraries like urllib2, httplib, and cookielib. Perfect.
Below is the script


# @author Dathan Vance Pattishall
# OpenNMS Schedule downtime script

import urllib, urllib2, cookielib, pprint, os, sys
import xml.etree.ElementTree as ET
import time
from datetime import date, tzinfo, timedelta, datetime
from optparse import OptionParser

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="make lots of noise [default]")
parser.add_option("-t", "--downtime", dest="downtime", help="length of downtime")
parser.add_option("-c", "--contains", dest="contains", help="Schedule downtime for all nodes that contain this string")
parser.add_option("-l", "--like", dest="like", help="get servers with a wild card e.g. shard%-dr")
parser.add_option("-d", "--delete", action="store_true", dest="delete", default=False, help="delete an outage")
parser.add_option("-p", "--package",  dest="package", default="SFO", help="which nms package")

(options, args) = parser.parse_args()

base_url = 'https://enteropennmshosthere/opennms/'
auth_url = base_url + 'j_spring_security_check'
nodes_url = base_url + 'rest/nodes/'
sched_outage_url = base_url + 'rest/sched-outages/'

username = 'outagerole'
password = 'add_pass_here'

# get the hostname
hostname  = os.environ['HOSTNAME']
host_abbr = os.environ['HOSTNAME'].split('.')[0]

if not host_abbr :
    print("Environment is not setup correctly")
    sys.exit(1)

# set up the cookie jar
cj = cookielib.CookieJar()

# build_opener returns an OpenerDirector; install it globally so the
# cookie jar works with urlopen for all following requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# log in and set the cookie
login_data = urllib.urlencode({'j_username' : username, 'j_password' : password, 'Login': 'Login'})
resp = urllib2.urlopen(auth_url, login_data)

# get the nodes

if options.contains :

    url_data = { 'label' : options.contains, 'limit' : 0, 'comparator' : 'contains' }
    url_data = urllib.urlencode(url_data)
    resp = urllib2.urlopen(nodes_url + '?' + url_data)
    outagename = options.contains + "-contains-script-outage"

elif options.like :

    url_data = { 'label' : options.like + '%', 'limit' : 0, 'comparator' : 'ilike' }
    url_data = urllib.urlencode(url_data)
    resp = urllib2.urlopen(nodes_url + '?' + url_data)
    outagename = options.like.replace('%', 'WC') + "-ilike-script-outage"

else :
    # get the hostname info to see if the host is in opennms
    url_data = { 'label' : hostname, 'limit' : 0 }
    url_data = urllib.urlencode(url_data)
    resp = urllib2.urlopen(nodes_url + '?' + url_data)
    outagename = host_abbr + "-script-outage"

# tree is an Element
tree = ET.XML(resp.read())

# build a label name to id map
name_to_id_map = {}

for node in tree.getiterator('node') :
    if node.get('label') :
        name_to_id_map[node.get('label')] = node.get('id')

if not name_to_id_map and options.contains:
    print("This HOST [%s] is not in nms" % options.contains)
    sys.exit(1)

# delete an outage - really the outage name is all that is needed
if options.delete :
    try :
        print("Deleting Outage: %s" % outagename)
        req3 = urllib2.Request(sched_outage_url + outagename)
        req3.add_header('Content-Type', 'application/xml')
        req3.add_header('Content-Length', '0')
        req3.get_method = lambda: 'DELETE'
        r = urllib2.urlopen(req3)

    except urllib2.HTTPError :
        print("This Outage: %s was already deleted" % outagename)

    sys.exit(0)


# schedule downtime
print("Scheduling downtime for Outage Name %s" % outagename)

start = datetime.now()
downtime = 1 # default: 1 hour

if options.downtime :
    downtime = int(options.downtime) # units of hours

end = start + timedelta(hours=downtime)

end = end.strftime('%d-%b-%Y %H:%M:%S')
start = start.strftime('%d-%b-%Y %H:%M:%S')

print("Start of the downtime: %s" % start)
print("End of the downtime: %s" % end)

# build the request
req = urllib2.Request(sched_outage_url)
req.add_header('Content-Type', 'application/xml')

xml_str = "<outage name='" + outagename + "' type='specific'>" + "<time begins='" + str(start) + "' ends='" + str(end) + "' />"

for nodename in name_to_id_map :
    xml_str += "<node id='" + name_to_id_map[nodename] + "' />"

xml_str += "</outage>"

# send the XML as the POST body
r = urllib2.urlopen(req, xml_str)

# tell notifd to attach to the downtime
req2 = urllib2.Request(sched_outage_url + outagename + '/notifd')
req2.add_header('Content-Type', 'application/xml')
req2.add_header('Content-Length', '0')
req2.get_method = lambda: 'PUT'
r = urllib2.urlopen(req2)

# tell pollerd to attach to the downtime
req3 = urllib2.Request(sched_outage_url + outagename +'/pollerd/' + options.package)
req3.add_header('Content-Type', 'application/xml')
req3.add_header('Content-Length', '0')
req3.get_method = lambda: 'PUT'
r = urllib2.urlopen(req3)


This is a quick and dirty script which I will eventually turn into a class to control OpenNMS from the command line. In summary: it logs in with the specified user, gets a cookie, and issues commands pulled from reading the Java code (good thing I can code well in Java too). My main problem was discovering that there are unpublished filters like comparator => contains, and finding that the POST structure for sched-outages is not key/value params but XML!!

public void testSetOutage() throws Exception {
    String url = "/sched-outages";
    String outage = "<?xml version=\"1.0\"?>" +
            "<outage name='test-outage' type='specific'>" +
            "<time day='friday' begins='13:20:00' ends='15:30:00' />" +
            "<time begins='17-Feb-2012 19:20:00' ends='18-Feb-2012 22:30:00' />" +
            "<node id='11' />" +
            "</outage>";
    sendPost(url, outage);
}
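Given the structure in that test, a safer alternative to string concatenation for building the payload would be ElementTree (build_outage_xml is a hypothetical helper, not part of the script above):

```python
import xml.etree.ElementTree as ET

def build_outage_xml(name, start, end, node_ids):
    # <outage> containing a <time> window and one <node> per id,
    # mirroring the structure shown in the Java test above
    outage = ET.Element('outage', {'name': name, 'type': 'specific'})
    ET.SubElement(outage, 'time', {'begins': start, 'ends': end})
    for node_id in node_ids:
        ET.SubElement(outage, 'node', {'id': str(node_id)})
    return ET.tostring(outage).decode('ascii')
```

This also gets attribute quoting and escaping right for free, which the hand-built string does not.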

In summary, everything works well. Backups are working and I am happy not to be paged. Eventually, when I get around to it, I'll upload this script to git.
