Friday, July 13, 2012

A Case for Automation of OpenNMS for XTRADB Backups with Python

If you read my blog from time to time, you know I am a huge advocate of automating tasks. If I have to do something more than once, I might as well automate the task.
Currently at my new gig, we are using OpenNMS. OpenNMS is a combination of Nagios- and Ganglia-style monitoring talking over SNMP, with Java goodness. They recently released a REST API that is less than adequately documented, making integration rather difficult. But I am jumping ahead; let's talk about the problem.


Backing up an InnoDB slave is rather easy with Percona's xtrabackup program. It's free, it works, and it offers better features than its paid counterpart. Additionally, it's open source.

In my environment my algorithm is as follows:

Mount a huge disk over NFS, which gets snapshotted
At the start of the day (00:00:00 UTC), do a full backup.
The command used: innobackupex --user=root --password='***' $backupDir --parallel=8 --slave-info
At the top of every hour, do an incremental backup if a full backup does not exist
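One reading of the schedule above, sketched as a small helper (the cron wiring and the on-disk check for an existing full backup are hypothetical and not shown):

```python
from datetime import datetime

def backup_type(now, full_exists_today):
    """Pick the backup to take at the top of the hour.

    One reading of the schedule above: a full backup at 00:00 UTC
    (or whenever no full exists yet to serve as a base), otherwise
    an incremental on top of the last backup.
    """
    if now.hour == 0 or not full_exists_today:
        return "full"
    return "incremental"

# At the start of the UTC day, always take a full backup.
print(backup_type(datetime(2012, 7, 13, 0, 0), True))   # full
# Mid-day, with a full already on disk, take an incremental.
print(backup_type(datetime(2012, 7, 13, 14, 0), True))  # incremental
```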

The problem is that to do an incremental backup, I use the following command:


my $INNOBACKUP="innobackupex --user=root --password='***' $incrementalDir --parallel=8 --incremental --slave-info --safe-slave-backup --incremental-basedir=";

$INNOBACKUP .= $lastIncremental;

--incremental says do an incremental backup

--slave-info says dump the slave info

--safe-slave-backup says stop the slave and restart it when the backup finishes *THE PROBLEM*

--incremental-basedir is the last successful incremental directory
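Those flags can also be assembled as an argument list rather than one big string, so the password never passes through a shell; a sketch (the directory paths here are made-up examples):

```python
def incremental_cmd(incremental_dir, last_incremental, password):
    # Same flags as the Perl $INNOBACKUP string above, as an argv list
    # suitable for subprocess.call(cmd) without shell=True.
    return [
        "innobackupex",
        "--user=root",
        "--password=" + password,
        incremental_dir,
        "--parallel=8",
        "--incremental",
        "--slave-info",
        "--safe-slave-backup",  # stops the slave: the source of the alerts
        "--incremental-basedir=" + last_incremental,
    ]

cmd = incremental_cmd("/backups/inc", "/backups/inc/2012-07-13_13-00-01", "***")
```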

The problem is that I would get alerted every time the incremental runs, forcing me to acknowledge the alert.

This is annoying, so let's fix it with automation. (If the slave is off, it's okay; I am backing up the DR.)
So the new algorithm is:

Mount
Set downtime over the OpenNMS REST API
Backup
Remove downtime over the OpenNMS REST API
Unmount
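The five steps can be sketched as an ordered plan of commands; the mount point and the downtime script's name are hypothetical stand-ins, and in practice each entry would be run in turn with subprocess, aborting on the first failure:

```python
def backup_plan(downtime_hours=2):
    # Ordered commands for one run of the new algorithm. Each entry
    # would be executed via subprocess.check_call(cmd).
    return [
        ["mount", "/mnt/backup"],
        ["opennms_downtime.py", "--downtime", str(downtime_hours)],
        ["innobackupex", "--user=root", "--password=***", "/mnt/backup",
         "--parallel=8", "--slave-info", "--safe-slave-backup"],
        ["opennms_downtime.py", "--delete"],
        ["umount", "/mnt/backup"],
    ]

# The program names, in order: downtime goes up before the backup
# starts and comes down after it finishes.
steps = [cmd[0] for cmd in backup_plan()]
```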


Setting downtime for OpenNMS is hard since the documentation is not helpful. The good news is that OpenNMS is an open-source product with code viewable on Fisheye, so I was able to find some code online to see how things worked. Anytime you want to figure out how an API works, look at the test code for said API or trace the API's twists and turns; I did just that by reading this:


http://fisheye.opennms.org/browse/opennms/opennms-webapp/src/test/java/org/opennms/web/rest/ScheduledOutagesRestServiceTest.java?hb=true

But nothing I found told me how to authenticate to use the REST API. Searching around and looking at the code shipped with OpenNMS, I found a Perl script called provision.pl in /opt/opennms/bin: it uses COOKIES!

Good thing I am good with Perl, so I went ahead and tried some simple fetches, until I realized that getting LWP, HTTP and various other Perl modules including ISBN::Data was near impossible. Making an RPM for each one is a huge waste of time, and forcing CPAN installs across N boxes sucks, not to mention it's just plain WRONG.

So I looked at my options: Python is installed on CentOS by default and has built-in HTTP libraries like urllib2, httplib, cookielib, etc. Perfect.
Below is the script:

#!/usr/bin/python

#
# @author Dathan Vance Pattishall
# OpenNMS Schedule downtime script
#

import urllib, urllib2, cookielib, os, sys
import xml.etree.ElementTree as ET
from datetime import timedelta, datetime
from optparse import OptionParser

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="make lots of noise [default]")
parser.add_option("-t", "--downtime", dest="downtime", help="length of downtime")
parser.add_option("-c", "--contains", dest="contains", help="Schedule downtime for all nodes that contain this string")
parser.add_option("-l", "--like", dest="like", help="get servers with a wild card e.g. shard%-dr")
parser.add_option("-d", "--delete", action="store_true", dest="delete", default=False, help="delete and outage")
parser.add_option("-p", "--package",  dest="package", default="SFO", help="which nms package")



(options, args) = parser.parse_args()


base_url = 'https://enteropennmshosthere/opennms/'
auth_url = base_url + 'j_spring_security_check'
nodes_url = base_url + 'rest/nodes/'
sched_outage_url = base_url + 'rest/sched-outages/'




username = 'outagerole'
password = 'add_pass_here'

#
# get the hostname
#
hostname  = os.environ['HOSTNAME']
host_abbr = os.environ['HOSTNAME'].split('.')[0]

if not host_abbr :
    print("Environment is not setup correctly")
    sys.exit(1)

#
# set up the cookie jar
#
cj = cookielib.CookieJar()

#
# build_opener returns an OpenerDirector: http://docs.python.org/library/urllib2.html#urllib2.OpenerDirector
#
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

#
# now the cookie will work with urlopen as a global for all following requests
# 
urllib2.install_opener(opener)

#
# log-in and set the cookie
#
login_data = urllib.urlencode({'j_username' : username, 'j_password' : password, 'Login': 'Login'})
opener.open(auth_url, login_data)

#
# get the nodes
#

if options.contains :

    url_data = { 'label' : options.contains, 'limit' : 0, 'comparator' : 'contains' }
    url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = options.contains + "-contains-script-outage"

elif options.like :

    url_data = { 'label' : options.like + '%', 'limit' : 0, 'comparator' : 'ilike' }
    url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = options.like.replace('%', 'WC') + "-ilike-script-outage"

else :
    #
    # get the hostname info to see if the host is in opennms
    #
    url_data = { 'label' : hostname, 'limit' : 0 }
    url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = host_abbr + "-script-outage"

#
# tree is an Element
# http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#element-objects
#
tree = ET.XML(resp.read())

#
# build a label name to id map
#
name_to_id_map = {}

for node in tree.getiterator('node') :
    if node.get('label') :
        name_to_id_map[node.get('label')] = node.get('id')

if not name_to_id_map :
    print("No matching nodes found in OpenNMS")
    sys.exit(1)

#
# delete an outage - really outage name is all that is needed 
#
if options.delete :
    try :
        print("Deleting Outage: %s" % outagename)
        req3 = urllib2.Request(sched_outage_url + outagename)
        req3.add_header('Content-Type', 'application/xml')
        req3.add_header('Content-Length', '0')
        req3.get_method = lambda: 'DELETE'
        r = urllib2.urlopen(req3)

    except urllib2.HTTPError :
        print("This Outage: %s was already deleted" % outagename)

    sys.exit(0)


#
# schedule downtime
#
start = datetime.today()
downtime = 1 # default: 1 hour

if options.downtime :
    downtime = int(options.downtime) # units of hours

print("Scheduling downtime for %d hour(s), Outage Name %s" % (downtime, outagename))


end = start + timedelta(hours=downtime)

end = end.strftime('%d-%b-%Y %H:%M:%S')
start = start.strftime('%d-%b-%Y %H:%M:%S')

print("Start of the downtime: %s" % start)
print("End of the downtime: %s" % end)

#
# build the request
#
req = urllib2.Request(sched_outage_url)
req.add_header('Content-Type', 'application/xml')


xml_str = "" + "<outage name=" + outagename + " type="specific">" + "<time begins=" + str(start) + " ends=" + str(end) + ">"

for nodename in name_to_id_map :
    xml_str += "<node id=" + name_to_id_map[nodename] + ">"

xml_str += "</node></time></outage>"

#
# send the XML
#
req.add_data(xml_str)
r = urllib2.urlopen(req)

#
# tell notifd to attach to the downtime
#
req2 = urllib2.Request(sched_outage_url + outagename + '/notifd')
req2.add_header('Content-Type', 'application/xml')
req2.add_header('Content-Length', '0')
req2.get_method = lambda: 'PUT'
r = urllib2.urlopen(req2)

#
# tell pollerd to attach to the downtime
#
req3 = urllib2.Request(sched_outage_url + outagename +'/pollerd/' + options.package)
req3.add_header('Content-Type', 'application/xml')
req3.add_header('Content-Length', '0')
req3.get_method = lambda: 'PUT'
r = urllib2.urlopen(req3)

sys.exit(0)

This is a quick and dirty script which I will eventually turn into a class to control OpenNMS from the command line. In summary, it logs in with the specified user, gets a cookie, and issues commands pieced together from reading Java code (good thing I can code in Java as well). My main problems were discovering that there are unpublished filters like comparator => contains, and finding that the POST structure for sched-outages was not key/value params but XML!

public void testSetOutage() throws Exception {
       String url = "/sched-outages";
       String outage = "<?xml version=\"1.0\"?>" +
               "<outage name='test-outage' type='specific'>" +
               "<time day='friday' begins='13:20:00' ends='15:30:00' />" +
               "<time begins='17-Feb-2012 19:20:00' ends='18-Feb-2012 22:30:00' />" +
               "<node id='11' />" +
               "</outage>";
       sendPost(url, outage);
}
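String concatenation is exactly where my quoting went wrong; a sketch of building the same payload with the stdlib ElementTree, which handles attribute quoting for you (function name is mine, not part of the OpenNMS API):

```python
import xml.etree.ElementTree as ET

def outage_xml(name, begins, ends, node_ids):
    # Mirror the payload from the Java test: <time> and <node> are
    # self-closing children of <outage>.
    outage = ET.Element("outage", name=name, type="specific")
    ET.SubElement(outage, "time", begins=begins, ends=ends)
    for node_id in node_ids:
        ET.SubElement(outage, "node", id=node_id)
    return ET.tostring(outage)

payload = outage_xml("test-outage",
                     "17-Feb-2012 19:20:00", "18-Feb-2012 22:30:00", ["11"])
```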

In summary, everything works well. Backups are working, and I am happy to not be getting paged. Eventually, when I get around to it, I'll upload this script to git.
