Monday, May 13, 2013

How to pick indexes is the same for MongoDB as mySQL


I recently went to MongoDB Days, a conference about everything MongoDB in SF. Starting my career as a Systems Programmer then Web Developer, MySQL DB[Admin|Architect], to Software|System Architecture I like to keep an open mind about new technology and trends. When you work with a lot of different languages, and technology you find out that it’s basically the same Science from about 40 years ago.

An index in MongoDB is like an index in mySQL since a Btree is a Btree regardless of what application uses it. Just like with mySQL the best performance improvement for an application using MongoDB as a datastore is adding the correct indexes.


To create an index in MongoDB:

db.<tableName>.ensureIndex({ col#1:1, col#2:-1, col#3:1 }); // note 1 means ASC -1 means DESC

MongoDB follows the same left-most-prefix rule meaning

col#1, col#2, col#3 is an index
col#1, col#2 is an index
col#1 is an index

col#2, col#3 IS NOT AN INDEX

So, just like with mySQL for ONE compound index you get a total of THREE indexes if you follow the left-most-prefix rule, the columns from left to right (in order) in a compound index is an index.

MongoDB also gets a performance boost by using Covering indexes just like mySQL. What is a Covering index? Instead of reading from the datapage (or document store for MongoDB) which exists on disk your reading the data from the index which should be in memory for the most part. A common practice is to follow the left-most-prefix pattern, then add the columns which you are returning at the end of the compound index. For instance

SELECT photoId from Photos WHERE userId=? AND dateCreate=? AND privacy=?

The index in mySQL I would make is

ALTER TABLE Photos ADD INDEX `userId-dateCreate-privacy-photoId` (userId,dateCreate,privacy,photoId)

Thus following the left-most-prefix of a compound index I have an index on

userId, dateCreate, privacy, photoId
userId, dateCreate, privacy
userId, dateCreate
userId

and a Covering Index satisfied by the query above.

For mongoDB its the same

db.photos.ensureIndex({ userId: 1, dateCreate: 1, privacy: 1, photoId: 1});


So, in conclusion, understand the Computer Science of a Btree, Hash, LinkedList and you will understand how indexes work across technology and find that essentially it's the same. More info on indexes for mySQL can be found here.

Also note:

Explain in mongoDB is your friend just like Explain in mySQL

Tuesday, April 23, 2013

I'll be speaking at Percona Live April 24th in Ballroom F


WHEN YOU SCALE OUT HOW TO SCALE IN BY INCREASING DENSITY USING RAIDED SSD


Sharding-splitting data for a single database server onto many database servers is a method to scale horizontally and is needed to get more disk IOPS from a mechanical hard drive server architecture. A method that works yet has pitfalls, which this session talks about. The focus is what happens when Solid State Disk Drives replaces traditional mechanical hard drives (spinning metal) in a sharded environment and answers to questions like

How much more IOPS with SSD?

What Raid Levels and Controllers work SSD drives?

How do you migrate data from shards to increase density on SSD Shards?

Why Multi MySQL, Instances per SSD server is great.

How INNODB compression really helps in an SSD environment.

This is my 5th or 6th Presentation since the conferences began. I am looking forward to spending time with old friends and people who love to talk about scale, data, technical stuff-like you! Come see me when you get a chance. Note: I will post my slides here.

Shard-In after Sharding Out with SSD

Friday, March 22, 2013

Instant InnoDB by Matthew Reid Review

Instant InnoDB by PACKT publishing is 83 page book on getting a basic understanding of InnoDB. Enough of an understanding to know what questions to ask in the future. The book is not for intermediate or advance Innodb users. It has the basics to get you started. Some basics covered, defining ACID, a page on Innodb params for mySQL that run innodb as a few examples.

Overall I think its a good book to get you started. The problem is the book just doesn't flow. There is a section on benchmarking with a blurb about 4 tools. This book could use more meat in explaining what are good numbers or bad-as a result of the benchmark and how they relate to Innodb. Also these section do not explain what Status variables to look at in deriving how well Innodb is performing through these tests. Finally the section on INNODB STATUS, one of the more important monitors for Innodb doesn't have much explanation.  This book gives you a good idea on how to get the information but not enough on what this information means. Overall for a mini-reference, or introduction to Innodb this is a good place to start, its an easy read and you can finish it on four BART rides from Daily City to Montgomery Street in San Francisco. If you are looking for more details on the broad subjects covered you are going to need another reference.

Wednesday, February 06, 2013

Multi MySQL is perfect for SSDs and Removing Replication Lag


Ever run into a problem where an application's Master is accepting all the writes and the master's slaves lags? Ever had this same problem yet the Slave has a raid array of very fast SSDs? You must be thinking to yourself, "I call BS, Dathan; a SSD drive can do crazy IOPS". No, it's true and the reason is that mySQL only has two threads for replication.

Thread 1 is called the IO thread and its job is to download the binary logs from the server's master into relay logs on the slave. This is rarely a bottleneck since this is typically sequential disk access. Thread 2 is called the SQL thread. It's job is to take the data which was downloaded from the IO thread and apply it. Since there is 1 thread applying the SQL and that 1 thread has to read each row before writing it (that is just how a write works), that 1 Thread just couldn't move fast enough or use enough resources by itself and lags.

Here are my findings. The master was producing 8-10K IOPS of write traffic, yet with a single mySQL thread, it could only do 2K-2.5K IOPS with a bunch of tweaks and tunable I happen to know about.

Below is a graph




This is a graph of IOPS on a slave. From Sep to Dec, I wasted IOPS on such beautiful hardware with a single replication thread. My beliefe is if you're not utilizing 100% of your hardware resources where 100% is defined as the resources not reserved for growth you are wasting hardware. I needed more threads to keep writes caught up was my conclusion and my thought was I will either need to write my own replication layer which pushes data OR do something that just solves a bunch of problems. Multi-mySQL-Server Instance on a single Server is my solution to utilize more IOPS.

For this application, I need fast IOPS and a lot of disk-space. This particular application counts all the page views, records referrers, records uniques, etc for millions of sites across the web realtime. The owners or blessed administrators of the site are the only one that see detail traffic stats. The application produces roughly 6K-12K inserts a second. Also the data exists for at least two years and many readers of the data could be required in the future.

My approach to solve the application issues are the following.
  • Analyze the data. How is it written?
  • How many IOPS are produced today?
  • Will the IOP request rate grow over time?
  • Is the query pattern mainly inserts, updates, selects or all of the above? 
Upon analysis, I found that there is a lot of data, mainly new entries (rows) are written throughout the day. I found that traffic in terms of inserts are basically a function of viewed pages for a particular site. It just makes sense-more views more writes, thus more inserts. Since I know I need more IOPS on the slave I need more replication threads.  I need the ability to grow the number of replication threads over time. Also since data just grows I need to be able to grow disk space over time as well.

So I said what if I took the hostname converted it to a bigint and assigned it a database. I can in theory have a server per database. This will allow me to grow to at least 1 write server per hostname. Sharding is yet born again but within a single server ;).

Since 1 thread of SQL replication can use up to 2.5K IOPS, let's group all the hostnames into buckets/databases (note buckets and databases from this point is the same thing). Below is an image of grouping hosts into 28 buckets by the number of pageviews. I used a random distribution by taking 500K+ sites and summing their page-views.



Here we see that there is a pretty good distribution of pageviews per bucket except for the two major spikes. Thus isolating this grouping of very popular sites to a 4 more buckets gave it the same flat-ish type graph. Now I have 32 buckets to handle all the sites.

From other benchmarks done and not in this scope my benchmark-data points to a single server doing 40K IOPS+ on 1TB of Raid-5 SSD.  Taking the 32 buckets where each bucket is now a database I assign them to 4 mySQL servers. So each server has 8 databases. 4 mysql servers == 4 replication threads. This should give me 10K IOPS no problem thus keeping up with the master. Additionally not confuse people that are use to connecting to mySQL from the command line the multi-mySQL server config starts at port 3307 and ends at port 3310.




The Diagram above shows how the Application picks which server by hitting a specific port on the Physical Server. Based on the server name I know what mySQL server and port to hit. Let's delve more into this.

Slowing down the writing to the master is a no-no. Adding a network lookup for every hostname stat-type to figure out what bucket said hostname belongs to is not efficient. The application only knows the fully qualified domain name thus passing a numerical representation of the hostname is not possible.

To get around this I use consistent hashing-a calculation inside application space which converts a hostname to a numerical bigint representation. This numerical representation from the consistent hash allows me to build a range of ids to assign to a database. This approach allows me to move a block of hosts where a block can be a single host to another arbitrary bucket allowing growth past 32 databases.

Here is my lookup

        if ( $id < 91613391575) return 1;
        if ( $id >= 91613391575 && $id < 575617478762710049) return 1;
        if ( $id >= 575617478762710049 && $id < 1151652970532347073) return 2;
        if ( $id >= 1151652970532347073 && $id < 1728381697671314740) return 3;
        if ( $id >= 1728381697671314740 && $id < 2305255992978150432) return 4;
        if ( $id >= 2305255992978150432 && $id < 2879934343862663299) return 5;
        if ( $id >= 2879934343862663299 && $id < 3456586097308066472) return 6;
        if ( $id >= 3456586097308066472 && $id < 4034127590667991228) return 7;
        if ( $id >= 4034127590667991228 && $id < 4610886495740908215) return 8;
        if ( $id >= 4610886495740908215 && $id < 5188089445597040650) return 9;
        if ( $id >= 5188089445597040650 && $id < 5764363206983886628) return 10;
        if ( $id >= 5764363206983886628 && $id < 6340149158128856460) return 11;
        if ( $id >= 6340149158128856460 && $id < 6917639753751591942) return 12;
        if ( $id >= 6917639753751591942 && $id < 7494903696463075964) return 13;
        if ( $id >= 7494903696463075964 && $id < 8070423312614934443) return 14;
        if ( $id >= 8070423312614934443 && $id < 8645754150017603414) return 15;
        if ( $id >= 8645754150017603414 && $id < 9222136639934948903) return 16;
        if ( $id >= 9222136639934948903 && $id < 9797731625031589702) return 17;
        if ( $id >= 9797731625031589702 && $id < 10374872366398685937) return 18;
        if ( $id >= 10374872366398685937 && $id < 10951636964146180677) return 19;
        if ( $id >= 10951636964146180677 && $id < 11527508592358295485) return 20;
        if ( $id >= 11527508592358295485 && $id < 12104380263636744674) return 21;
        if ( $id >= 12104380263636744674 && $id < 12682234005971217988) return 22;
        if ( $id >= 12682234005971217988 && $id < 13259221689642660934) return 23;
        if ( $id >= 13259221689642660934 && $id < 13836008077021721634) return 24;
        if ( $id >= 13836008077021721634 && $id < 14411538489601958181) return 25;
        if ( $id >= 14411538489601958181 && $id < 14989084168978764931) return 26;
        if ( $id >= 14989084168978764931 && $id < 15565313755211130708) return 27;
        if ( $id >= 15565313755211130708 && $id < 16142081929461448429) return 28;
        if ( $id >= 16142081929461448429 && $id < 16717869929339795988) return 29;
        if ( $id >= 16717869929339795988 && $id < 17293687793251719004) return 30;
        if ( $id >= 17293687793251719004 && $id < 17869194657238150155) return 31;
        if ( $id >= 17869194657238150155 && $id < 18446689638439209978) return 32;
        return 32;

Looking at these number, you can see that at times they are separated by 1 trillion or so. This doesn't mean there is 1 trillion sites, it just means that 1 trillion sites could clump up in a bucket. I am not concern about that. I can take say 100 billion of a bucket/database, move the data to  another database say bucket/database 33 without a problem. I of course would need to let the users know that their data is unavailable during this move for the sites that fall with-in the 100 billion, but moving data is fast per site-so they will not be too inconvenienced.

Also you might wonder since a hash is being used, what if two distinct hostnames hash to the same id? Its pretty unlikely but if that is the case, the Primary KEY on the table is siteIdHash, Hostname.

In conclusion this is working great. With each ETL running on each webserver 2 Servers Handle the stats for 10s of millions of websites with the data always up to date and the redundant server has 0 seconds of replication lag at all times.

Tuesday, December 04, 2012

Raid Stripe Size, Raid Stripe Segment Size the definition and SSD consideration



For years now I have been confusing Stripe Segment Size with Stripe Size, when configuring my RAID arrays. I always thought that Stripe Size is the number of KB of a file written per disk before moving to the next disk in the array. So, if I had a stripe size of 16KB and I am writing a 20KB file-16KB would be on the first disk, 4KB would be on the second. This concept of moving to the next data bearing disk in the range is not the stripe size. This is the stripe segment size; in lays the confusion. The stripe size is 16K * Number of spindles. So when setting the stripe size in your raid array its the stripe segment size * the number of spindles. In this example of setting 16KB stripe size the stripe segment size is 16KB/# of data bearing disks.

After reading a pdf and this web page links provided to me from Richard Hesse, the definition is defined.

Let's define a stripe:
A set of contiguous segments spanning across member drives creates a stripe
For example, in a RAID 5,

4 + 1 disk group with a stripe segment size of 128KB, the first 128KB of an I/O is written to the first drive, the next 128KB to the next drive, and so on with a total stripe size of 512KB.
The stripe width is 4 since there are 4 spindles doing the work, 1 spindle for the parity bits.  The 4 spindles doing the work are also known as data bearing drives.

4 disk * 128K stripe segment size = 512K stripe size

or using algebra (ignoring units)

 512K stripe size / 4 disk = 128K stripe segment size
Thus when setting a stripe size this is the contiguous blocks on an array of stripe segments.

Why revisit this?

According to this acm article on SSD Anatomy the program/erase block is 8KB. For me Optimizing Raid speed for sequential access is less of a concern over Optimizing for Longevity. SSD's by nature have a life span based on the number of writes to an SSD drive.  Therefore, the process of erase/program starts killing off the longevity of a drive. Thus, having a large stripe and modifying that stripe can cause a large amounts of erase/programs killing longevity.

Richard Hesse current settings is a 32KB stripe across a RAID5 SSD array of 8 disks-that’s 7 data bearing drives. This means that to write a 32KB stripe, that’s 4.57KB per SSD data bearing drive which falls inline with the 8KB block programmed/erased increasing the longevity much more overall.

Why not use more Raid5 4+1 with a 32KB stripe to get a 8KB stripe element size per data bearing disk? For this setup the data is large and requires a lot of IOPS thus the best configuration for the requirements while still keeping longevity some what in check.


What about NON-SSD drives? For INNODB workloads, it’s all about tuning for how your application typically uses the database and how IOPS are used. INNODB DISK IOPS are rather predicable verses other Engine IOP usage. A Page is 16KB. When selecting rows, mySQL is pulling 1 to many pages. Many rows can fit in 1 page or span more then 1 page. INNODB typically tries to align pages so sequential disk access has a higher chance when pulling groupings of rows. Therefore, for spinning metal drives having a raid stripe size of 64KB-512KB is certainly plausible and recommended. Typically for my web apps, 256KB has been a good stripe size. My disk setup is typically Raid-10 across 8 2.5" 15K RPM drives, so that is 4 data bearing drives with a raid segment size of 256/8 == 32KB  or roughly 2 pages per data bearing disks. One thing of note, its really hard to find a sweet spot and typically you will notice better performance gains on spinning metal drives with XFS alignments su, sw options or follow this guide by Jay.





Friday, July 13, 2012

A Case for Automation of OpenNMS for XTRADB Backups with Python

If you read my blog from time to time, you know I am a huge advocate of automating tasks. If I have to do something more then once I might was well automate the task.
Currently at my new gig, we are using opennms. OpenNMS is a combo of Nagios, Ganglia talking over SNMP Java goodness. They recently released a REST API that is less then adequately documented making integration rather difficult. But, I am jumping ahead let's talk about the problem.


Backing up an INNODB Slave is rather easy with Percona's xtrabackup program. It's free, works, offers better features than its paid counter part. Additionally its opensource.

In my environment my algorithm is as follows:

Mount A Huge Disk over NFS which gets snapshotted
At the Start of the day (00:00:00 UTC) do a full backup.
The command used: innobackupex --user=root --password='***' $backupDir --parallel=8 --slave-info
At every top of the hour do an incremental backup if a full backup does not exist

The problem is for me to do an incremental backup I use the following command


my $INNOBACKUP="innobackupex --user=root --password='***' $incrementalDir --parallel=8 --incremental --slave-info --safe-slave-backup --incremental-basedir=";

$INNOBACKUP .= $lastIncremental;

--incremental says do an incremental backup

--slave-info says dump the slave info

--safe-slave-backup says stop the slave and start it back when the backup finishes *THE PROBLEM*

--incremental-basedir is the last successful incremental directory

The problem is I would get alerted every-time incremental runs forcing me to acknowledge the alert.

This is annoying so let's fix this with automation. (if the slave is off it's okay I am backing up the DR)
So the new algorithm is:

Mount
Set downtime over the OPENNMS Rest API
Backup
Remove downtime over the opennms rest API
Unmount


Setting downtime for OpenNMS is hard since the documentation is not helpful. Good news I was able to find some online code to see how things worked. OpenNMS is an opensource product with code viewable from fisheye. Anytime you want to figure out how an API works, look at the test code of said API or read the API twists and turns; I did just that by reading this.


http://fisheye.opennms.org/browse/opennms/opennms-webapp/src/test/java/org/opennms/web/rest/ScheduledOutagesRestServiceTest.java?hb=true

But nothing I found told me how to authenticate to use the REST API. Thus searching around and looking at the code shipped with Opennms I found a perl script called provision.pl in /opt/opennms/bin-it uses COOKIES!

Good thing I am good with PERL, thus I went ahead to try to do some simple fetches until I realize that getting LWP, HTTP and various other perl modules including ISBN::Data was near impossible. Making an RPM for each one is just a huge waste of time and forcing CPAN installs across N boxes sucks not to mention its just plain WRONG.

So I looked at my options; PYTHON is loaded by centos by default and has built in httplibs like urllib2, http, cookie etc. Perfect.
Below is the script

#!/usr/bin/python

#
# @author Dathan Vance Pattishall
# OpenNMS Schedule downtime script
#

import urllib, urllib2, cookielib, pprint, os, sys
import elementtree.ElementTree as ET
import time
from datetime import date, tzinfo, timedelta, datetime
from optparse import OptionParser

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="make lots of noise [default]")
parser.add_option("-t", "--downtime", dest="downtime", help="length of downtime")
parser.add_option("-c", "--contains", dest="contains", help="Schedule downtime for all nodes that contain this string")
parser.add_option("-l", "--like", dest="like", help="get servers with a wild card e.g. shard%-dr")
parser.add_option("-d", "--delete", action="store_true", dest="delete", default=False, help="delete and outage")
parser.add_option("-p", "--package",  dest="package", default="SFO", help="which nms package")



(options, args) = parser.parse_args()


base_url = 'https://enteropennmshosthere/opennms/'
auth_url = base_url + 'j_spring_security_check'
nodes_url = base_url + 'rest/nodes/'
sched_outage_url = base_url + 'rest/sched-outages/'




username = 'outagerole'
password = 'add_pass_here'

#
# get the hostname
#
hostname  = os.environ['HOSTNAME']
host_abbr = os.environ['HOSTNAME'].split('.')[0]

if not host_abbr :
    print("Environment is not setup correctly")
    sys.exit(1)

#
# set up the cookie jar
#
cj = cookielib.CookieJar()

#
# build opener returns an OpenerDirector: http://docs.python.org/library/urllib2.html?highlight=urllib2#urllib2.OpenerDirector0
#
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

#
# now the cookie will work with urlopen as a global for all following requests
# 
urllib2.install_opener(opener)

#
# log-in and set the cookie
#
login_data = urllib.urlencode({'j_username' : username, 'j_password' : password, 'Login': 'Login'})
opener.open(auth_url, login_data)

#
# get the nodes
#

if options.contains :

    url_data = { 'label' : options.contains, 'limit' : 0, 'comparator' : 'contains' }
    url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = options.contains + "-contiains-script-outage"

elif options.like :

    url_data = { 'label' : options.like + '%', 'limit' : 0, 'comparator' : 'ilike' }

url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = options.like.replace('%', 'WC') + "-ilike-script-outage"

else :
    #
    # get the hostname info to see if the host is in opennms
    #
    url_data = { 'label' : hostname, 'limit' : 0 }
    url_data = urllib.urlencode(url_data)
    resp = opener.open(nodes_url + '?' + url_data)
    outagename = host_abbr + "-script-outage"

#
# tree is an Element
# http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#element-objects
#
tree = ET.XML(resp.read())

#
# build a label name to id map
#
name_to_id_map = {}

for node in tree.getiterator('node') :
    if node.get('label') :
        name_to_id_map[node.get('label')] = node.get('id')

if not name_to_id_map and options.contains:
    print("This HOST [%s] is not in nms" % options.contains)
    sys.exit(1)

#
# delete an outage - really outage name is all that is needed 
#
if options.delete :
    try :
        print("Deleting Outage: %s" % outagename)
        req3 = urllib2.Request(sched_outage_url + outagename)
        req3.add_header('Content-Type', 'application/xml')
        req3.add_header('Content-Length', '0')
        req3.get_method = lambda: 'DELETE'
        r = urllib2.urlopen(req3)

    except urllib2.HTTPError :
        print("This Outage: %s was already deleted" % outagename)

    sys.exit(0)


#
# schedule downtime
#
print("Scheduling downtime for 1 hour Outage Name %s" % outagename)

start = datetime.today()
downtime = 1 #1 hour

if options.downtime :
    downtime = int(options.downtime) # units of hours


end = start + timedelta(hours=downtime)

end = end.strftime('%d-%b-%Y %H:%M:%S')
start = start.strftime('%d-%b-%Y %H:%M:%S')

print("Start of the downtime: %s" % start)
print("End of the downtime: %s" % end)

#
# build the request
#
req = urllib2.Request(sched_outage_url)
req.add_header('Content-Type', 'application/xml')


xml_str = "" + "<outage name=" + outagename + " type="specific">" + "<time begins=" + str(start) + " ends=" + str(end) + ">"

for nodename in name_to_id_map :
    xml_str += "<node id=" + name_to_id_map[nodename] + ">"

xml_str += "</node></time></outage>"

#
# send the XML
#
req.add_data(xml_str)
r = urllib2.urlopen(req)

#
# tell notifd to attach to the downtime
#
req2 = urllib2.Request(sched_outage_url + outagename + '/notifd')
req2.add_header('Content-Type', 'application/xml')
req2.add_header('Content-Length', '0')
req2.get_method = lambda: 'PUT'
r = urllib2.urlopen(req2)

#
# tell pollerd to attach to the downtime
#
req3 = urllib2.Request(sched_outage_url + outagename +'/pollerd/' + options.package)
req3.add_header('Content-Type', 'application/xml')
req3.add_header('Content-Length', '0')
req3.get_method = lambda: 'PUT'
r = urllib2.urlopen(req3)

sys.exit(0)

This is a quick and dirty script which I will eventually turn into a class to control openNMS from the command line. In summary it logs in with the specified user. Gets a cookie. Issues commands pulled from reading Java code (good thing I can code good in java as well). My main problem was searching to find that there are unpublished filters like comparator => contains and finding that the post structure for schedule-outages was not key value param but XML!!

public void testSetOutage() throws Exception {
       String url = "/sched-outages";
       String outage = "<?xml version=\"1.0\"?>" +
               "<outage name='test-outage' type='specific'>" +
               "<time day='friday' begins='13:20:00' ends='15:30:00' />" +
               "<time begins='17-Feb-2012 19:20:00' ends='18-Feb-2012 22:30:00' />" +
               "<node id='11' />" +
               "</outage>";
       sendPost(url, outage);
}

In summary everything works well. Backups are working and I am happy to not send pages. Eventually when I get around to it I'll upload this script to git. 

Monday, May 21, 2012

mySQL Partitioning summary and what to watch out for

I have been looking for efficient ways to purge data from "Fast Disks" for applications that are time based and do not look at data after a time window has passed. For instance keeping a table where one stores log data from access logs and the data is okay to roll up the data and throw away the details every month. Another example is keeping a log of invites got a Facebook user from a Facebook user. Since the Facebook News feed is time based, stream publish postIds from 4 months ago do not necessarily need to be hot-the user rarely looks 4 months in the pass. Additionally with Timeline, it is hard to see a single story from 10 months ago if it is not as important as other stories. Keeping the data is necessary to delete Facebook posts from your application. So, get rid of each Invite off fast disk, and archive it on slow disks just in case you need to see the data again.

One solution is to code some logic into the application to switch SQL tables based on the create date of the row and map it to a table. This works but its not very clean.

Another solution is to walk the table in question and delete the old rows with DELETE statements. This is not good. The reason is DELETE IS VERY slow. In fact, the rule of thumb is this. You can do 10000s of SELECTS per seconds 10000s of Updates per second 1000s of Inserts per second, 10s of deletes a second. (I am hand waving and being general here). Additionally for INNODB (from this point just assume I am putting everything in the context of using INNODB) DELETES undo references are stored in the master ibdata file for innodb growing said data file unbounded over time. Ever notice that it grows even with innodb_file_per_table on? That's from deletes actions and is useless data. Additionally the only way to shrink that file is to do a full export to text and import after blowing away the master ibdata file(s).


Another solution is a mySQL partition table, think of it as a layer that sits in-front of the storage engine and relies heavily on the optimizer.  The partition table maps statements based on the partition setup to the correct underlying tables.  So, if you have 12 partitions for a table called FacebookInviteHistory, there will be 12 STORAGE ENGINE TABLES for the partition table FacebookInviteHistory in the format of

FacebookInviteHistory#P#<PartitionName>.ibd

Where FacebookInviteHistory is a pointer table in the format of

FacebookInviteHistory.par
FacebookInviteHistory.frm


The benefits of using partition table is the following:
  • Underlying tables can be assigned to specific disk media. 
  • Aggregate functions such as sum and count can easily be parallelized across all partitions providing very quick access to results. 
  • Data that loses it usefulness can be easily removed by dropping the partition containing only that data. 
  • Finally queries can be greatly optimized by the fact that data satisfying a given where clause can be stored only on 1 or more partitions which automatically excludes looking at other partitions-this is called pruning.
These are benefits  IF you set up partitioning correctly AND the partition engine gets enough info from the optimizer to pick the correct partition, else by default it queries ALL partitions. Querying all tables messes up the rule of thumb by reducing that 10K select number to 1K even if the data is not in the other partitions due to the fact of wasted iops and traveling the btree to be told that "Dude the data is not here".


This being said it still looks like a good feature to use but I have some questions which Google could not answer for me.

Question: Does adding or dropping partitions lock other partitions?
Short Answer: No
Long Answer:

For this table:

CREATE TABLE PartitionTest (
 senderId bigint(20) unsigned NOT NULL,
 recipientId bigint(20) unsigned NOT NULL,
 createDate datetime NOT NULL DEFAULT 0,
 PRIMARY KEY (senderId, recipientId, createDate)
)
PARTITION BY RANGE ( TO_DAYS(createDate) ) (
    PARTITION Jan2012 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION Feb2012 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION Mar2012 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION Apr2012 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION May2012 VALUES LESS THAN (TO_DAYS('2012-06-01')),
    PARTITION Jun2012 VALUES LESS THAN (TO_DAYS('2012-07-01')),
    PARTITION Jul2012 VALUES LESS THAN (TO_DAYS('2012-08-01'))
);



I wrote a benchmarking tool that simulates my type of concurrency writing at double the rate. Think of this tool as mysqlslap but specific for my table structures and load. Next I ran an administration command to add a partition drop it and add it again.

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest DROP PARTITION Aug2012;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0



QPS from the tool did not drop.


Question: Can you create a circular queue of Partition tables?
Short Answer: No in 5.1, Yes in 5.6 sort of.

Long Answer:

Lets imagine you wanted to create 12 partitions with a RANGE ON MONTH(createDate) each month would map to one of the 12 partitions. The goal is to drop data that is over two months old, so on June 1st-Jan to April's data can be dropped. This does not work in 5.1. In 5.1 you need to add ranges that is greater then the previous range. In 5.6 you can truncate a partition. But in both mysql versions pruning just does not work on a range on MONTH of createDate so its just not an efficient partition for reads. For date and datetime fields TO_DAYS and YEAR, are pruning friendly. In 5.6 TO_SECONDS is pruning friendly. For some good reading check out Mikael's blog on 5.1 and 5.6 date columns and pruning. Try not to use timestamp columns when partitioning data. Its been very buggy in the past and does not work well with pruning. Even though you may save 4 bytes per row with timestamp the work is just not in place to partition TIMESTAMP COLUMNS.

Question: Can you force reading from a partition?
Short Answer: Yes with some work in 5.1 but in 5.6 the feature is better.

Long Answer: If you know how the data is partitioned you can construct SQL to read the entire partition in 5.1. In 5.6 there is a command to read from a partition.

Question: For a partition of months why not use a LIST Type Partition? *From a comment below but made pretty*
Short Answer: datetime doesn't support MONTH for pruning.
Long Answer: Pruning on Lists for datetime is only optimized for TO_DAYS, YEAR according to the documentation. Every partition would be queried.

DROP TABLE IF EXISTS PartitionTest;

CREATE TABLE PartitionTest (
   senderId bigint(20) unsigned NOT NULL,
   recipientId bigint(20) unsigned NOT NULL,
   createDate datetime NOT NULL DEFAULT 0,
   PRIMARY KEY (senderId, recipientId, createDate)
)
PARTITION BY LIST ( MONTH(createDate) ) (
PARTITION Jan VALUES IN (1),
PARTITION Feb VALUES IN (2),
PARTITION Mar VALUES IN (3),
PARTITION Apr VALUES IN (4),
PARTITION May VALUES IN (5),
PARTITION Jun VALUES IN (6),
PARTITION July VALUES IN (7)
);




mysql> SELECT PARTITION_NAME, TABLE_ROWS, PARTITION_EXPRESSION, PARTITION_DESCRIPTION FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'PartitionTest';
+----------------+------------+----------------------+-----------------------+
| PARTITION_NAME | TABLE_ROWS | PARTITION_EXPRESSION | PARTITION_DESCRIPTION |
+----------------+------------+----------------------+-----------------------+
|            Jan |        101 |    MONTH(createDate) |                     1 |
|            Feb |        101 |    MONTH(createDate) |                     2 |
|            Mar |        101 |    MONTH(createDate) |                     3 |
|            Apr |          0 |    MONTH(createDate) |                     4 |
|            May |          0 |    MONTH(createDate) |                     5 |
|            Jun |          0 |    MONTH(createDate) |                     6 |
|           July |          0 |    MONTH(createDate) |                     7 |
+----------------+------------+----------------------+-----------------------+
7 rows in set (0.01 sec)

mysql> DESCRIBE PARTITIONS SELECT * FROM PartitionTest WHERE createDate > '2012-01-01' AND createDate < '2012-02-01';
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table         | partitions                   | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 |      SIMPLE | PartitionTest | Jan,Feb,Mar,Apr,May,Jun,July | index | NULL          | PRIMARY | 24      | NULL | 307  | Using where; Using index |
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
In summary I am rolling out partition tables to over 1.6 TB of data across 10 shards. So far so good.

Monday, May 14, 2012

Speeding up EC2 work by using AWS tools and scripts to bypass the AWS management console

Believe me managing EC2 instances is not as simple or magical as marketers would like for you to believe. The cloud gets complicated when it gets bigger. EC2 really only enables a person to ignore power, network layout (which is bad), and getting more servers that end up costing more then actual servers fast.

Things that EC2 is missing to make life easier for the developer:

Ability to update all servers with packages and code. Natively they do not support the ability to push files or install new software packages to server groups. Thus install cluster-it and puppet and write your own deploy program.

Server names and the EC2 AWS management console do not match. Everything is referenced by instance ids. The interface does not allow one to launch many instances in a named pattern so you have to go back and sync up the instance with the internal named used in the app. Syncing up the names is very tedious and time consuming process through the console.

Assigning EBS volumes is a pain in the ass as well. Essentially you need to assign them one by one, where each add takes more then 2 mins todo this is not good time spent. For instance it took me roughly 20 mins to attach 8 63GB EBS volumes to a single server.

Amazon is perfectly aware of these limitations-and do not hide from it. They are going after something bigger. They are providing a platform. There is an API for everything you need to make your work easier. Now the pain is to learn the API and use it in your favor. There are companies that built a business on making a better interface for the AWS console but getting it done yourself is cheaper.

My personal mantra is to automate things that I have to do more then once. Anytime that I deploy new instances, I take the private IP add it to DNS, make an API call through ec2-describe-instances, find the instance id and update the name through ec2-create-tags. This solves the problem that I have with mapping instance ids to my internal name which the app uses. For instance:
 
#!/usr/bin/perl -w 
#
#
use strict;
use Data::Dumper;
open(HOSTS, "</etc/hosts") or die($!);

my $hosts = {};
while(<HOSTS>){
    my ($ip, $hostname, undef) = split(/\s+/, $_);
    $hosts->{$ip} = $hostname;
}

 
 
open(FH, "/opt/aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1|") or die($!);
while(<fh>){
    if ($_ =~ /^INSTANCE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        # 0 - instance
        # 1 - ami
        # 2 - public dns
        # 3 - private dns
        # 4 - state
        # 5 - ??
        # 6 - ??
        # 7 - instance type
        # 8 - date created
        # 9 - DC
        # 10 - ??
        # 11 - monitoring state
        # 12 - public ip
        # 13 - private ip
        # 14 - ebs
        # 15 - ??
 
     if ($fields[4] eq 'running'){
            my $role;
            my $hostname = $hosts->{$fields[13]};
             
            if (!$hostname) {
                print "$fields[13] is not in the hosts file skipping..\n";
                next;
            }
            if ($hostname =~ /^job/){
                $role = 'gearman-worker';
            }
            if ($hostname =~ /^gearman/){
                $role = 'gearman-queue';
            }
            if ($hostname =~ /^www/){
                $role = 'webserver';
            }
            if ($hostname =~ /^memc/){
                $role = 'memcache';
            }
            if ($hostname =~ /^db/){
                $role = 'database';
            }
            if ($hostname =~ /^dbshard/) {
                $role = 'database-shard';
            }
            if (!$role){
                print "$hostname does not have a role\n";
                $role = 'other';
            }
            system("./aws/bin/ec2-create-tags -C cert.pem -K x509.pem --region us-west-1 ".$fields[0] ." --tag Na
me=$hostname --tag Role=$role");
        }
    }
}

Now to attach disks to an instance, that I am upgrading or re-purposing I wrote a quick script that describes the input instance after translating from my internal name to instance id. Calculates the size of each disk and attaches said disks. For instance:
 
#!/usr/bin/perl -w
#
#
use strict;
use Data::Dumper;
use POSIX qw(ceil);
 
print "Enter Hostname: ";
my $hostname = <>;
chomp($hostname);
my $cmd = './aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1 --filter="tag-key=Name" --
filter="tag-value=' . $hostname . '"';
 
open(FH, "$cmd|") or die ("Awesome death: $!\n");
 
my $instance = "";
my $lastDisk = "";
my $diskCount = "";
while(<FH>){
 
 
if($_ =~ /^INSTANCE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        $instance = $fields[0];
        print "Instance=$instance\n";
    }
  
    if($_ =~ /^BLOCKDEVICE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        $diskCount++;
        $lastDisk = $fields[0];
        print "$lastDisk\n";
    }
}
 
print "How many disks you would like to add: ";
my $totalAddDisks = <>;
chomp($totalAddDisks);
 
print "You picked $totalAddDisks\n";
print "What is the total size of the Raid0 Array in GB: ";
my $totalSize = <>;
chomp($totalSize);
 
print "You picked $totalSize GB\n";
my $sizeperdisk = ceil($totalSize/$totalAddDisks);
print "The size per disk: $sizeperdisk\n";
 
$lastDisk =~ /sd(\S)/;
my $lastDeviceLetter = $1;
my @devicesavail = ($lastDeviceLetter .. 'z');
 
for(my $i = 1; $i <= $totalAddDisks; $i++){
    $cmd = "./aws/bin/ec2-create-volume --size $sizeperdisk --region us-west-1 --availability-zone us-west-1c -C cert.pem  -K
 x509.pem";
  
    my $ret = `$cmd`;
    my (@output) = split(/\s+/, $ret);
    
    $cmd = "./aws/bin/ec2-attach-volume --region us-west-1 -C cert.pem -K x509.pem $output[1] --instance $instanc
e --device /dev/sd$devicesavail[$i]";
    $ret = `$cmd`;
    
    if ($ret =~ /attaching/){
        print "All good do the next one\n";
    } else {
        die("Did not work\n");
    }
}

These are rough and dirty scripts that get the job done for my environment. The end goal when given time is to turn these scripts into a package talking over httpd that makes life easier when working in EC2. Using these two script have reduced the management time from 1 hour per server upgrade to a few minutes.