Monday, May 21, 2012

mySQL Partitioning summary and what to watch out for

I have been looking for efficient ways to purge data from "Fast Disks" for applications that are time based and do not look at data after a time window has passed. For instance keeping a table where one stores log data from access logs and the data is okay to roll up the data and throw away the details every month. Another example is keeping a log of invites got a Facebook user from a Facebook user. Since the Facebook News feed is time based, stream publish postIds from 4 months ago do not necessarily need to be hot-the user rarely looks 4 months in the pass. Additionally with Timeline, it is hard to see a single story from 10 months ago if it is not as important as other stories. Keeping the data is necessary to delete Facebook posts from your application. So, get rid of each Invite off fast disk, and archive it on slow disks just in case you need to see the data again.

One solution is to code some logic into the application to switch SQL tables based on the create date of the row and map it to a table. This works but its not very clean.

Another solution is to walk the table in question and delete the old rows with DELETE statements. This is not good. The reason is DELETE IS VERY slow. In fact, the rule of thumb is this. You can do 10000s of SELECTS per seconds 10000s of Updates per second 1000s of Inserts per second, 10s of deletes a second. (I am hand waving and being general here). Additionally for INNODB (from this point just assume I am putting everything in the context of using INNODB) DELETES undo references are stored in the master ibdata file for innodb growing said data file unbounded over time. Ever notice that it grows even with innodb_file_per_table on? That's from deletes actions and is useless data. Additionally the only way to shrink that file is to do a full export to text and import after blowing away the master ibdata file(s).


Another solution is a mySQL partition table, think of it as a layer that sits in-front of the storage engine and relies heavily on the optimizer.  The partition table maps statements based on the partition setup to the correct underlying tables.  So, if you have 12 partitions for a table called FacebookInviteHistory, there will be 12 STORAGE ENGINE TABLES for the partition table FacebookInviteHistory in the format of

FacebookInviteHistory#P#<PartitionName>.ibd

Where FacebookInviteHistory is a pointer table in the format of

FacebookInviteHistory.par
FacebookInviteHistory.frm


The benefits of using partition table is the following:
  • Underlying tables can be assigned to specific disk media. 
  • Aggregate functions such as sum and count can easily be parallelized across all partitions providing very quick access to results. 
  • Data that loses it usefulness can be easily removed by dropping the partition containing only that data. 
  • Finally queries can be greatly optimized by the fact that data satisfying a given where clause can be stored only on 1 or more partitions which automatically excludes looking at other partitions-this is called pruning.
These are benefits  IF you set up partitioning correctly AND the partition engine gets enough info from the optimizer to pick the correct partition, else by default it queries ALL partitions. Querying all tables messes up the rule of thumb by reducing that 10K select number to 1K even if the data is not in the other partitions due to the fact of wasted iops and traveling the btree to be told that "Dude the data is not here".


This being said it still looks like a good feature to use but I have some questions which Google could not answer for me.

Question: Does adding or dropping partitions lock other partitions?
Short Answer: No
Long Answer:

For this table:

CREATE TABLE PartitionTest (
 senderId bigint(20) unsigned NOT NULL,
 recipientId bigint(20) unsigned NOT NULL,
 createDate datetime NOT NULL DEFAULT 0,
 PRIMARY KEY (senderId, recipientId, createDate)
)
PARTITION BY RANGE ( TO_DAYS(createDate) ) (
    PARTITION Jan2012 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION Feb2012 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION Mar2012 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION Apr2012 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION May2012 VALUES LESS THAN (TO_DAYS('2012-06-01')),
    PARTITION Jun2012 VALUES LESS THAN (TO_DAYS('2012-07-01')),
    PARTITION Jul2012 VALUES LESS THAN (TO_DAYS('2012-08-01'))
);



I wrote a benchmarking tool that simulates my type of concurrency writing at double the rate. Think of this tool as mysqlslap but specific for my table structures and load. Next I ran an administration command to add a partition drop it and add it again.

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest DROP PARTITION Aug2012;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0



QPS from the tool did not drop.


Question: Can you create a circular queue of Partition tables?
Short Answer: No in 5.1, Yes in 5.6 sort of.

Long Answer:

Lets imagine you wanted to create 12 partitions with a RANGE ON MONTH(createDate) each month would map to one of the 12 partitions. The goal is to drop data that is over two months old, so on June 1st-Jan to April's data can be dropped. This does not work in 5.1. In 5.1 you need to add ranges that is greater then the previous range. In 5.6 you can truncate a partition. But in both mysql versions pruning just does not work on a range on MONTH of createDate so its just not an efficient partition for reads. For date and datetime fields TO_DAYS and YEAR, are pruning friendly. In 5.6 TO_SECONDS is pruning friendly. For some good reading check out Mikael's blog on 5.1 and 5.6 date columns and pruning. Try not to use timestamp columns when partitioning data. Its been very buggy in the past and does not work well with pruning. Even though you may save 4 bytes per row with timestamp the work is just not in place to partition TIMESTAMP COLUMNS.

Question: Can you force reading from a partition?
Short Answer: Yes with some work in 5.1 but in 5.6 the feature is better.

Long Answer: If you know how the data is partitioned you can construct SQL to read the entire partition in 5.1. In 5.6 there is a command to read from a partition.

Question: For a partition of months why not use a LIST Type Partition? *From a comment below but made pretty*
Short Answer: datetime doesn't support MONTH for pruning.
Long Answer: Pruning on Lists for datetime is only optimized for TO_DAYS, YEAR according to the documentation. Every partition would be queried.

DROP TABLE IF EXISTS PartitionTest;

CREATE TABLE PartitionTest (
   senderId bigint(20) unsigned NOT NULL,
   recipientId bigint(20) unsigned NOT NULL,
   createDate datetime NOT NULL DEFAULT 0,
   PRIMARY KEY (senderId, recipientId, createDate)
)
PARTITION BY LIST ( MONTH(createDate) ) (
PARTITION Jan VALUES IN (1),
PARTITION Feb VALUES IN (2),
PARTITION Mar VALUES IN (3),
PARTITION Apr VALUES IN (4),
PARTITION May VALUES IN (5),
PARTITION Jun VALUES IN (6),
PARTITION July VALUES IN (7)
);




mysql> SELECT PARTITION_NAME, TABLE_ROWS, PARTITION_EXPRESSION, PARTITION_DESCRIPTION FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'PartitionTest';
+----------------+------------+----------------------+-----------------------+
| PARTITION_NAME | TABLE_ROWS | PARTITION_EXPRESSION | PARTITION_DESCRIPTION |
+----------------+------------+----------------------+-----------------------+
|            Jan |        101 |    MONTH(createDate) |                     1 |
|            Feb |        101 |    MONTH(createDate) |                     2 |
|            Mar |        101 |    MONTH(createDate) |                     3 |
|            Apr |          0 |    MONTH(createDate) |                     4 |
|            May |          0 |    MONTH(createDate) |                     5 |
|            Jun |          0 |    MONTH(createDate) |                     6 |
|           July |          0 |    MONTH(createDate) |                     7 |
+----------------+------------+----------------------+-----------------------+
7 rows in set (0.01 sec)

mysql> DESCRIBE PARTITIONS SELECT * FROM PartitionTest WHERE createDate > '2012-01-01' AND createDate < '2012-02-01';
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table         | partitions                   | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 |      SIMPLE | PartitionTest | Jan,Feb,Mar,Apr,May,Jun,July | index | NULL          | PRIMARY | 24      | NULL | 307  | Using where; Using index |
+----+-------------+---------------+------------------------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
In summary I am rolling out partition tables to over 1.6 TB of data across 10 shards. So far so good.

Monday, May 14, 2012

Speeding up EC2 work by using AWS tools and scripts to bypass the AWS management console

Believe me managing EC2 instances is not as simple or magical as marketers would like for you to believe. The cloud gets complicated when it gets bigger. EC2 really only enables a person to ignore power, network layout (which is bad), and getting more servers that end up costing more then actual servers fast.

Things that EC2 is missing to make life easier for the developer:

Ability to update all servers with packages and code. Natively they do not support the ability to push files or install new software packages to server groups. Thus install cluster-it and puppet and write your own deploy program.

Server names and the EC2 AWS management console do not match. Everything is referenced by instance ids. The interface does not allow one to launch many instances in a named pattern so you have to go back and sync up the instance with the internal named used in the app. Syncing up the names is very tedious and time consuming process through the console.

Assigning EBS volumes is a pain in the ass as well. Essentially you need to assign them one by one, where each add takes more then 2 mins todo this is not good time spent. For instance it took me roughly 20 mins to attach 8 63GB EBS volumes to a single server.

Amazon is perfectly aware of these limitations-and do not hide from it. They are going after something bigger. They are providing a platform. There is an API for everything you need to make your work easier. Now the pain is to learn the API and use it in your favor. There are companies that built a business on making a better interface for the AWS console but getting it done yourself is cheaper.

My personal mantra is to automate things that I have to do more then once. Anytime that I deploy new instances, I take the private IP add it to DNS, make an API call through ec2-describe-instances, find the instance id and update the name through ec2-create-tags. This solves the problem that I have with mapping instance ids to my internal name which the app uses. For instance:
 
#!/usr/bin/perl -w 
#
#
use strict;
use Data::Dumper;
open(HOSTS, "</etc/hosts") or die($!);

my $hosts = {};
while(<HOSTS>){
    my ($ip, $hostname, undef) = split(/\s+/, $_);
    $hosts->{$ip} = $hostname;
}

 
 
open(FH, "/opt/aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1|") or die($!);
while(<fh>){
    if ($_ =~ /^INSTANCE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        # 0 - instance
        # 1 - ami
        # 2 - public dns
        # 3 - private dns
        # 4 - state
        # 5 - ??
        # 6 - ??
        # 7 - instance type
        # 8 - date created
        # 9 - DC
        # 10 - ??
        # 11 - monitoring state
        # 12 - public ip
        # 13 - private ip
        # 14 - ebs
        # 15 - ??
 
     if ($fields[4] eq 'running'){
            my $role;
            my $hostname = $hosts->{$fields[13]};
             
            if (!$hostname) {
                print "$fields[13] is not in the hosts file skipping..\n";
                next;
            }
            if ($hostname =~ /^job/){
                $role = 'gearman-worker';
            }
            if ($hostname =~ /^gearman/){
                $role = 'gearman-queue';
            }
            if ($hostname =~ /^www/){
                $role = 'webserver';
            }
            if ($hostname =~ /^memc/){
                $role = 'memcache';
            }
            if ($hostname =~ /^db/){
                $role = 'database';
            }
            if ($hostname =~ /^dbshard/) {
                $role = 'database-shard';
            }
            if (!$role){
                print "$hostname does not have a role\n";
                $role = 'other';
            }
            system("./aws/bin/ec2-create-tags -C cert.pem -K x509.pem --region us-west-1 ".$fields[0] ." --tag Na
me=$hostname --tag Role=$role");
        }
    }
}

Now to attach disks to an instance, that I am upgrading or re-purposing I wrote a quick script that describes the input instance after translating from my internal name to instance id. Calculates the size of each disk and attaches said disks. For instance:
 
#!/usr/bin/perl -w
#
#
use strict;
use Data::Dumper;
use POSIX qw(ceil);
 
print "Enter Hostname: ";
my $hostname = <>;
chomp($hostname);
my $cmd = './aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1 --filter="tag-key=Name" --
filter="tag-value=' . $hostname . '"';
 
open(FH, "$cmd|") or die ("Awesome death: $!\n");
 
my $instance = "";
my $lastDisk = "";
my $diskCount = "";
while(<FH>){
 
 
if($_ =~ /^INSTANCE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        $instance = $fields[0];
        print "Instance=$instance\n";
    }
  
    if($_ =~ /^BLOCKDEVICE\t(.*)/){
        my (@fields) = split(/\s+/, $1);
        $diskCount++;
        $lastDisk = $fields[0];
        print "$lastDisk\n";
    }
}
 
print "How many disks you would like to add: ";
my $totalAddDisks = <>;
chomp($totalAddDisks);
 
print "You picked $totalAddDisks\n";
print "What is the total size of the Raid0 Array in GB: ";
my $totalSize = <>;
chomp($totalSize);
 
print "You picked $totalSize GB\n";
my $sizeperdisk = ceil($totalSize/$totalAddDisks);
print "The size per disk: $sizeperdisk\n";
 
$lastDisk =~ /sd(\S)/;
my $lastDeviceLetter = $1;
my @devicesavail = ($lastDeviceLetter .. 'z');
 
for(my $i = 1; $i <= $totalAddDisks; $i++){
    $cmd = "./aws/bin/ec2-create-volume --size $sizeperdisk --region us-west-1 --availability-zone us-west-1c -C cert.pem  -K
 x509.pem";
  
    my $ret = `$cmd`;
    my (@output) = split(/\s+/, $ret);
    
    $cmd = "./aws/bin/ec2-attach-volume --region us-west-1 -C cert.pem -K x509.pem $output[1] --instance $instanc
e --device /dev/sd$devicesavail[$i]";
    $ret = `$cmd`;
    
    if ($ret =~ /attaching/){
        print "All good do the next one\n";
    } else {
        die("Did not work\n");
    }
}

These are rough and dirty scripts that get the job done for my environment. The end goal when given time is to turn these scripts into a package talking over httpd that makes life easier when working in EC2. Using these two script have reduced the management time from 1 hour per server upgrade to a few minutes.

Monday, May 07, 2012

mySQL 5.6 looks very promising

I have been doing some research lately on various mySQL related features and found myself distracted by mySQL 5.6. If everything that I am reading is correct, this should be the best mySQL version yet.

mySQL 5.6 will have multi thread replication, making time delayed slaves-an actual feature and not a byproduct of replicating high write volumes. Also, crash safe replication! When a master crashed and corrupted the binary log this was a pain in the ass to fix. In the past I had to write scripts to walk the primary key and do a checksum on each returned table chunk and pick which row was correct and which one was not.

The performance optimization for innodb that addresses some stalls is as exciting as multiple SQL threads for replication. One major change is in the stat layer, which was throttling throughput for in memory workloads at high thread concurrency. You can read the details here.

The optimizer is getting an overhaul that has been needed for some time. This is exciting but by habit I still will probably just use (FORCE|USE) INDEX. Here is an example of using FORCE INDEX to get better performance.





This stat shows innodb hitting the disk. A single query was flopping between two indexes producing 33% MORE disk reads then necessary. Making a single line change on that query gave me more capacity in EC2. (Don't hit the disk in EC2). In theory, the changes to the optimizer will reduce these flip-flops of indexes - but I doubt it will be as good as a human picking the index for their query.

Finally more and more stats from INFORMATION_SCHEMA which will probably create a flurry of bugs on the 1st couple days of release as the database hits large installs, and unexpected workloads exposes some stuff in the overall code - as everyone will start using information more to show dips in throughput.


In conclusion, 5.6 looks awesome and I can't wait for a Percona Port.