Monday, May 21, 2012

MySQL Partitioning summary and what to watch out for

I have been looking for efficient ways to purge data from "fast disks" for applications that are time-based and do not look at data after a time window has passed. One example is a table storing access-log data, where it is okay to roll the data up and throw away the details every month. Another example is keeping a log of invites sent from one Facebook user to another. Since the Facebook News Feed is time-based, stream-publish postIds from 4 months ago do not necessarily need to be hot; the user rarely looks 4 months into the past. Additionally, with Timeline it is hard to see a single story from 10 months ago if it is not as important as other stories. Keeping the data is still necessary in order to delete Facebook posts from your application. So, get rid of each invite on fast disk, and archive it on slow disks just in case you need to see the data again.

One solution is to code logic into the application that maps each row to a table based on the row's create date. This works, but it's not very clean.

Another solution is to walk the table in question and delete the old rows with DELETE statements. This is not good, because DELETE is VERY slow. In fact, the rule of thumb is this (I am hand-waving and being general here):
  • 10,000s of SELECTs per second
  • 10,000s of UPDATEs per second
  • 1,000s of INSERTs per second
  • 10s of DELETEs per second
Additionally, for InnoDB (from this point on, assume everything is in the context of InnoDB), the undo records for DELETEs are stored in the master ibdata file, growing that data file unbounded over time. Ever notice that it grows even with innodb_file_per_table on? That's from delete actions, and it is useless data. The only way to shrink that file is to do a full export to text and re-import after blowing away the master ibdata file(s).

Another solution is a MySQL partitioned table. Think of it as a layer that sits in front of the storage engine and relies heavily on the optimizer. The partition layer maps statements, based on the partition setup, to the correct underlying tables. So, if you have 12 partitions for a table called FacebookInviteHistory, there will be 12 STORAGE ENGINE TABLES for the partitioned table FacebookInviteHistory, in the format of


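Something like the following sketch, which assumes MySQL 5.1 with innodb_file_per_table on and monthly partition names; each partition gets its own tablespace file using the `#P#` naming scheme:

```
FacebookInviteHistory#P#Jan2012.ibd
FacebookInviteHistory#P#Feb2012.ibd
FacebookInviteHistory#P#Mar2012.ibd
...
FacebookInviteHistory#P#Dec2012.ibd
```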
Where FacebookInviteHistory is a pointer table in the format of


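On disk the pointer table holds no row data of its own, just the table definition and the partition layout; again assuming 5.1's file layout, that is roughly:

```
FacebookInviteHistory.frm   -- table definition
FacebookInviteHistory.par   -- partition definitions
```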
The benefits of using a partitioned table are the following:
  • Underlying tables can be assigned to specific disk media. 
  • Aggregate functions such as SUM and COUNT can easily be parallelized across all partitions, providing very quick access to results. 
  • Data that loses its usefulness can easily be removed by dropping the partition containing only that data. 
  • Finally, queries can be greatly optimized by the fact that data satisfying a given WHERE clause can be stored on only one or a few partitions, which automatically excludes looking at the other partitions. This is called pruning.
These are benefits IF you set up partitioning correctly AND the partition engine gets enough info from the optimizer to pick the correct partition; otherwise, by default, it queries ALL partitions. Querying all the underlying tables messes up the rule of thumb above, reducing that 10K-selects-per-second number to 1K even when the data is not in the other partitions, due to wasted iops traveling each btree just to be told "Dude, the data is not here".
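For example, against the RANGE-partitioned PartitionTest table defined below, a WHERE clause on the partitioning column lets the optimizer prune. This is a sketch of checking that pruning actually kicks in:

```sql
-- Pruning in action: the WHERE clause maps to a single partition
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM PartitionTest
WHERE createDate >= '2012-05-01' AND createDate < '2012-06-01';
-- The partitions column of the plan should list only May2012,
-- not all seven partitions.
```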

That being said, it still looks like a good feature to use, but I have some questions which Google could not answer for me.

Question: Does adding or dropping partitions lock other partitions?
Short Answer: No
Long Answer:

For this table:

CREATE TABLE PartitionTest (
 senderId bigint(20) unsigned NOT NULL,
 recipientId bigint(20) unsigned NOT NULL,
 createDate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
 PRIMARY KEY (senderId, recipientId, createDate)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(createDate)) (
    PARTITION Jan2012 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION Feb2012 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION Mar2012 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION Apr2012 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION May2012 VALUES LESS THAN (TO_DAYS('2012-06-01')),
    PARTITION Jun2012 VALUES LESS THAN (TO_DAYS('2012-07-01')),
    PARTITION Jul2012 VALUES LESS THAN (TO_DAYS('2012-08-01'))
);

I wrote a benchmarking tool that simulates my type of concurrency, writing at double the production rate. Think of this tool as mysqlslap, but specific to my table structures and load. Next, I ran administration commands to add a partition, drop it, and add it again.

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest DROP PARTITION Aug2012;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE PartitionTest ADD PARTITION (PARTITION Aug2012 VALUES LESS THAN (TO_DAYS('2012-09-01')));
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

QPS from the tool did not drop.
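The tool itself is not public, but the shape of it is simple. Here is a minimal sketch in Python of the same idea, where the actual INSERT against PartitionTest is replaced by a stub callable (the MySQL connection handling is omitted); `run_load` and its parameters are my names, not the real tool's:

```python
import threading
import time

def run_load(work, n_threads=8, duration=2.0):
    """Call work() from n_threads threads for `duration` seconds
    and return the observed calls per second (QPS)."""
    counts = [0] * n_threads          # one slot per thread, no lock needed
    deadline = time.time() + duration

    def loop(i):
        while time.time() < deadline:
            work()                    # in the real tool: execute an INSERT
            counts[i] += 1

    threads = [threading.Thread(target=loop, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / duration

# Stand-in for a write against PartitionTest; the interesting part is
# watching whether QPS drops while ALTER TABLE ... ADD/DROP PARTITION
# runs in another session.
qps = run_load(lambda: None, n_threads=4, duration=0.5)
```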

Question: Can you create a circular queue of Partition tables?
Short Answer: No in 5.1; sort of, in 5.6.

Long Answer:

Let's imagine you wanted to create 12 partitions with a RANGE on MONTH(createDate), so each month maps to one of the 12 partitions. The goal is to drop data that is over two months old; on June 1st, January through April's data can be dropped. This does not work in 5.1. In 5.1 you can only add ranges that are greater than the previous range. In 5.6 you can truncate a partition. But in both MySQL versions, pruning just does not work on a RANGE over MONTH(createDate), so it is not an efficient partitioning scheme for reads. For DATE and DATETIME columns, TO_DAYS and YEAR are pruning-friendly; in 5.6, TO_SECONDS is also pruning-friendly. For some good reading, check out Mikael's blog on 5.1 and 5.6 date columns and pruning. Try not to use TIMESTAMP columns when partitioning data: it has been very buggy in the past and does not work well with pruning. Even though you may save 4 bytes per row with TIMESTAMP, the work is just not in place to partition TIMESTAMP columns.
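In 5.6, the closest thing to a circular queue is keeping the 12 month slots and emptying the expired one in place. A sketch, assuming the LIST-on-MONTH layout shown later in this post:

```sql
-- MySQL 5.6 only: empty a month's slot without dropping it, so the
-- same partition can be refilled when that month comes around again
ALTER TABLE PartitionTest TRUNCATE PARTITION Jan;
```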

Question: Can you force reading from a partition?
Short Answer: Yes, with some work in 5.1; in 5.6 the feature is built in.

Long Answer: If you know how the data is partitioned, you can construct SQL in 5.1 whose WHERE clause prunes down to the entire partition you want. In 5.6 there is explicit syntax to read from a named partition.
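A sketch of both approaches, using the RANGE-partitioned table above:

```sql
-- 5.1: construct a WHERE clause that prunes to exactly one partition
SELECT * FROM PartitionTest
WHERE createDate >= '2012-05-01' AND createDate < '2012-06-01';

-- 5.6: explicit partition selection by name
SELECT * FROM PartitionTest PARTITION (May2012);
```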

Question: For a partitioning by month, why not use a LIST-type partition? *From a comment below but made pretty*
Short Answer: DATETIME doesn't support MONTH for pruning.
Long Answer: Pruning on LIST partitions over DATETIME columns is only optimized for TO_DAYS and YEAR, according to the documentation. With MONTH, every partition would be queried.


CREATE TABLE PartitionTest (
   senderId bigint(20) unsigned NOT NULL,
   recipientId bigint(20) unsigned NOT NULL,
   createDate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
   PRIMARY KEY (senderId, recipientId, createDate)
) ENGINE=InnoDB
PARTITION BY LIST (MONTH(createDate)) (
   PARTITION Jan VALUES IN (1),
   PARTITION Feb VALUES IN (2),
   PARTITION Mar VALUES IN (3),
   PARTITION Apr VALUES IN (4),
   PARTITION May VALUES IN (5),
   PARTITION Jun VALUES IN (6),
   PARTITION July VALUES IN (7)
);

| PARTITION_NAME | TABLE_ROWS | PARTITION_EXPRESSION | PARTITION_DESCRIPTION |
|            Jan |        101 |    MONTH(createDate) |                     1 |
|            Feb |        101 |    MONTH(createDate) |                     2 |
|            Mar |        101 |    MONTH(createDate) |                     3 |
|            Apr |          0 |    MONTH(createDate) |                     4 |
|            May |          0 |    MONTH(createDate) |                     5 |
|            Jun |          0 |    MONTH(createDate) |                     6 |
|           July |          0 |    MONTH(createDate) |                     7 |
7 rows in set (0.01 sec)

mysql> DESCRIBE PARTITIONS SELECT * FROM PartitionTest WHERE createDate > '2012-01-01' AND createDate < '2012-02-01';
| id | select_type | table         | partitions                   | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
|  1 |      SIMPLE | PartitionTest | Jan,Feb,Mar,Apr,May,Jun,July | index | NULL          | PRIMARY | 24      | NULL | 307  | Using where; Using index |
1 row in set (0.00 sec)

In summary, I am rolling out partitioned tables over 1.6 TB of data across 10 shards. So far, so good.

