Wednesday, October 29, 2008

INNODB Shared Locks, Exclusive Locks and INSERT INTO SELECT FROM

I wrote an application that is able to send out 3-8 million messages an hour with only 10 CPU's. This application is a part of an Offline Task system that scales linearly.

How is this done, I'll go into detail hopefully at the mySQL conference if they accept my proposal.

The scope of this blog post is to go over building the "Task Queues". Currently I have 13 Queues, one queue for each Shard that I run. The data is federated by user or randomly federated with a GUID that lives as long as the job. A request came in to add 20 million jobs to the queue all at once. The problem is with this list, will I cause deadlocks in innodb as I add the jobs to the queue as one transaction? Can live traffic still write to this queue?

To verify that Deadlocks will not occur - having an understanding about how locks work in INNODB is key. I suggest reading this page.

To build the queue I dumped the data source by

SELECT identifier, 16 INTO OUTFILE "/data/mysql/BuildQueue.log" FROM SOURCE_TABLE WHERE CONDITION.


The isolation level is REPEATABLE-READ; I'm setting a shared lock to get the most current version of the data. Writes are not blocked.

Next:

I create a table on each shard where the queue is located.

CREATE TABLE IF NOT EXISTS OfflineTasksHold (
object_id bigint(20) NOT NULL DEFAULT 0,
object_type smallint unsigned NOT NULL DEFAULT 0,
PRIMARY KEY(object_id,object_type)
) ENGINE=INNODB;



Then I issue a command on each Shard


LOAD DATA INFILE "/data/mysql/BuildQueue.log" INSERT INTO TABLE OfflineTasksHold;
START TRANSACTION;
INSERT INTO OfflineTasks (object_id, object_type) SELECT * FROM OfflineTasksHold;
COMMIT;


Each shard is getting around 4-5 million rows, while accepting real-time traffic of 20-60 tasks a second to the OfflineTasks table. The OfflineTasksHold table does not have any real-time requirements and is solely used to keep the queue in Primary Key order, plus there is a CHANCE that LOAD DATA could set an exclusive lock on the OfflineTasks table-shutting down adding data to the table by the live site. The INSERT sets an exclusive lock on the rows that are being added, so the Offline Task Sheppard - the process that pops tasks off the queue is blocked for a small period - which is acceptable. Why are they blocked? Well, the massive insert sets an Exclusive Lock. The Sheppard is trying to grab the rows that are locked waiting up to 50 seconds, until innodb_wait_timeout is reached. This condition is acceptable. All other inserts are able to go into the queue without a slow down.

In summary, the job queue is built and can be automated with confidence knowing that death to the various app will not occur. Processing slows down for a bit but speed right back up.

Monday, October 20, 2008

mySQL Disk Benchmarks with Sysbench. When will mySQL / INNODB stop scaling?

Today's "commercial quality disks" are amazing but they follow the same limitations as yesterday's disks. mySQL scales very well, but disks do not. So if you're IO bound when will your expectation of speed fail?

Test setup:

DELL 2950 PERC-6 HWRaid BBC 6 DISK 15K RPM 3.5" RAID-10 256K stripe across two channels-using WRITE THROUGH CACHE on mkfs.ext3 -T largefile4 Linux Filesystem.


The theory is that the outer part of the spindles is the fastest, and the inner portion is slower - since the outer is where the data starts (thanks for the info Benjamin Schweizer). Thus one can conclude that the more disk space your application(s) use the slower the throughput, since the heads have to move more. Brad F. my co-worker did a benchmark to prove this. Our goal is to find out at what is the saturation point if our expectation is to have 22 MB / sec of random access.

Why do we want 22MB / sec of random access throughput? We want to guarantee a certain level of performance when adding new apps to a common backend-which is I/O bound: we need to know when things will break.

Here is what Brad found: Total disk size for our RAID-10 setup =~ 800G. What point does it FAIL to achieve our expectations of sustained 22MB/s?

rndrw test across 100G test / 750G LV =~ 35 MB/s # outer part of the spindles
rndrw test across 100G test / 300G LV =~ 32 MB/s # outer part of the spindles
rndrw test across 250G test / 300G LV =~ 24 MB/s # sweet spot
rndrw test across 350G test / 384G LV =~ 21 MB/s # saturation point
rndrw test across 750G test / 800G LV =~ 14 MB/s # waste of space


In conclusion these test show that even though a RAID-10 setup with 800G of space is available, the expected performance drops when data exceeds the sweet spot of 250G-300G of 800G usable-data array.

Disclaimers: There are many factors that can raise or lower the bar, like different file systems, different I/O schedulers, flushing. For my setups I like

Deadline I/O scheduler
256K Stripe
few inodes (don't need them)
ext3 since that’s what stable and available.