Thursday, September 15, 2011

Amazon EBS, MySQL, Disk Throughput and the Dual Edge of Software RAID

Amazon's EBS system is essentially a nice interface to a SAN subsystem that manages the attachment of SAN LUNs. The problem with a SAN, compared to local SAS drives, is latency and the shared controller, which caches IOPS for very distinct load profiles; each load profile gets an "optimized" cache profile from the SAN's redundant controller system. You may be able to attach petabytes of disk, but this system cannot deliver the true throughput of a small set of locally attached SAS drives. The management portion, though, is awesome. I love having the ability to mount more disk, but I rarely need more space, I need speed.

How to get Speed out of Amazon's EBS volumes: Software RAID it!
mdadm --create /dev/md1 -v --raid-devices=8 --chunk=256 --level=raid10 /dev/xvdk /dev/xvdl /dev/xvdm /dev/xvdn /dev/xvdo /dev/xvdp /dev/xvdq /dev/xvdr

Take 8 EBS 125 GB volumes and create a RAID 10 array with a 256KB chunk size. After various mind-numbing benchmarks I found that 256K is a good sweet spot. Feel free to do your own benchmarks; a rough sketch of one follows below. The results have to be interpreted carefully because of the nature of using a shared resource.
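For completeness, the follow-up steps look roughly like this. The ext4 filesystem and the /data mount point here are just placeholders, not a prescription; use whatever fits your setup:

# Persist the array definition so it reassembles on boot (Amazon Linux keeps this in /etc/mdadm.conf)
mdadm --detail --scan >> /etc/mdadm.conf

# Create a filesystem and mount it (ext4 and /data are placeholders)
mkfs.ext4 /dev/md1
mkdir -p /data
mount /dev/md1 /data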

What I end up with is a 500GB partition, and I am roughly able to get around 22-25 MB per second of random I/O from 20 threads. Compare this to an 8-disk 15K RPM PERC-6 2.5" SAS system, where I am able to get around 44 MB per second at a constant 1-2 ms response time for the same physical space. EBS volume response times per IOP range from 6 ms to 200 ms. This sucks. Note: these numbers are based on random I/O at a 16KB page size (4 IOPS per block write), which is what InnoDB uses, not sequential I/O.
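If you want to run your own random-I/O benchmark against the array, something along these lines with fio is a reasonable starting point. The target directory, file size, and runtime below are placeholders, not the exact workload behind the numbers above:

# Random 16KB writes from 20 threads, bypassing the page cache.
# Directory, size, and runtime are placeholders to adapt.
fio --name=ebs-rand --directory=/data --rw=randwrite --bs=16k \
    --numjobs=20 --size=1g --direct=1 --ioengine=libaio \
    --runtime=300 --time_based --group_reporting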

Here are some iostat numbers from a live box with this configuration:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.83    0.00    1.75   22.32    0.08   74.01

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdh              0.00     0.00    0.00    1.00     0.00     8.00     8.00     0.00    0.00   0.00   0.00
xvdk              0.00     0.00   34.40   26.40  1100.80  1503.20    42.83     0.49    8.01   6.39  38.88
xvdl              0.00     0.00   13.20   26.40   422.40  1503.20    48.63     0.27    6.71   4.38  17.36
xvdm              0.00     0.20   32.40   27.00  1036.80  1524.20    43.11     0.30    5.13   4.19  24.88
xvdn              0.00     0.20    9.40   27.00   300.80  1524.20    50.14     0.15    4.11   2.48   9.04
xvdo              0.00     0.00   30.20   27.40   968.00  1496.80    42.79     0.45    7.76   6.56  37.76
xvdp              0.00     0.00   14.60   27.40   478.40  1496.80    47.03     0.22    5.26   3.92  16.48
xvdq              0.00     0.00   31.20   25.60   998.40  1501.60    44.01     0.38    6.73   5.32  30.24
xvdr              0.00     0.00    9.80   25.60   313.60  1501.60    51.28     0.16    4.50   2.35   8.32
md1               0.00     0.00  174.80   98.60  5606.40  6009.80    42.49     0.00    0.00   0.00   0.00
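For reference, extended per-device statistics like the above come from iostat's extended mode; something like the following should produce comparable output (the 5-second interval is just an example, not what produced the sample above):

# Extended device stats, refreshed every 5 seconds
iostat -x 5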

So, now that I have acceptable speed, what is the drawback? A weekly cron job that runs a check across the RAID array. On Amazon's EBS system it cuts my throughput in half.

For my Amazon Linux system, the cron job is located at:
-rwxr-xr-x 1 root root 2770 Jan 16  2011 /etc/cron.weekly/99-raid-check

It essentially runs

echo check > /sys/block/md1/md/sync_action

Yet, the check lasts for around 9,000 minutes, or 6.25 days! Thus I will only have 0.75 days of full throughput each week.

So to stop this I must run
echo idle > /sys/block/md1/md/sync_action

I do not recommend turning off the check; it's needed. Now to find a way to make this check happen faster.
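One knob worth experimenting with (not something the stock check script touches) is the kernel's md resync/check speed limits; raising them should let the check finish sooner, at the cost of more I/O pressure while it runs. The values below are illustrative assumptions, not settings I have validated on this box:

# System-wide floor/ceiling for md resync/check speed, in KB/s per device
# (defaults are typically 1000 / 200000). Numbers here are examples only.
echo 50000  > /proc/sys/dev/raid/speed_limit_min
echo 400000 > /proc/sys/dev/raid/speed_limit_max

# A per-array ceiling can also be set:
echo 400000 > /sys/block/md1/md/sync_speed_max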
