Wednesday, February 01, 2012

Basic ETL with Gearman and MySQL in a few lines of PHP code

Gearman is awesome. If you do not know what it is, it's a queue and load-balancing system for an arbitrary number of workers, which enables distributed computing across many nodes. Some of the same guys who worked on the mySQL source worked on Gearman.

Feel free to search my blog for other Gearman uses.

The Problem:

We store a lot of stats, make a lot of changes, and want to see the result of the stats in realtime. Our stat system is pretty slick. For each tag increment, the application increments a count and groups the tag by minute, hour and month, using a numeric hash of the tag text for compact writes. This means 1 tag write produces 4 SQL statements. We track over 239,211 distinct tags at around 10K writes per second on a single mySQL instance on EC2: an extra-large instance on RAID-10 EBS (because EBS mirrors internally I could just use RAID-0, but I was too scared).

Once the mySQL instance hits disk (EBS) our throughput becomes very unstable, possibly slowing down the site. The solution was to defer these writes, but how can I do it without building a logging system, an aggregator and a loader, and having a bunch of moving parts? Really, I only want to spend 10 mins on this problem and use existing monitoring code. So my 10 min solution:

 3 mins to write the code
 6 mins testing
 55 seconds of patting myself on back
 5 seconds to deploy

Solution Detail:

Gearman workers connect to the gearmand server's job queue and loop for more jobs. This means the program stays in memory for the life of the process (until the worker restarts), so I can transform the data in application memory. Since the program is persistently connected to the DB, I can periodically load the data in chunks.

Instead of having 100s of possible concurrent connections doing writes, I can control the writes based on the number of workers. InnoDB is very fast and consistent at low levels of concurrency (fewer than 50).

Since I can drain the queue from gearmand and transform the data locally, I do not really need to worry about running out of memory on the queue server. The consumer is faster than the producer. I can combine 1000s of writes into a single write.

Let's look at some code:
<?php
    require_once("includes/config.php");
    require_once("includes/DB/EventTrackerDB.php");


    class EventTrackerETL {
       
        //
        // keep track of distinct tags
        //
        public static $eventTable  = array();

        //
        // the next flush
        //
        public static $nextWrite   = 0;

        //
        // keep a stat of total writes
        //
        public static $totalEvents = 0;

        //
        // max number of distinct tags held in memory before a forced flush
        //
        const MAX_NUMBER_OF_EVENTS = 30000;

        //
        // number of seconds between time-based flushes
        //
        const FLUSH_INTERVAL = 20; // seconds

        /*
         * transform all the tags into a sum of the counts entered
         *  @param string $event    - the tag being incremented
         *  @param int $count       - the supplied count; most of the time it's just 1
         *  @param int $timeEntered - epoch timestamp (not used by this method)
         *  @return void
         */
        public static function transform($event, $count, $timeEntered) {

            //
            // initialize or increment a tag
            //
            if (isset(self::$eventTable[$event])){
                self::$eventTable[$event] += $count;
            } else {
                self::$eventTable[$event] = $count;
            }
            
            //
            // flush if we hit the max number of events
            //
            if (sizeof(self::$eventTable) > self::MAX_NUMBER_OF_EVENTS){
                return self::load();
            }

            //
            // flush if it's our time
            //
            if (self::$nextWrite < time()){
                return self::load();
            }
            return;
        }

        /*
         * flush the stored tags to the database
         *  @returns void
         */
        protected static function load() {

            //
            // write transformed events to the db
            //
            $thisRun = 0;
            foreach(self::$eventTable as $event => $sum) {
                EventTrackerDB::singleton()->updateEvent($event, $sum);
                $thisRun++;
                self::$totalEvents++;
            }

            $msg = "EventTracker write complete $thisRun events this run and a total of " . self::$totalEvents . " events written so far";
            Debugger::log("OT", $msg);

            //
            // re-init
            //
            self::$nextWrite = time() + self::FLUSH_INTERVAL;
            self::$eventTable = array();
        }
    }
?>
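
A minimal sketch of the worker side, to show where transform() gets called. This assumes the PECL gearman extension; the job name 'event_track', the JSON workload shape, and the helper names decodeEvent() and runEventWorker() are illustrative assumptions, not the production code.

```php
<?php
// Sketch only: job name, workload format and helper names are assumptions.

/**
 * Decode one job workload into array(event, count, timeEntered).
 * Assumes the producer sends JSON like {"event":"tag","count":1,"time":1328054400}.
 * Returns null when the payload is unusable so the worker can skip it.
 */
function decodeEvent($workload) {
    $data = json_decode($workload, true);
    if (!is_array($data) || !isset($data['event'])) {
        return null;
    }
    $count = isset($data['count']) ? (int) $data['count'] : 1;
    $time  = isset($data['time'])  ? (int) $data['time']  : time();
    return array((string) $data['event'], $count, $time);
}

/**
 * Register the handler and loop for jobs. Requires the PECL gearman
 * extension and the EventTrackerETL class shown above.
 */
function runEventWorker($host = '127.0.0.1', $port = 4730) {
    $worker = new GearmanWorker();
    $worker->addServer($host, $port);
    $worker->addFunction('event_track', function (GearmanJob $job) {
        $decoded = decodeEvent($job->workload());
        if ($decoded !== null) {
            list($event, $count, $time) = $decoded;
            EventTrackerETL::transform($event, $count, $time);
        }
    });
    // work() blocks until a job arrives; the static state in
    // EventTrackerETL survives between jobs because the process stays up.
    while ($worker->work());
}
```

One thing to keep in mind with this shape: the flush check lives inside transform(), so an idle worker holds its last batch in memory until the next job arrives or the shutdown handling kicks in.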

In summary, with Gearman I am able to process 250K events in seconds. The queue never builds up, and there is special code to handle kills (not SIGKILL, which cannot be caught).

Tuesday, January 17, 2012

mySQL Column Types and Why it Matters.

MySQL is awesome at converting strings to integers when comparing column lvalues with converted rvalues. So much so that many of us take this fact for granted. When does this assumption break down? When does passing in the wrong value cause problems in mySQL? Let's take a table EmailLookup for example.
CREATE TABLE `EmailLookup` (
  `userId` bigint(20) unsigned NOT NULL,
  `email` varchar(128) NOT NULL,
  `emailCrc32` int(11) unsigned NOT NULL,
  `createDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`emailCrc32`,`userId`),
  KEY `createDate` (`createDate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The primary key is (emailCrc32, userId), for a key size of 12 bytes (4 bytes for the int, 8 bytes for the bigint). Since this is a compound key (a key with two columns), the leftmost-prefix rule gives me two index lookups for the price of one: emailCrc32 plus userId is a unique index lookup, while emailCrc32 alone is also an index lookup. Thus I can do
SELECT email FROM EmailLookup WHERE emailCrc32 = ? and userId =?
OR
SELECT email FROM EmailLookup WHERE emailCrc32 = ?

What happens if I pass emailCrc32 a numeric string? For example:
mysql> SELECT email FROM EmailLookup WHERE emailCrc32 = '1';

Empty set (0.00 sec)
So cool, it works and comes back super quick. What happens if I pass emailCrc32 a real string? For example:
mysql> SELECT email FROM EmailLookup WHERE emailCrc32 = 'a';
Empty set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+---------------------------------------------------------------+
| Level   | Code | Message                                                       |
+---------+------+---------------------------------------------------------------+
| Warning | 1366 | Incorrect integer value: 'a' for column 'emailCrc32' at row 1 |
+---------+------+---------------------------------------------------------------+
1 row in set (0.00 sec)
Comes back quick. What happens if I pass the column a big int?
mysql> SELECT email FROM EmailLookup WHERE emailCrc32 = 100003256490710;
Empty set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------+
| Level   | Code | Message                                             |
+---------+------+-----------------------------------------------------+
| Warning | 1264 | Out of range value for column 'emailCrc32' at row 1 |
+---------+------+-----------------------------------------------------+
1 row in set (0.00 sec)
Comes back quick. But, what if I do a DELETE on a 64-bit server?
mysql> DELETE FROM EmailLookup WHERE emailCrc32 =  '100003256490710';

 Query OK, 0 rows affected (2 min 14.16 sec)
WHAT? A 2 min query for an impossible DELETE? Notice that '100003256490710' is a string. What is happening inside InnoDB?
mySQL thread id 90620794, query id 3745441316 10.170.22.169 schoolfeed updating
DELETE FROM EmailLookup WHERE emailCrc32 = '100003256490710'
TABLE LOCK table `Shard1`.`EmailLookup` trx id 2AB524B6D lock mode IX
RECORD LOCKS space id 160 page no 5626 n bits 248 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 7580 n bits 240 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 6039 n bits 280 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 455 n bits 352 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 3174 n bits 288 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 5997 n bits 304 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 1486 n bits 296 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 5607 n bits 280 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
RECORD LOCKS space id 160 page no 2729 n bits 312 index `PRIMARY` of table `Shard1`.`EmailLookup` trx id 2AB524B6D lock_mode X locks rec but not gap
TOO MANY LOCKS PRINTED FOR THIS TRX: SUPPRESSING FURTHER PRINTS
Yikes, this is bad. The lock output suggests the DELETE walked the primary key, locking records page after page, instead of doing a single index lookup. Is this a bug? Maybe, but it is also a condition that should not happen if types are respected. The moral of the story: if your application respects column types, mySQL will respect you :) This is from Server version: 5.1.57-rel12.8-log Percona Server (GPL), 12.8, Revision 233
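
One way to respect the column type at the application edge is to check the value before it ever reaches the query. A minimal sketch, assuming a 64-bit PHP build; the helper name toUnsignedInt32() is made up for illustration and is not part of any library.

```php
<?php
// Hypothetical helper: returns an int that fits INT UNSIGNED, or null otherwise.
function toUnsignedInt32($value) {
    if (is_int($value)) {
        $value = (string) $value;
    }
    // Accept only strings made of digits; rejects 'a', '1.5', '-1' and ''.
    if (!is_string($value) || !ctype_digit($value)) {
        return null;
    }
    // Drop leading zeros, then refuse anything too long to be a 32-bit value
    // before casting, so the cast itself can never overflow.
    $digits = ltrim($value, '0');
    if (strlen($digits) > 10) {
        return null;
    }
    $n = (int) $digits; // '' (all zeros) casts to 0
    return ($n <= 4294967295) ? $n : null;
}
```

If the helper returns null there is no row that could match, so the caller can skip the query entirely rather than handing mySQL a value the column can never hold.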

Tuesday, December 20, 2011

The Effect of using Cloudfront and why it matters

For years (12+) I have been building systems on every tier of the web. Everything from low-level OS optimizations, mySQL internals, interpreted language performance tricks to static content optimization.

Building a CDN is easy, but what makes Akamai or Cloudfront attractive is the presences, known as edge nodes, that they have around the world to serve your content from the point closest to the requester.

Their smart DNS servers send people to the closest edge node to serve content. This is great for serving JavaScript, images, CSS (and video) because it's static.

Here is a good example. Your system can serve content in less than 5ms if the network is not involved. With network overhead, that content is served in 10ms if you are close to the DC (say about 1200 miles). Yet your users on the east coast (assuming your DC is on the west coast), or better yet your users in Europe, see this content in 355ms. At around 300ms or so users start to notice the lag, and the longer the lag, the more likely the user is to bounce. People hate waiting. Now, do you optimize the backend to serve the content faster, or do you put the content closer to the end user?

Put the content closer to the end user to reduce the 350ms back down to 10ms. This is what Cloudfront, an Amazon product, does for you. Here is a good wiki page to set up Cloudfront. I've expanded on this to add some Apache mod_rewrite rules to automate cache invalidation so I don't have to call an API to purge the CDN cache. Below are the mod_rewrite rules:


    RewriteRule ^/static/(\d+)/(.*)? /static/$2 [NC,QSA,L]
    RewriteRule ^/static/(\w+)/(\d+)/(.*)? /static/$1/$3 [NC,QSA,L]
    RewriteRule ^/static/(\w+)/(\S+)/(\d+)/(.*)? /static/$1/$2/$4 [NC,QSA,L]

what this says is 

given a url

http://domain/static/12345/main.js

serve from

http://domain/static/main.js

The dynamic url is generated by taking the abs crc32 of the contents of each file. So if a file changes on disk, so does the url, breaking the cache and forcing Cloudfront to refetch the content to display to the user. All this is calculated once during the deploy process.


For instance

http://d1wuzpn2rb4qzi.cloudfront.net/static/jquery/240184024/jquery.min.js

http://d1wuzpn2rb4qzi.cloudfront.net/static/jquery maps to http://your.schoolfeed.com/static/jquery

240184024 is the cache breaker, produced by doing this during the deploy process:

$hashes[$file] = abs(crc32(file_get_contents($file)));

then generating php code that is global to the templates

$str = "";

Now when building the reference link

<script type="text/javascript" src="{cloudfront file='/static/jquery/jquery.min.js'}"></script>

{cloudfront} is a Smarty function that takes the input file and splits the directory, putting 240184024 into the path for the Cloudfront url.
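
The path splitting can be sketched roughly like this. It is a stand-in, not the actual Smarty plugin: the function name cloudfrontUrl(), its parameters and the sample map are assumptions made for illustration.

```php
<?php
// Sketch only: function name, parameters and the sample map are assumptions.

/**
 * Build a cache-busting CDN url by inserting the file's hash between the
 * directory and the file name. $hashes maps a site-relative path to the
 * value computed during deploy.
 */
function cloudfrontUrl($file, array $hashes, $cdnBase) {
    if (!isset($hashes[$file])) {
        // Unknown file: fall back to the unversioned path.
        return $cdnBase . $file;
    }
    $dir  = dirname($file);   // e.g. '/static/jquery'
    $name = basename($file);  // e.g. 'jquery.min.js'
    return $cdnBase . rtrim($dir, '/') . '/' . $hashes[$file] . '/' . $name;
}

$hashes = array('/static/jquery/jquery.min.js' => 240184024);
echo cloudfrontUrl('/static/jquery/jquery.min.js', $hashes, 'http://d1wuzpn2rb4qzi.cloudfront.net');
// prints http://d1wuzpn2rb4qzi.cloudfront.net/static/jquery/240184024/jquery.min.js
```

The rewrite rules above then strip the numeric segment again on the origin, so the file on disk never has to move.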



So what has this done on the system?


Around the 5pm hour we see a drop in www_accesses; that's due to switching to Cloudfront. That is nearly 60% in savings. schoolFeed has a lot of JavaScript files. There is still another optimization that can be done: grouping JavaScript files together to reduce the number of GETs.


Here we see a 35% drop in bytes out as a result of the change.

And the effect on mySQL is about 3-5% more traffic on the backend, as users stick around longer since things are snappier.



Always keep this in mind. As one tier becomes faster or more performant the other tiers should have the capacity to keep up with demand.


Tuesday, November 22, 2011

Using live code interrupts to produce stats which in turn improves code

How do you know that your code is fast? Is it fast for your test cases or is it fast for every case? When changes are made, how does that affect your customers? How do you know over a period of time if the system is faster or slower?

The same stat system which is used to track new installs, viral clicks, impressions, rates,  funnels, page flows, gauges, counters, etc is also used to let me know how fast code blocks are performing. How is this done?

Since a Front Controller design pattern is used for my AJAX calls, I am able to wrap all calls in time deltas to produce a centralized stat on how long all service calls are taking. So, for each code change that is pushed, I can see if that change slowed something down or broke something altogether.

Here is the setup. Each service call response time is placed into a bucket: less than 200ms, 201ms to 500ms, 501ms to 1 second, 1.001 seconds to 2 seconds, and greater than 2 seconds. Anything over 500ms is bad.
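
A minimal sketch of the bucketing, assuming the delta is measured in milliseconds. The bucket labels and the function names responseBucket() and timedCall() are illustrative, and I put exactly 200ms in the first bucket since the ranges above leave that edge open.

```php
<?php
// Sketch only: bucket labels and function names are illustrative assumptions.

/** Map an elapsed time in milliseconds to one of the five buckets. */
function responseBucket($ms) {
    if ($ms <= 200) {
        return 'lte_200ms';
    }
    if ($ms <= 500) {
        return '201_500ms';
    }
    if ($ms <= 1000) {
        return '501_1000ms';
    }
    if ($ms <= 2000) {
        return '1001_2000ms';
    }
    return 'gt_2000ms';
}

/** Run a service call, time it, and return array(result, bucket). */
function timedCall($callable) {
    $start  = microtime(true);
    $result = call_user_func($callable);
    $ms     = (microtime(true) - $start) * 1000;
    return array($result, responseBucket($ms));
}
```

In the front controller the returned bucket name would simply be incremented as a tag in the stat system, the same way installs or clicks are.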


On the 13th of November I see that service calls in the 2 second+ bucket are on the rise. Now the next step is to look at independent data sources to determine whether this is a system issue or a code issue. This is the tricky part, because code issues can cause system issues. I use ganglia to separate out the system stats from the code interrupt stats and interpret the results.

Now, the system stats I look at to help me drill down to the issue are things that encapsulate the problem I am investigating. For instance, service calls are a product of CPU and memory usage on the www tier. These service calls use network resources to talk to databases, memcache and queues. Thus I ask myself: what are some high-level system metrics to help me figure out if there is a system issue? Well, apache has a lot of high-level stats like busy workers and requests per second. Busy workers are a product of memory, CPU and network resources. This is a good stat.




Here busy workers are seen to be increasing but not all that much. What is the slow down? Let's look at another ganglia stat - Requests per Second.




Wow, requests per second are going up while busy workers (above) are growing only slightly, with hiccups. Thus what I need to look at is the data tier, since the wwws are scaling well (higher request rates with the same www busy rate).

Unfortunately the graphs for that are plagued with huge spikes caused by long-term issues in RRD, so they are not showable. Long story short, the increase in requests came from a sudden spike of new and returning users, causing more concurrency on the backend and exposing memcache evictions because of more active users. Thus I added more memcache servers and scheduled a rebalance of database data onto new servers; this should lower service times on average without code changes.

In summary, realtime code insights are just as valuable as knowing the number of installs or clicks from emails. You can do all sorts of stuff, like drawing custom dots for code deployments that correlate with response times. Having this data is invaluable for keeping your site fast at very little cost while giving you knowledge of the entire system.

Tuesday, October 25, 2011

Handling the Hockey Stick Growth

The term hockey stick is used to describe the effect of an app that suddenly goes viral. Take a look at the graph to the left. There is modest growth and suddenly the app goes viral and takes off. It looks like a hockey stick, hence the term.

This article is briefly going to touch on how to handle sudden growth at the lowest cost possible, using a site that I helped build: schoolFeed.com, a social network that reconnects classmates for free.

The main features of schoolfeed.com are reconnecting classmates, ensuring that each classmate is well connected, a feed to keep classmates in touch with one another, and the interests that classmates share. Additionally there is a photo experience to share with your online yearbook, and more features to come.

To handle the growth, enable rapid feature development, keep the site up without waking me up, and keep it cheap, a set of structures needs to be put in place.

Suggestion #1: Keep the system architecture simple
The architecture consists of PHP on the front end, Memcache to front database queries, a database on the backend, a queue service (Gearman) to handle offline processing in parallel, and finally SendGrid to handle mail.
Suggestion #2: Keep the development environment simple
The development environment did not start off too abstracted. A simple MVC model is used, where the Model fronts the PDO database objects structure. The Controller is the service layer, which follows the Front Controller design pattern, plus the PHP entry points that handle the model inputs. The View is in Smarty, because keeping the presentation layer separate from the business logic is pivotal. Additionally, this View is separated enough to replace Smarty and/or internationalize the strings in the future. Also, jQuery is used to make life so much simpler when supporting IE.
Suggestion #3: Monitor everything
I use Nagios for alerting (Icinga), Ganglia for Trending, and a custom stat system backed by mySQL for reporting on code interrupts, click through rates, feature adoption, K-Factor,  DAU per feature, MAU, WAU, Facebook Platform Health, site response time, site api response time, email send rate.
Suggestion #4: Design every layer to be distributed.
If I run out of apache threads, I add more www servers. If my memcache eviction rate is too high, I add more memcache servers. If I need more database transactions per second, I add more database servers. Each layer is controlled from a config file, enabling rapid deployment of servers to handle spikes in traffic. Since the database connection logic is controlled by the application, I drop a definition in an array and new traffic starts hitting a new database server. If an existing database server is loaded too much and I need to move data off of it, I take an xtrabackup of the server, replicate it to a new server, set the pointer for a % of that traffic to the new server, and clean up the old data on the original server. Or I can migrate individual entities. An entity is a user/school/interest/url/facebook id/etc.
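
To make the "drop a definition in an array" idea concrete, here is a rough sketch of config-driven routing. The config layout, the host names, the 100-slot scheme and the function name dbHostFor() are all assumptions for illustration; the real connection logic is not shown in this post.

```php
<?php
// Sketch only: config layout, host names and function name are assumptions.

// Each entry owns an inclusive slice of the 0-99 slot space, so moving a
// percentage of traffic means editing the ranges and adding a host.
$dbConfig = array(
    array('host' => 'db1.internal', 'from' => 0,  'to' => 49),
    array('host' => 'db2.internal', 'from' => 50, 'to' => 99),
);

/** Pick the database host for an entity by hashing its id into 100 slots. */
function dbHostFor($entityId, array $config) {
    $slot = abs(crc32((string) $entityId)) % 100;
    foreach ($config as $shard) {
        if ($slot >= $shard['from'] && $slot <= $shard['to']) {
            return $shard['host'];
        }
    }
    throw new RuntimeException("No shard configured for slot $slot");
}
```

The same entity always lands on the same host for a given config, which is what makes it safe to copy a slice of data to a new server first and flip the ranges afterwards.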

Suggestion #5: Don't optimize too soon.
The goal is to make each feature super fast, but building a super abstracted layer to support 1000s of devs is not necessary when you only have 10s of devs :). Please don't interpret this as me advocating being sloppy; I'm saying it's cool to allow your team to interact with SQL and write their own :). Additionally, building custom servers to handle specific tasks, or changing languages to get a specific feature, is really not necessary in the beginning. Supporting the product and ensuring that features do not take more than 200ms to generate, or weeks to build, should be the focus to enable the hockey stick. In the early stage of a hockey stick, technology is rarely the cause of the growth; it's building what your users may want, and when you're wrong, throwing that stuff away and actually building what they want. A helpful tool is to build features in a way where the feature, or parts of it, can be turned off with a config change. This will save you a ton of headaches without having to take the site down, while enabling you to push code out quickly and watch whether it's being adopted prior to optimizing.

Suggestion #6: Plan for things to break and set up procedures to handle outages
Things will break. The goal is to hide this fact from users or inconvenience them as little as possible. Schedule maintenance windows to fix the heavy stuff. Have a playbook to handle outages; if the play does not exist, write the play down. Finally, automate common tasks. Remember, if you don't want any user to experience an outage, that costs a lot of money. Redundancy is expensive. Multiple redundancy in multiple datacenters is even more expensive.
I hope these steps help you in your projects in the future. I have had the pleasure of handling multiple hockey sticks, and following a basic rule/suggestion set has helped me each time. The end goal really is to give a great experience to your users, build a clean environment for your devs with your devs' input, and improve the product rapidly.

Some stats: with 3-5 web servers, 2 job boxes and 2 database servers we are able to handle well over 100K DAU.

Thursday, October 20, 2011

Facebook should launch FBCloud and compete directly with Amazon, Google and others



Another "Facebook should do X" post by someone outside of Facebook; but it’s a moneymaker that Facebook has not tried and probably has the best chance of succeeding at (not like deals, har har, jab, jab). Some of the best DevOps people work at Facebook. Facebook knows scale. Facebook knows system management. Facebook built the most advanced data-center on the planet. This stuff is great; it shows that they can do it, but what’s the motivation for the app developer to deploy in a Facebook cloud? Simply put, latency. This is the real issue for me: really a selfish desire to have my app move as fast as Facebook's apps while using Facebook data; a seamless integration, if you will. For Facebook, it’s good because they can help me make my users happier while making tons of cash. If my app is in the same data-center as the data, my app can move faster, thus giving my users a better experience.

Here is an example. When doing a graph call for Facebook friends, the backend systems can do it in ms time yet the JSON reaches the caller in the 100ms time frame over the WAN from my servers in EC2-west1c to Facebook Servers in Oregon. If I'm in the data-center that holds the data (Oregon) my app speeds up 10 times, since that 100ms R(t) turns to 5-10ms.

Additionally, Facebook houses some of the most advanced tech that lots of people around the web use, such as memcache. Facebook could manage that for you. In fact they have petabytes of memory for their own app, with automatic key management between DCs (wow). Offer Facebook-hosted MySQL with Flash Cache for high-density IOPS. Each Facebook DB server has solid state disks; use those to buffer IOPS for developers subscribed to FBCloud. With their tools to automatically migrate data to another server, building new instances would be a snap without having to use a SAN. Although you could use SSDs to buffer SAN writes/reads for easier management with great R(t). Facebook stats on HBase, Facebook Varnish, a fast CDN: offer everything that they do as a commercial product. I've seen their tools; they are better than enterprise quality and nearly all of them have an API. Facebook culture is platform focused. I assume if you don't build an API for your tool you're mocked.

How would Facebook make money? Charge on CPU resources just like Amazon. Charge on IOPS, charge on managed Memcache size, charge on Data size. Charge for BCP. With this adding to Facebook's platform, Facebook could make money on the Front End from Ads, on currency and finally on the API indirectly while giving the End User an entire platform guaranteed to be fast and redundant in multiple data-centers.

I have an entire vision that would make a ton of cash, but really would make my users and me as a developer happier.

PS This post went out too fast, with grammar and spelling mistakes. Should be fixed now, my apologies.

Wednesday, September 21, 2011

Stump the Murph: ulimit, pam and linux

There is a game that a small group of friends and I have been playing since my Friendster years. It's called Stump the Murph. Basically, if there is some weird problem that we can't figure out (mainly in Linux, but it can be in a variety of subjects), we pass it to one of our friends, Kevin Murphy. In 8 years I believe I stumped him once, but I can't remember what it was, so it doesn't count.

Here is the problem

SQLSTATE[HY000] [1135] Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug


"Obviously" this means you need to raise the ulimit for the process running mysql. I say "obviously" because this error can mean different things. In most cases it means that the server ran out of memory. perror 11 says "OS error code 11: Resource temporarily unavailable", yet when there is enough memory there may be a pam_limits issue. In my case there is.




So I did the following


in /etc/security/limits.conf I added this


mysql   soft    nofile  10240
mysql   hard    nofile  1537454
mysql   soft    nproc   32768
mysql   hard    nproc   65535

yet when I test the changes with su - mysql I get
 

su: pam_limits(su-l:session): Could not set limit for 'nofile': Operation not permitted

So my next course of action is to check

/etc/pam.d/system-auth

wait a second it has

session required pam_limits.so
and

/etc/pam.d/su calls

session         include         system-auth

thus I don't need to add session required pam_limits.so

Now the game of Stump the Murph begins:

In about 1/2 hour Murph figured out the solution! He deduced that since 

cat /proc/sys/fs/file-max
1537454

you can't set the hard limit of nofile to 1537454, because in theory a single user could starve the kernel of file descriptors. So, on Murph's suggestion, I set the hard limit to half of file-max:

mysql      soft    nofile  10240
mysql      hard    nofile  768727
mysql      soft    nproc   32768
mysql      hard    nproc   65535

Thanks Murph!

Thursday, September 15, 2011

Amazon EBS mySQL, Disk Throughput and the Dual Edge of Software Raid

Amazon's EBS system is just a nice interface to a SAN subsystem, which manages the attachments of SAN LUNs. The problem with a SAN when compared to local SAS drives is latency and the shared controller, which caches IOPS for very distinct load profiles. Each load profile has an "optimized" cache profile from the SAN's redundant controller system. You may be able to attach petabytes of disk, but this system cannot match the true throughput of small locally attached SAS drives. Now, the management portion is awesome. I love having the ability to mount more disk, but I rarely need more space; I need speed.

How to get Speed out of Amazon's EBS volumes: Software RAID it!
mdadm --create /dev/md1 -v --raid-devices=8 --chunk=256 --level=raid10 /dev/xvdk /dev/xvdl /dev/xvdm /dev/xvdn /dev/xvdo /dev/xvdp /dev/xvdq /dev/xvdr

Take 8 EBS volumes of 125 GB each and create a raid10 array with a 256KB chunk size. After various mind-numbing benchmarks I found that 256K is a good sweet spot. Feel free to do your own benches. The results have to be interpreted with care because of the nature of using a shared resource.

What I end up with is a 500GB partition, and I am able to get roughly 22-25 MB per second of random I/O from 20 threads. By comparison, on an 8-disk 15K RPM 2.5" SAS system behind a PERC-6 I am able to get around 44 MB per second at a constant 1-2 ms response time for the same physical space. EBS response times per IOP range from 6ms to 200ms. This sucks. Note: these numbers are based on random I/O with a 16KB page size (4 IOPS per block write), which is what InnoDB uses, not sequential I/O.

Here are some iostat numbers from a live box with this configuration
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.83    0.00    1.75   22.32    0.08   74.01

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdh              0.00     0.00    0.00    1.00     0.00     8.00     8.00     0.00    0.00   0.00   0.00
xvdk              0.00     0.00   34.40   26.40  1100.80  1503.20    42.83     0.49    8.01   6.39  38.88
xvdl              0.00     0.00   13.20   26.40   422.40  1503.20    48.63     0.27    6.71   4.38  17.36
xvdm              0.00     0.20   32.40   27.00  1036.80  1524.20    43.11     0.30    5.13   4.19  24.88
xvdn              0.00     0.20    9.40   27.00   300.80  1524.20    50.14     0.15    4.11   2.48   9.04
xvdo              0.00     0.00   30.20   27.40   968.00  1496.80    42.79     0.45    7.76   6.56  37.76
xvdp              0.00     0.00   14.60   27.40   478.40  1496.80    47.03     0.22    5.26   3.92  16.48
xvdq              0.00     0.00   31.20   25.60   998.40  1501.60    44.01     0.38    6.73   5.32  30.24
xvdr              0.00     0.00    9.80   25.60   313.60  1501.60    51.28     0.16    4.50   2.35   8.32
md1               0.00     0.00  174.80   98.60  5606.40  6009.80    42.49     0.00    0.00   0.00   0.00

So, now that I have acceptable speed, what is the drawback? A weekly cron job that runs a check across the raid array. On Amazon’s EBS system it cuts my throughput in half.

For my Amazon Linux system the cron job is located at
-rwxr-xr-x 1 root root 2770 Jan 16  2011 /etc/cron.weekly/99-raid-check

It essentially runs

echo check > /sys/block/md1/md/sync_action

Yet the check lasts for around 9000 min, or 6.25 days! Thus I only have 0.75 days of full throughput each week.

So to stop this I must run
echo idle > /sys/block/md1/md/sync_action

I do not recommend turning off the check; it's needed. Now to find a way to make this check happen faster.