mySQL DBA, Architecture, Dev, Scale, HA, Code : Asynchronous Queries verses Synchronous Queries

In a procedural language without the use of threads (or Inter Process Communication via forks), to execute I/O requests they are done one after another. Synchronous Queries produce at best a Big-O of N such that N is an element of I/O communication (queries) and N equals the number of queries needed to achieve the requested dataset.
With IPC or threads we can speed up common O(N) problems to reduce the N with parallelism, its still functionally a O(N) yet from a single instance point of view N is much less because threads (IPC) takes that Serial computing component and executes the code in parallel. To better explain what I am talking about lets look at some PHP code:


foreach($friends as $friend){
     $data[] = getMySQLData(“SELECT * FROM AccountData WHERE userid = $friend);
}

The Primary key for the AccountData table is userid. Assuming that you have 5000 friends, the query has to be executed 5000 times.
We can reduce the O(N) and change it to a O(nlogn) (Binary Tree - doesn't take into account other factors) by switching the query to


$data = getMYSQLData(“SELECT * FROM AccountData WHERE userid IN (….)”);

We just sped up the retrieval of the data significantly, yet we just introduced a bottle neck on the datalayer. Our architecture requires that the data is located in a single location.

What if AccountData’s data is spread across many servers federated by userid? This means that userid belongs to a server, so the server contains a shard of the AccountData’s Data.

Now we are back to a O(N) where each query needs to be executed on the corresponding shard. The logical next step is to group queries per shard and run across them all. For instance


$multiShardIDs = $genericShard->getMultipleShardIDs($objIds);
foreach ($multiShardIDs as $shardID => $shardUserIDs) {
           if (stripos($orgQuery, " WHERE ") !== false){
               $query = $orgQuery." AND {$column} IN (".implode(',', $shardUserIDs).") ";
           }               
           else{
               $query = $orgQuery." WHERE {$column} IN (".implode(',', $shardUserIDs).") ";               
           }

           $shard_to_sql[$shardID] = $query;

           .... more stuff ....

Yet this is still a O(N) its just that N is smaller. Each query is still executed serially.
Let’s look at some stats of synchronous queries of SELECT 1; This query is executed across 35 shards and the timings are from PHP point of view.

Field	End Value	Start Value	Delta
ru_oublock	0	0	0
ru_inblock	0	0	0
ru_msgsnd	0	0	0
ru_msgrcv	0	0	0
ru_maxrss	0	0	0
ru_ixrss	0	0	0
ru_idrss	0	0	0
ru_minflt	9872	9865	7
ru_majflt	0	0	0
ru_nsignals	0	0	0
ru_nvcsw	11344	11114	230
ru_nivcsw	977	968	9
ru_nswap	0	0	0
ru_utime.tv_usec	865054	849053	16001
ru_utime.tv_sec	16	16	0
ru_stime.tv_usec	556097	552097	4000
ru_stime.tv_sec	1	1	0
Total Execution Time			0.18323707580566

As you can see, to execute this from PHP it took 100 ms, 100s pages reclaimed and 200s voluntary context switches to query 35 servers.

Now let’s look at Asynchronous execution of SELECT 1; // the query generation is from PHP yet the execution is performed on a server that executes the query in parallel

Field	End Value	Start Value	Delta
ru_oublock	0	0	0
ru_inblock	0	0	0
ru_msgsnd	0	0	0
ru_msgrcv	0	0	0
ru_maxrss	0	0	0
ru_ixrss	0	0	0
ru_idrss	0	0	0
ru_minflt	9131	9121	10
ru_majflt	0	0	0
ru_nsignals	0	0	0
ru_nvcsw	3891	3889	2
ru_nivcsw	290	290	0
ru_nswap	0	0	0
ru_utime.tv_usec	596287	596287	0
ru_utime.tv_sec	4	4	0
ru_stime.tv_usec	460028	460028	0
ru_stime.tv_sec	0	0	0
Total Execution Time			0.019363880157471

As you can see from the table above executing the query asynchronously produced results with less context switching, less pages reclaimed and almost 10 times execution improvement over the synchronous query counterpart.
How is the asynchronous query executed? Lets take a look at the figure below.

So A user comes through the firewall / load balancer with a HTTP Request to the www pool that runs PHP. PHP now makes a CURL request to the Async Shard Servers (through a LB same LB different PORT). The HTTP Request to the Async Shard Server contains the SQL we wish to execute. The Async Shard Servers has a thread per shard and executes the request in parallel. The results are merged and sent to the calling CURL process via JSON. The returned JSON is then converted into a PHP object. This is a typical three-tier environment.

When having to query multiple servers using an Asynchronous Tier is dramatically faster; in fact its as fast as the slowest server. This is the main sticking point of why asynchronous queries are faster then synchronous queries (in this context) since the total execution time for serial queries is the SUM of all the query execution.

The current version of the server is used for Friend Query execution across the datalayer. Its been solid for a few months now, and I'm currently getting permission to release it as an Open Source Product. The features this server contains:

Lightweight

CPU bounded

Scales Linearly

A Timer Thread to keep the database config up to date in memory and fetching the config from PHP so if PHP changes connections to the shards so does Java

Uses Java-6 Executor Service

Merges the result set prior to sending it to the calling process

Communicates via JSON

Uses MySQL Connector/J

Supports a high concurrency

Optimized thread usage

4 comments:

Unknown said...: Nice post, the power of async queries is yet to be realized by most, mostly due to the APIs not being there. Rather than introducing an entire new layer into the mix, I would suggest using non-blocking I/O directly on the first layer PHP machines. This is what I added into libdrizzle and the corresponding PHP extension. For example, you can queue N queries and have them run concurrently, possibly against multiple servers, all within PHP.

http://cvs.php.net/viewvc.cgi/pecl/drizzle/drizzle.php?view=markup

Towards the end of this is a "concurrent query interface" example.

-Eric; Mon Nov 09, 09:31:00 PM
Anonymous said...: http://blog.ulf-wendel.de/?p=201; Tue Nov 10, 12:04:00 AM
Anonymous said...: ... the link http://blog.ulf-wendel.de/?p=201 points you "PHP: How mysqlnd async queries help you with sharding!" from about a year ago.; Tue Nov 10, 01:15:00 AM
Dathan Pattishall said...: Thats awesome Eric. I will give it a try; Tue Nov 10, 12:47:00 PM

Monday, November 09, 2009

Asynchronous Queries verses Synchronous Queries

4 comments: