mySQL DBA, Architecture, Dev, Scale, HA, Code : Reporting Across Shards

Monday, June 01, 2015

Reporting Across Shards

If you have chosen to split your data across boxes, and architected your app to not query across boxes there is still a case where you will need to. Data mining, reports and data health checks require hitting all servers at some point. The case I am going over is sessions and figuring out the Session Length without taking averages of averages which is wrong.

Let's assume you have a session table of the following

mysql> describe sessions;
+----------+---------------------+------+-----+---------+-------+
| Field    | Type                | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+-------+
| user_id  | bigint(20) unsigned | NO   | PRI | 0       |       |
| added_ms | bigint(20) unsigned | NO   | PRI | 0       |       |
| appVer   | varchar(8)          | YES  |     | NULL    |       |
| device   | bigint(20) unsigned | YES  | MUL | NULL    |       |
| start    | int(10) unsigned    | NO   | MUL | NULL    |       |
| stop     | int(10) unsigned    | NO   |     | NULL    |       |
+----------+---------------------+------+-----+---------+-------+



The data is federated (distributed) by user_id. This table exists across 1000s of servers. How do you get the average session length for the month of May?





The question already scopes the process to hit every single server
Second we can't just take AVG((stop-start)) and then sum and divide that by the number of shards
We can't pull all the data in memory
We don't want to have to pull the data and upload it to BigQuery or Amazon RedShift
We want a daily report at some point







SELECT SUM((stop-start)) as sess_diff, count(*) as sess_sample FROM sessions WHERE start BETWEEN $start AND $stop AND stop>start




The above SQL statement says for the connection to a single server give me the sum of the session delta and count the corresponding rows in the set. In this case the SUM of SUMs (sum of session_delta) is the numerator and the sum of sess_sample is the denominator.




Now do this across all servers and finally write some client code to take a few rows < 1000 to report the number.








$total = 0;
$sessions_diff = 0;

foreach ($rows as $shard_id => $result) {

    $sessions_diff = \bcadd($sessions_diff, $result[0]['sess_diff']);
    $total = \bcadd($total, $result[0]['sess_sample']);
}

Now the session_avg = sessions_diff/total

Tada a query that can take hours if done on a traditional mining server is done in ms.

Monday, June 01, 2015

Reporting Across Shards

No comments: