Monday, June 01, 2015

Reporting Across Shards

If you have chosen to split your data across boxes, and architected your app to not query across boxes there is still a case where you will need to. Data mining, reports and data health checks require hitting all servers at some point. The case I am going over is sessions and figuring out the Session Length without taking averages of averages which is wrong.


Let's assume you have a session table of the following

mysql> describe sessions;
+----------+---------------------+------+-----+---------+-------+
| Field    | Type                | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+-------+
| user_id  | bigint(20) unsigned | NO   | PRI | 0       |       |
| added_ms | bigint(20) unsigned | NO   | PRI | 0       |       |
| appVer   | varchar(8)          | YES  |     | NULL    |       |
| device   | bigint(20) unsigned | YES  | MUL | NULL    |       |
| start    | int(10) unsigned    | NO   | MUL | NULL    |       |
| stop     | int(10) unsigned    | NO   |     | NULL    |       |
+----------+---------------------+------+-----+---------+-------+


The data is federated (distributed) by user_id. This table exists across 1000s of servers. How do you get the average session length for the month of May?

  • The question already scopes the process to hit every single server
  • Second we can't just take AVG((stop-start)) and then sum and divide that by the number of shards
  • We can't pull all the data in memory
  • We don't want to have to pull the data and upload it to BigQuery or Amazon RedShift
  • We want a daily report at some point
SELECT SUM((stop-start)) as sess_diff, count(*) as sess_sample FROM sessions WHERE start BETWEEN $start AND $stop AND stop>start
The above SQL statement says for the connection to a single server give me the sum of the session delta and count the corresponding rows in the set. In this case the SUM of SUMs (sum of session_delta) is the numerator and the sum of sess_sample is the denominator.
Now do this across all servers and finally write some client code to take a few rows < 1000 to report the number.

$total = 0;
$sessions_diff = 0;

foreach ($rows as $shard_id => $result) {

    $sessions_diff = \bcadd($sessions_diff, $result[0]['sess_diff']);
    $total = \bcadd($total, $result[0]['sess_sample']);
}


Now the session_avg = sessions_diff/total

Tada a query that can take hours if done on a traditional mining server is done in ms.


No comments: