Let's assume you have a session table of the following
mysql> describe sessions; +----------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------+---------------------+------+-----+---------+-------+ | user_id | bigint(20) unsigned | NO | PRI | 0 | | | added_ms | bigint(20) unsigned | NO | PRI | 0 | | | appVer | varchar(8) | YES | | NULL | | | device | bigint(20) unsigned | YES | MUL | NULL | | | start | int(10) unsigned | NO | MUL | NULL | | | stop | int(10) unsigned | NO | | NULL | | +----------+---------------------+------+-----+---------+-------+
The data is federated (distributed) by user_id. This table exists across 1000s of servers. How do you get the average session length for the month of May?
- The question already scopes the process to hit every single server
- Second we can't just take AVG((stop-start)) and then sum and divide that by the number of shards
- We can't pull all the data in memory
- We don't want to have to pull the data and upload it to BigQuery or Amazon RedShift
- We want a daily report at some point
SELECT SUM((stop-start)) as sess_diff, count(*) as sess_sample FROM sessions WHERE start BETWEEN $start AND $stop AND stop>startThe above SQL statement says for the connection to a single server give me the sum of the session delta and count the corresponding rows in the set. In this case the SUM of SUMs (sum of session_delta) is the numerator and the sum of sess_sample is the denominator.Now do this across all servers and finally write some client code to take a few rows < 1000 to report the number.
$total = 0; $sessions_diff = 0; foreach ($rows as $shard_id => $result) { $sessions_diff = \bcadd($sessions_diff, $result[0]['sess_diff']); $total = \bcadd($total, $result[0]['sess_sample']); }
Now the session_avg = sessions_diff/total
Tada a query that can take hours if done on a traditional mining server is done in ms.
No comments:
Post a Comment