Monday, July 07, 2008

GROUP BY / ORDER BY Optimization, Part II

In my previous blog post I talked about GROUP BY and ORDER BY optimizations. A member asked a great question that I'd like to share with everyone.


But what if the query was:

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2 ORDER BY c3 DESC LIMIT 10;




That query would produce a temp table and a filesort.
explain SELECT c1, c2, c3, SUM(c4) FROM column_test WHERE c1 = 1 GROUP BY c2 ORDER BY c3 DESC LIMIT 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: column_test
type: ref
possible_keys: c1
key: c1
key_len: 5
ref: const
rows: 1
Extra: Using where; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)
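For reference, the examples assume a table along these lines. This is a sketch only; the original post never shows the actual DDL for column_test, so the column types are assumptions (key_len: 5 above is consistent with a nullable INT):

```sql
-- Hypothetical schema matching the EXPLAIN output above.
-- The covering index (c1, c2, c3, c4) is what makes "Using index" possible.
CREATE TABLE column_test (
  c1 INT,
  c2 INT,
  c3 INT,
  c4 INT,
  KEY c1 (c1, c2, c3, c4)
) ENGINE=InnoDB;
```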




The reason: the index is (c1, c2, c3, c4).

The WHERE c1 = ? filter and the GROUP BY c2 can both use that index, but to order the data through the index as well you would need to rewrite the query as

WHERE c1 = ? GROUP BY c2, c3 ORDER BY c1 DESC, c2 DESC, c3 DESC

explain SELECT c1, c2, c3, SUM(c4) FROM column_test WHERE c1 = 1 GROUP BY c2,c3 ORDER BY c1, c2\G

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: column_test
type: ref
possible_keys: c1
key: c1
key_len: 5
ref: const
rows: 1
Extra: Using where; Using index
1 row in set (0.00 sec)

to get rid of the temp table and filesort. Filesorts and temp tables take about 50% of this query's time, so avoid them when the query is executed at a high rate.

Also note that to get rid of the temporary table and filesort, the query itself had to change, so it no longer answers the original question without post-processing the data in PHP or some other layer.
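As a sketch of the post-processing mentioned above (the data and the LIMIT of 2 are invented for illustration), the application layer takes the index-friendly result, which arrives ordered by c1, c2, and re-applies the original ORDER BY c3 DESC LIMIT itself. If c3 is not functionally dependent on c2, the application would also have to merge the (c2, c3) groups back into c2 groups, which this sketch skips:

```python
# Hypothetical rows as returned by the index-friendly query:
#   SELECT c1, c2, c3, SUM(c4) ... GROUP BY c2, c3 ORDER BY c1, c2
# Each tuple is (c1, c2, c3, sum_c4).
rows = [
    (1, 1, 30, 100),
    (1, 2, 10, 250),
    (1, 3, 50, 75),
    (1, 4, 20, 300),
]

# Re-apply the original intent in the application layer:
# ORDER BY c3 DESC LIMIT 2 (2 instead of 10 to keep the example small).
top = sorted(rows, key=lambda r: r[2], reverse=True)[:2]
print(top)
```

The database does the cheap, index-friendly part; the expensive final ordering moves to a layer where the row count is already small.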




The reason: the data is ordered by the index. For InnoDB the entire table is ordered by the PRIMARY KEY, and every secondary index carries a reference back to it. MySQL's implementation of GROUP BY does not know how to traverse part of the key and sort the results in a single pass; the optimizer needs a lot of work here. So at the MySQL level a temporary table is created automatically and sorted, instead of reusing the index that was already traversed.
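A tiny sketch of the situation (pure Python, with invented data): within a fixed c1 the index hands back rows already sorted by (c2, c3), so grouping on c2 alone is a single pass, but the c3 values coming out of those groups are in no particular order. That gap is exactly what the temporary table and filesort fill:

```python
# Invented index entries for c1 = 1, already sorted by (c2, c3),
# as a scan of the (c1, c2, c3, c4) index would return them.
index_rows = [
    (1, 40, 5),   # (c2, c3, c4)
    (1, 60, 7),
    (2, 10, 3),
    (3, 90, 1),
    (4, 20, 8),
]

# GROUP BY c2 in one pass over the sorted rows: keep the first c3 seen
# per group and accumulate SUM(c4).
groups = []
for c2, c3, c4 in index_rows:
    if groups and groups[-1][0] == c2:
        groups[-1][2] += c4
    else:
        groups.append([c2, c3, c4])

c3_values = [g[1] for g in groups]
print(c3_values)  # not in descending order -> a sort is still required
```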

I believe that in 5.1 this case is being worked on at the optimizer level to get rid of this common slowdown.

1 comment:

womagrid said...

Suppose you have a lot of customer data that should always be sorted by time and id (primary key). A compound index on (time, id) would seem to be a good idea.

This works well when looking at data for all customers but half the time you're only interested in data pertaining to one customer. How do you then avoid the need to create an almost duplicate compound key (customer, time, id) to handle the addition of "where customer = ..." to the query?