Wednesday, June 11, 2008

How to pick indexes for order by and group by queries

First some of the things that you need to use and understand

Explain Syntax

Order by Optimization

Group by Optimization

Update: Updated errors.

Now some details that are usually missed. GROUP BY does sorting unless you tell mysql not to. GROUP BY has two optimization methods, loose index scan, and tight index scan.

Loose index scan, scans the entire table index, while tight index scan uses some sort of constraint. For large datasets that are accessed often and require some sort of group by, tight index scans are better.


So how to pick columns to create the optimal indexes. Here is a list of practices (rules) that I personally follow:

1. What is the question asking?
2. What current indexes are on the table?
3. Can the query be re-written to use an existing index?
4. What is the overhead of adding a new index?
5. Follow left-most prefix rules
6. Build indexes that remove filesorts and temporary tables, for all query types.
7. Use statements that reduce the rows examined, and keep each index small. Note: each secondary index in INNODB requires its own page.

So lets look at a table

T(c1,c2,c3,c4)
PRIMARY(c1,c2);


A question asking

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c1,c2,c3 ORDER BY c3 DESC LIMIT 10;

Instinct is to add an INDEX

IDX(c1,c2,c3,c4); Since c1 is the constraint, grouped by c2,c3 SUMMING c4.

BUT when running this condition through explain you'll see that in the EXTRA column

Using where; Using temporary; Using filesort


This is bad, it means WHERE is constraining the clause but the GROUP BY and ORDER BY is producing a temporary table, and a second pass to sort the data.

So your query time is find the rows - put them in temp table which can hit disk, sort them randomly tickling a few thread based buffers. Queries such as these do not scale and can hog up memory.

Let's breakdown what the question is asking.

Give me all the results for c1 where c1 == ?, flatten the results, ORDER the results by the MAX of c3, SUM c4.

Now that we know what the question is asking, lets "re-word" the question

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2, c3 ORDER BY c3 DESC, c2 DESC LIMIT 10;

This says

Give me all the results for c1 where c1 ==?, flatten the results by the MAX of c3, SUM c4 and if c3.rowN-1 == c3.rowN the tie breaker will be c2.



This is the same asking question reworded. Since GROUP BY AND ORDER BY is sorting the MAX of c3-c2 column refers to the same c2 values as the original query output.

The IDX is now

IDX(c1,c3,c2) and explain produces an extra of

Using WHERE


This is still not good enough-since there is an additional seek to return the columns asked for. Also SUM is being done on the data and not on the IDX.

To fix this IDX should be

IDX(c1,c3,c2,c4) and now explain produces an extra of

Using WHERE; Using index.


What does Using index mean? This means that mysql will not have to do an additional seek to read the actual row.

Now group by and order by have been optimized with adding an additional key.

How did I know to re-word the question and it would produce the same result?
Lets look at order by: That is the ending statement that changes the data. So, since c3 is also used to flatten the constraint I knew that c3 is all that is needed to refer to the same rows, c2 is used to be a tie-breaker when c3RowN-1 == c3RownN.


These are the sort of hints you can use to optimize SQL.

5 comments:

Anonymous said...

I may be missing something here, but the "before" and "after" queries don't look equivalent to me. Try the following dataset:

http://legionsgs.com/xtra/garth/TestData.csv

As written, the queries don't return the same sets of records. If you remove the LIMIT clauses, they return the same records, but in a different order.

However, if you really don't care about the order, just remove the ORDER BY clause from the original query. If you do that, it already runs optimally under IDX(c1, c2, c3, c4); no need to do any reorganizing.

Dathan Pattishall said...

No your write, I have a typo fixing.

Anonymous said...

hehe, you see it's not easy to work out a good index even for this relatively simple case. If you're interested in automated index verification/generation, I recommend you check out an open source tool I'm working on: http://ritmark.com

BR,
Vladimir

kellan said...

Absolute best info I could find on picking indexes for group bys, but I'm still having a bit of trouble applying it.

But what if the query was:

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2 ORDER BY c3 DESC LIMIT 10;

Dathan Pattishall said...

But what if the query was:

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2 ORDER BY c3 DESC LIMIT 10;


That would produce a temp table and a filesort.

The reason the index is c1,c2,c3,c4
So where c1=? and the group by of c2 would use that index, but to order the data properly you would need to do

ORDER BY c1 DESC, c2 DESC, c3 DESC

to get rid of the temp table or filesort.

I believe in 5.1 that this case is being worked on in the optimizer level.