Showing posts with label index. Show all posts
Showing posts with label index. Show all posts

Wednesday, June 11, 2008

How to pick indexes for order by and group by queries

First some of the things that you need to use and understand

Explain Syntax

Order by Optimization

Group by Optimization

Update: Updated errors.

Now some details that are usually missed. GROUP BY does sorting unless you tell mysql not to. GROUP BY has two optimization methods, loose index scan, and tight index scan.

Loose index scan, scans the entire table index, while tight index scan uses some sort of constraint. For large datasets that are accessed often and require some sort of group by, tight index scans are better.


So how to pick columns to create the optimal indexes. Here is a list of practices (rules) that I personally follow:

1. What is the question asking?
2. What current indexes are on the table?
3. Can the query be re-written to use an existing index?
4. What is the overhead of adding a new index?
5. Follow left-most prefix rules
6. Build indexes that remove filesorts and temporary tables, for all query types.
7. Use statements that reduce the rows examined, and keep each index small. Note: each secondary index in INNODB requires its own page.

So lets look at a table

T(c1,c2,c3,c4)
PRIMARY(c1,c2);


A question asking

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c1,c2,c3 ORDER BY c3 DESC LIMIT 10;

Instinct is to add an INDEX

IDX(c1,c2,c3,c4); Since c1 is the constraint, grouped by c2,c3 SUMMING c4.

BUT when running this condition through explain you'll see that in the EXTRA column

Using where; Using temporary; Using filesort


This is bad, it means WHERE is constraining the clause but the GROUP BY and ORDER BY is producing a temporary table, and a second pass to sort the data.

So your query time is find the rows - put them in temp table which can hit disk, sort them randomly tickling a few thread based buffers. Queries such as these do not scale and can hog up memory.

Let's breakdown what the question is asking.

Give me all the results for c1 where c1 == ?, flatten the results, ORDER the results by the MAX of c3, SUM c4.

Now that we know what the question is asking, lets "re-word" the question

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2, c3 ORDER BY c3 DESC, c2 DESC LIMIT 10;

This says

Give me all the results for c1 where c1 ==?, flatten the results by the MAX of c3, SUM c4 and if c3.rowN-1 == c3.rowN the tie breaker will be c2.



This is the same asking question reworded. Since GROUP BY AND ORDER BY is sorting the MAX of c3-c2 column refers to the same c2 values as the original query output.

The IDX is now

IDX(c1,c3,c2) and explain produces an extra of

Using WHERE


This is still not good enough-since there is an additional seek to return the columns asked for. Also SUM is being done on the data and not on the IDX.

To fix this IDX should be

IDX(c1,c3,c2,c4) and now explain produces an extra of

Using WHERE; Using index.


What does Using index mean? This means that mysql will not have to do an additional seek to read the actual row.

Now group by and order by have been optimized with adding an additional key.

How did I know to re-word the question and it would produce the same result?
Lets look at order by: That is the ending statement that changes the data. So, since c3 is also used to flatten the constraint I knew that c3 is all that is needed to refer to the same rows, c2 is used to be a tie-breaker when c3RowN-1 == c3RownN.


These are the sort of hints you can use to optimize SQL.