mySQL DBA, Architecture, Dev, Scale, HA, Code : order by

Tuesday, May 19, 2009

Multi Direction Sorts and avoiding a file sort

There are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC.
When these two sort definitions are put together in a single statement a filesort is produced.

Why do we want to avoid filesorts?

Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.

So, here is an example


 CREATE TABLE `ABCD` (
  `A` int(10) unsigned NOT NULL default '0',
  `B` int(10) unsigned NOT NULL default '0',
  `C` int(10) unsigned NOT NULL default '0',
  `D` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`a`,`b`,`c`,`d`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1


mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G      

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ABCD
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const,const
         rows: 2
        Extra: Using where; Using index; Using filesort
1 row in set (0.00 sec)

Notice the filesort? So how does one get around this filesort?

Well

Let's define some roles for columns C and D. C is the parent while D is the child.

We want all the latest parents (C)

We want all the oldest children (D)

We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED


SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC
explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G   
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ABCD
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const,const
         rows: 2
        Extra: Using where; Using index
1 row in set (0.00 sec)

Now

FOREACH($C_parent as $i => $c_id) {

$C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;

}

So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.

What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.

Wednesday, June 11, 2008

How to pick indexes for order by and group by queries

First some of the things that you need to use and understand

Explain Syntax

Order by Optimization

Group by Optimization

Update: Updated errors.

Now some details that are usually missed. GROUP BY does sorting unless you tell mysql not to. GROUP BY has two optimization methods, loose index scan, and tight index scan.

Loose index scan, scans the entire table index, while tight index scan uses some sort of constraint. For large datasets that are accessed often and require some sort of group by, tight index scans are better.

So how to pick columns to create the optimal indexes. Here is a list of practices (rules) that I personally follow:

1. What is the question asking?
2. What current indexes are on the table?
3. Can the query be re-written to use an existing index?
4. What is the overhead of adding a new index?
5. Follow left-most prefix rules
6. Build indexes that remove filesorts and temporary tables, for all query types.
7. Use statements that reduce the rows examined, and keep each index small. Note: each secondary index in INNODB requires its own page.

So lets look at a table

T(c1,c2,c3,c4)
PRIMARY(c1,c2);

A question asking

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c1,c2,c3 ORDER BY c3 DESC LIMIT 10;

Instinct is to add an INDEX

IDX(c1,c2,c3,c4); Since c1 is the constraint, grouped by c2,c3 SUMMING c4.

BUT when running this condition through explain you'll see that in the EXTRA column

Using where; Using temporary; Using filesort

This is bad, it means WHERE is constraining the clause but the GROUP BY and ORDER BY is producing a temporary table, and a second pass to sort the data.

So your query time is find the rows - put them in temp table which can hit disk, sort them randomly tickling a few thread based buffers. Queries such as these do not scale and can hog up memory.

Let's breakdown what the question is asking.

Give me all the results for c1 where c1 == ?, flatten the results, ORDER the results by the MAX of c3, SUM c4.

Now that we know what the question is asking, lets "re-word" the question

SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2, c3 ORDER BY c3 DESC, c2 DESC LIMIT 10;

This says

Give me all the results for c1 where c1 ==?, flatten the results by the MAX of c3, SUM c4 and if c3.rowN-1 == c3.rowN the tie breaker will be c2.

This is the same asking question reworded. Since GROUP BY AND ORDER BY is sorting the MAX of c3-c2 column refers to the same c2 values as the original query output.

The IDX is now

IDX(c1,c3,c2) and explain produces an extra of

Using WHERE

This is still not good enough-since there is an additional seek to return the columns asked for. Also SUM is being done on the data and not on the IDX.

To fix this IDX should be

IDX(c1,c3,c2,c4) and now explain produces an extra of

Using WHERE; Using index.

What does Using index mean? This means that mysql will not have to do an additional seek to read the actual row.

Now group by and order by have been optimized with adding an additional key.

How did I know to re-word the question and it would produce the same result?
Lets look at order by: That is the ending statement that changes the data. So, since c3 is also used to flatten the constraint I knew that c3 is all that is needed to refer to the same rows, c2 is used to be a tie-breaker when c3RowN-1 == c3RownN.

These are the sort of hints you can use to optimize SQL.