Showing posts with label order by. Show all posts
Showing posts with label order by. Show all posts

Tuesday, May 19, 2009

Multi Direction Sorts and avoiding a file sort

There are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC.
When these two sort definitions are put together in a single statement a filesort is produced.

Why do we want to avoid filesorts?

Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.

So, here is an example

CREATE TABLE `ABCD` (
`A` int(10) unsigned NOT NULL default '0',
`B` int(10) unsigned NOT NULL default '0',
`C` int(10) unsigned NOT NULL default '0',
`D` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`a`,`b`,`c`,`d`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1



mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ABCD
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: const,const
rows: 2
Extra: Using where; Using index; Using filesort
1 row in set (0.00 sec)



Notice the filesort? So how does one get around this filesort?

Well

Let's define some roles for columns C and D. C is the parent while D is the child.

  • We want all the latest parents (C)
  • We want all the oldest children (D)

    We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED

    SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC
    explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: ABCD
    type: ref
    possible_keys: PRIMARY
    key: PRIMARY
    key_len: 8
    ref: const,const
    rows: 2
    Extra: Using where; Using index
    1 row in set (0.00 sec)




    Now

    FOREACH($C_parent as $i => $c_id) {

    $C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;


    }

    So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.

    What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.
  • Wednesday, June 11, 2008

    How to pick indexes for order by and group by queries

    First some of the things that you need to use and understand

    Explain Syntax

    Order by Optimization

    Group by Optimization

    Update: Updated errors.

    Now some details that are usually missed. GROUP BY does sorting unless you tell mysql not to. GROUP BY has two optimization methods, loose index scan, and tight index scan.

    Loose index scan, scans the entire table index, while tight index scan uses some sort of constraint. For large datasets that are accessed often and require some sort of group by, tight index scans are better.


    So how to pick columns to create the optimal indexes. Here is a list of practices (rules) that I personally follow:

    1. What is the question asking?
    2. What current indexes are on the table?
    3. Can the query be re-written to use an existing index?
    4. What is the overhead of adding a new index?
    5. Follow left-most prefix rules
    6. Build indexes that remove filesorts and temporary tables, for all query types.
    7. Use statements that reduce the rows examined, and keep each index small. Note: each secondary index in INNODB requires its own page.

    So lets look at a table

    T(c1,c2,c3,c4)
    PRIMARY(c1,c2);


    A question asking

    SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c1,c2,c3 ORDER BY c3 DESC LIMIT 10;

    Instinct is to add an INDEX

    IDX(c1,c2,c3,c4); Since c1 is the constraint, grouped by c2,c3 SUMMING c4.

    BUT when running this condition through explain you'll see that in the EXTRA column

    Using where; Using temporary; Using filesort


    This is bad, it means WHERE is constraining the clause but the GROUP BY and ORDER BY is producing a temporary table, and a second pass to sort the data.

    So your query time is find the rows - put them in temp table which can hit disk, sort them randomly tickling a few thread based buffers. Queries such as these do not scale and can hog up memory.

    Let's breakdown what the question is asking.

    Give me all the results for c1 where c1 == ?, flatten the results, ORDER the results by the MAX of c3, SUM c4.

    Now that we know what the question is asking, lets "re-word" the question

    SELECT c1, c2, c3, SUM(c4) FROM T WHERE c1 = ? GROUP BY c2, c3 ORDER BY c3 DESC, c2 DESC LIMIT 10;

    This says

    Give me all the results for c1 where c1 ==?, flatten the results by the MAX of c3, SUM c4 and if c3.rowN-1 == c3.rowN the tie breaker will be c2.



    This is the same asking question reworded. Since GROUP BY AND ORDER BY is sorting the MAX of c3-c2 column refers to the same c2 values as the original query output.

    The IDX is now

    IDX(c1,c3,c2) and explain produces an extra of

    Using WHERE


    This is still not good enough-since there is an additional seek to return the columns asked for. Also SUM is being done on the data and not on the IDX.

    To fix this IDX should be

    IDX(c1,c3,c2,c4) and now explain produces an extra of

    Using WHERE; Using index.


    What does Using index mean? This means that mysql will not have to do an additional seek to read the actual row.

    Now group by and order by have been optimized with adding an additional key.

    How did I know to re-word the question and it would produce the same result?
    Lets look at order by: That is the ending statement that changes the data. So, since c3 is also used to flatten the constraint I knew that c3 is all that is needed to refer to the same rows, c2 is used to be a tie-breaker when c3RowN-1 == c3RownN.


    These are the sort of hints you can use to optimize SQL.