Wednesday, September 06, 2006

4.1 Optimizer and Joins

The Optimizer in 4.1 is weird. When doing joins, I expect it to figure out how to pick which tables to lookup 1st and compare in a correct manor. My expectations is a bit to much.


For instance say you have table:

  • A with 1 million rows

  • B with 10 million rows

  • C with 100 million rows



So, doing a small range on table A and taking these results to filter out with the other tables I expect the join order to be

A, B, C

Yet, the mySQL optimizer in many cases will join the table in the order of

B, A, C.

This is wrong. I know that the range generated from A is smaller then the range generated from B.

To get around this I use STRAIGHT_JOIN in a global context


SELECT STRAIGHT_JOIN SQL_CALC_FOUND_ROWS A.*, B.*, C.* FROM A,B,C WHERE A.owner_id = 123 AND A.id=B.id AND C.range_id=3 AND C.a_id=B.id ORDER BY A.key_part_2 DESC LIMIT 0,100


How do I know this is the correct order?

Well, order is based on the smallest number of results that will return of the elements involved.

A by itself will return 10 rows
B by itself will return 100 rows
C by itself will return 1000 rows

So A's 10 rows reduces B to 10, then C will reduce A,B to 5 since out of A,B set only 5 rows intersect with the entire C set.


5.0 fix some major deficiencies with the optimizer, but it's far from perfect. Personally, I think its just marginally better then 4.1, although I have not played with it as much as 4.1's.

What would be cool is for mySQL to build a Machine Learning algorithm to make the optimizer learn this stuff over a period of time. Maybe this could be a new type of index that could be made to dynamically act like a BTREE for any desired question. Decision Trees may hold the answer, and new ways of keeping data organized. I think this would be a cool feature for mySQL, which they can use to get customers to pay a subscription fee for.

10 comments:

Anonymous said...

You're right about 5.0 not being much better in this regard. It often gives me this problem with complex joins. I'm just glad the STRAIGHT_JOIN directive gives us a tool to fix it until they figure out a way to make the optimizer smarter.

Anonymous said...

Would there ever be a case where it would be more intelligent to read a marginally larger result set first (e.g. 100 rows vs 10), if it meant that MySQL could read the rows consecutively, instead of having to jump around to match rows within a large table? Would it be faster to do the seeking around within the small table?

Anonymous said...

MySQL literally does a cross join of 111 million rows, and then filters out rows via the WHERE clause. I'm surprised you're not complaining about the performance.

Is this really true? Trying out a few queries both ways on 5.0.20 I see identical explain plans and execution times. If the where clause was simply a filter after the cross join, then all ',' joined queries would run horrendously surely?

I'm not saying that there aren't cases where use of INNER JOIN doesn't result in better query plans, but I've never really noticed the 'cross join then filter' behaviour you mention for ',' joins.

Do you have any examples?

Cheers! :)


As an aside and not in reply to any particular comment, I was reading this earlier - highly recommend it -

http://dev.mysql.com/doc/internals/en/optimizer.html

I must admit I was a bit surprised to find that the optimiser isn't anywhere near as clever as I thought it was in some circumstances. (see the bit about index choice sometimes being based on the order that indexes were created in! So odd.)

Toasty

Anonymous said...

The B,A,C join order is especially odd as you've no WHERE criteria limiting the rows returned from the B table, so I'm not sure why the DB would pick B to start with at all.

I suppose if you've got no indexes on A.owner_id or C.range_id then the DB might not think that they're vastly better choices to start with - but even so you'd expect the smaller table to be used first. Of course this assumes that overall table row count is indeed something mysql uses for join optimisation - and I'm not sure about this.

I'd start with running ANALYZE TABLE to update your index stats if you can as Sheeri suggests. (Not sure how much it will help if you've no indexes on A.owner_id or C.range_id though)


Personally I've only ever seen anything as odd as this with tables with index stats which are waaay out.

Oh, and what does the explain plan look like? It might well give a clue as to why MySQL has 'optimised' the query this way.

Toasty

Anonymous said...

A join is between two tables right?
How can there be a difference between joining A to B or B to A?
You are joining A AND B.

If I sound snotty, I don't mean to, I'm genuinly interested in the difference.

I'm also curious about a point sheeri brought up. I tend to put WHERE conditions in an ON clause for no particular reason. Is that a benefit?

Anonymous said...

A join is between two tables right?
How can there be a difference between joining A to B or B to A?
You are joining A AND B.


Logically yeah they're the same, but in practice the order of joining can make a huge perf difference.

AFAIK MySQL always joins tables using a nested join, where it looks at one table (using an index scan say) and for each value separately looks up the corresponding value in the second table (using an index lookup say)

So if you have table A with 10 rows joining to table B with 1000000 rows it's likely to be much quicker to index scan A and do 10 index lookups on B (one for each row in A) than it is to index scan B and do 1000000 index lookups on A.

In addition, if you have further conditions/restrictions on a particular table (e.g. say your query contains WHERE B.somecol = 1 as well as the join) then AFAIK MySQL will, if somecol is indexed, have a rough idea of how many rows occur in table B where somecol=1, and so will be able to determine whether it makes sense to use table B as the 'driving' table instead of A.

e.g. if there is only 1 row where somecol=1 in table B, it probably makes sense to find that row in B first, then index lookup on table A with that row, because you need to do it just once.

Does that make any sense? It's early here :)

Toasty

Anonymous said...

Thanks Toasty, it makes sense now.

Dathan Pattishall said...

In response to sheeri.

INNER JOIN and CROSS JOIN or the comma operator are equivelent. They all produce a cartesian product.

INNER JOIN and , (comma) are semantically equivalent in the absence of a join condition: both produce a Cartesian product between the specified tables (that is, each and every row in the first table is joined to each and every row in the second table).


Join info


As for what table I'm using, it's INNODB, using myISAM in a high load production environment when doing billions of queires per day is not good. In terms of speed it's ok, but stability it's not.

Certain conditions involving deletes will cause the myISAM table to mark itself as corrupt, and require a repair.

Not to mention the maintenance needed to keep the stats on the table in order.

Thus the need for a storage engine like INNODB, for my types of problems.

When Falcon comes out and is rock solid, I hope to put that into production.

Anonymous said...

Certain conditions involving deletes will cause the myISAM table to mark itself as corrupt, and require a repair.
Not to mention the maintenance needed to keep the stats on the table in order.
Thus the need for a storage engine like INNODB, for my types of problems.


Ooh I've never come across this problem, but we're not dealing with anywhere near as many queries or as much data as you have to. (probably around 50 million queries on a busy day, but with only maybe 50GB of data and relatively few deletes on the MyISAM stuff.)

Do you know the specifics or did it just happen sometimes? If you do know the specifics I'd be very interested to hear!


>When Falcon comes out and is rock solid, I hope to put that into production.

Yeah it sounds very promising. The presentation video by Jim Starkey is definitely an interesting watch.

Cheers,
Toasty

Anonymous said...

"What would be cool is for mySQL to build a Machine Learning algorithm to make the optimizer learn this stuff over a period of time."
Whar a great idea.

Chris Poust Eurotemp