Tuesday, November 27, 2007

When starting out building an application, should I use complex JOINs?

I get questions on how to build MySQL applications from time to time, and I wanted to share my response to the question below.

In the beginning, should a developer try to stay away from complex JOINs? Is it better to use many smaller queries and cache the results (using something like memcached) rather than one larger query requiring multiple table JOINs?

Needing to scale for millions of users is a great problem to have, but should I worry about it in the beginning? What's the best approach to database design for day 1?

Well, IMHO: build the application, and build it to spec. When the scope changes (and this will always happen), flow with it. This is by no means a knock against Product; I actually look forward to scope changes, because it means things are being discussed, and if something needs to change, let's do it!

Now, how do you build the application to scale? Look at the scope of a page and gauge how frequently it is accessed. For instance, a page displaying a single photo should use few, if any, ranges, GROUP BYs, ORDER BYs, or JOINs, since a photo page needs to render fast or lose the viewer's interest. On the other hand, a page displaying an account holder's account information is accessed at an order of magnitude lower rate, so it can afford tougher queries. (Note: watch out for contention.)

Once the feature set has been decided upon, take the application apart and find the bottlenecks. Highlight the issues that will not scale.

This can be figured out by filling your dev database with bogus data and running mysqlslap on your common SELECT queries to test stability, response time, throughput, and so on.
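The same fill-with-bogus-data-and-measure idea can be sketched in a few lines. This is a rough illustration only, using Python with SQLite standing in for the dev database and a simple timing loop standing in for mysqlslap's reporting; the table and column names are made up:

```python
import sqlite3
import time

# In-memory SQLite stands in for the dev database; the approach is the
# same with MySQL and mysqlslap, just at a much larger scale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")

# Fill the table with bogus data, as suggested above.
conn.executemany(
    "INSERT INTO photos (user_id, title) VALUES (?, ?)",
    [(i % 1000, "photo %d" % i) for i in range(100_000)],
)
conn.commit()

# Time a "common select query" over many iterations.
start = time.perf_counter()
for _ in range(200):
    rows = conn.execute(
        "SELECT id, title FROM photos WHERE user_id = ?", (42,)
    ).fetchall()
elapsed = time.perf_counter() - start
print("rows per fetch:", len(rows))
print("avg ms per query: %.3f" % (elapsed / 200 * 1000))
```

The numbers themselves matter less than the habit: measure against realistic data volumes before the traffic arrives, not after.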

Take into account how scale can be achieved before changing code:

  • Is the hardware adequate to run all of my services?
  • Can I split some services out, such as moving WWW traffic to a separate box from the database? (You would be surprised how many people combine the servers.)
  • Am I allocating MySQL resources properly? If not, how can I tell?
  • What is my system reporting as the scarce resource? Am I running out of CPU or I/O?

The real performance gains come from laying out the data properly and reducing the number of sorts, ranges, and JOINs needed to generate your dynamic pages.

In some cases I've had to add tables whose data represents start points for ranges in other tables. Then, on top of that, I spread that core logic across MANY individual machines just to get a page to load in less than 300 ms.
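For illustration, a "start point" table might look like the following. This is a hypothetical sketch (SQLite via Python, invented table and column names): a small summary table stores, per user, the first row id in a large table, so the page query can seek straight to its range instead of scanning or sorting the big table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
-- Hypothetical "start point" table: per user, the first event id,
-- so page queries can range from a known position in the big table.
CREATE TABLE event_start (user_id INTEGER PRIMARY KEY, first_event_id INTEGER);
""")

conn.executemany("INSERT INTO events (user_id, body) VALUES (?, ?)",
                 [(i % 10, "event %d" % i) for i in range(1000)])
conn.execute("""
INSERT INTO event_start (user_id, first_event_id)
SELECT user_id, MIN(id) FROM events GROUP BY user_id
""")

# Page query: look up the start point, then range forward from it.
(start_id,) = conn.execute(
    "SELECT first_event_id FROM event_start WHERE user_id = ?", (3,)).fetchone()
page = conn.execute(
    "SELECT id, body FROM events WHERE id >= ? AND user_id = ? LIMIT 5",
    (start_id, 3)).fetchall()
print(page)
```

The trade-off is that the start-point table has to be maintained as data changes, which is why this kind of denormalization is worth it only after benchmarks point to the range scan as the bottleneck.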

Finally, add Nagios to alert you when things go wrong (such as too many processes running on a database server) and Ganglia to chart system resources over time.

    So, to answer the question: should I use complex JOINs at first?
  • Sure, but here is a rule of thumb: the harder the query is, the less often it can be used.

  • If traffic ramps up, look at which queries are the hardest and how frequently they run. Try to reduce their execution times by spreading the data across more tables or more servers, but only if benchmarks show that this is the way to go.

  • If traffic really ramps up because the application is the next big thing since Flickr, then look at the larger picture of what should be done to scale.


Dale L. said...

I always say to let the database do the heavy lifting when it comes to managing data; that's what it was built for. If you don't put the complexity in your JOIN, you will put it somewhere else within your app. If you believe you will have millions of users, so be it: I can guarantee you the DB will handle the selection of the data you need much, much better than you can with several queries and application logic.
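Dale's point can be sketched with a toy example (SQLite via Python, with a hypothetical users/photos schema): a single JOIN returns in one round trip what the alternative, one query per user stitched together in application code, spreads across N+1 queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.executemany("INSERT INTO photos (user_id, title) VALUES (?, ?)",
                 [(1, "sunset"), (1, "cat"), (2, "dog")])

# One JOIN: the database does the matching.
joined = conn.execute("""
    SELECT u.name, p.title
    FROM users u JOIN photos p ON p.user_id = u.id
    ORDER BY u.id, p.id
""").fetchall()

# The alternative: N+1 queries stitched together in application code.
stitched = []
for uid, name in conn.execute("SELECT id, name FROM users ORDER BY id"):
    for (title,) in conn.execute(
            "SELECT title FROM photos WHERE user_id = ? ORDER BY id", (uid,)):
        stitched.append((name, title))

print(joined == stitched)  # same rows either way, but the JOIN is one round trip
```

Over a network, each of those extra round trips adds latency that the single JOIN avoids, which is the heart of the "let the database do the heavy lifting" argument.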

David H said...

I have found in *all* of my experience that the number one point of contention on websites is I/O, and JOINs are a wonderful way to increase I/O. The question is: to what extent? And what are the possible alternatives (since pulling data into application arrays can run memory low)?

Thus, benchmarking is a rough indicator of what you can expect in the future, letting you make adjustments ahead of time, before you start spending 23-hour days fixing issues that could have been resolved earlier:

First Step: You write your code in dev.

Second Step: Create monster-size tables/databases on a test server (neither the development server nor the production server) and run a benchmark script all day long.

Third Step: Make adjustments and repeat.

This will ease some future worries and will avoid the "running around like a chicken with its head cut off" syndrome in the case of unanticipated success.
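David H's measure-adjust loop can be sketched in miniature. This is an assumption-laden illustration (Python with SQLite standing in for the monster test database, made-up table names): time a query against a big table, make one adjustment (here, adding an index), and measure again:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (id INTEGER PRIMARY KEY, url TEXT, ts INTEGER)")
# Step 2: monster-size (well, miniature-monster) table of bogus rows.
conn.executemany("INSERT INTO hits (url, ts) VALUES (?, ?)",
                 [("/page/%d" % (i % 500), i) for i in range(200_000)])
conn.commit()

def bench(n=50):
    """Average seconds per run of a representative query."""
    start = time.perf_counter()
    for _ in range(n):
        conn.execute("SELECT COUNT(*) FROM hits WHERE url = ?",
                     ("/page/123",)).fetchone()
    return (time.perf_counter() - start) / n

before = bench()                                          # measure
conn.execute("CREATE INDEX idx_hits_url ON hits (url)")   # step 3: adjust...
after = bench()                                           # ...and repeat
print("before: %.4f ms, after: %.4f ms" % (before * 1000, after * 1000))
```

The specific adjustment will differ case by case (an index, a schema change, a split across servers); the loop of measure, change one thing, and measure again is the part that carries over.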


toby said...

I would endorse David H's approach, as we have just been through this cycle in our application. Create loads of test data - more than you expect to occur in reality - and try your queries. We found a variety of unexpected issues this way. A slow query can cause a cascade of issues including deadlock timeouts, etc.

mitch said...

thank you!