In the beginning, should a developer try to stay away from complex JOINs? Is it better to use many smaller queries and cache the results (using something like memcached) rather than one larger query requiring multiple table JOINs?
Needing to scale for millions of users is a great problem to have, but should I worry about it in the beginning? What's the best approach to database design on day 1?
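The caching half of the question can be sketched as the classic cache-aside pattern. This is a minimal illustration, not production code: a plain dict stands in for memcached (a real client such as pymemcache exposes essentially the same get/set calls), and the function and key names are made up.

```python
import json

# A plain dict stands in for memcached here; with a real client
# the cache.get / cache[key] = ... calls look essentially the same.
cache = {}

def fetch_user(user_id, db_lookup):
    """Cache-aside: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    row = db_lookup(user_id)      # one small, simple query
    cache[key] = json.dumps(row)  # cache the result for next time
    return row
```

The second request for the same user is served from the cache and never touches the database, which is the whole point of trading one big JOIN for several small cacheable lookups.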
Well, IMHO, build the application, and build it to spec. When the scope changes (and this will always happen), flow with it. This is in no way a knock against Product; I actually look forward to it, because it means things are being discussed, and if something needs to change, let's do it!
Now, how to build the application to scale: look at the scope of a page and gauge how frequently that page is accessed. For instance, on a page displaying a single photo, very few (if any) ranges, GROUP BYs, ORDER BYs, or joins should be used to generate the page, since a photo page needs to render fast or lose the viewer's interest. On the other hand, a page displaying an account holder's account information is accessed at an order of magnitude lower rate, so it can afford heavier queries. (Note: watch out for contention.)
Once the feature set has been decided upon, take the application apart and find the bottlenecks. Highlight the pieces that will not scale.
This can be figured out by filling your dev database with bogus data and running mysqlslap on your common SELECT queries to test stability, response time, throughput, etc.
Take into account how scale can be achieved before changing code:

- Is the hardware adequate to run all my services?
- Can I split some services out, e.g. move WWW-specific traffic to a separate box from the database? (You'd be surprised how many people combine the servers.)
- Am I allocating MySQL resources properly? If not, how can I tell?
- What does my system report as the scarce resource? Am I running out of CPU or I/O?
The real performance gains come from laying out the data properly and reducing the number of sorts, ranges, and joins needed to generate your dynamic pages.
In some cases I've had to add tables where the data in those tables represents start points for ranges in other tables. Then, on top of that, I spread that core logic across MANY individual machines just to get a page to load in less than 300 ms.
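The start-point-table idea can be sketched like this. Everything here is hypothetical (table names, schema) and sqlite3 again stands in for MySQL: a small helper table records where each user's rows begin in a large table, so a page query can seek straight to its range instead of scanning or sorting the whole thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, user INTEGER, payload TEXT);
-- Small helper table: the first event id per user, i.e. a precomputed
-- start point for ranging into the big events table.
CREATE TABLE user_start (user INTEGER PRIMARY KEY, first_id INTEGER);
""")

# Bogus data: events stored grouped by user in insertion order.
rows = [(u, f"event {n} for user {u}") for u in range(3) for n in range(4)]
conn.executemany("INSERT INTO events (user, payload) VALUES (?, ?)", rows)

# Precompute each user's start point once, instead of sorting per page view.
conn.execute(
    "INSERT INTO user_start SELECT user, MIN(id) FROM events GROUP BY user"
)

def events_for(user):
    # Seek straight to this user's start point, then range forward.
    (first_id,) = conn.execute(
        "SELECT first_id FROM user_start WHERE user = ?", (user,)
    ).fetchone()
    return conn.execute(
        "SELECT payload FROM events WHERE id >= ? AND user = ?",
        (first_id, user),
    ).fetchall()
```

The page query becomes a primary-key range starting at a known id, which is the kind of access pattern that stays fast as the big table grows.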
Finally, add Nagios to alert you when things go wrong (such as too many processes running on a database server), and Ganglia to chart system resources over time.
- So, to answer the question: should I use complex joins at first?
- Sure, but here is a rule of thumb: the heavier the query, the less often it can be run.
- If traffic ramps up, look at what the heaviest queries are and how frequently they run. Try to reduce their execution times by spreading the data across more tables or more servers, but only if benchmarks show that this is the way to go.
- If traffic really ramps up because the application is the next Flickr, then look at the larger picture of what should be done to scale.
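"Heaviest query × frequency" is what actually decides where to spend effort, so it helps to rank queries by total time consumed rather than by single-run cost. A small sketch with invented numbers (as you might collect them from benchmarks or the slow-query log; none of these names or figures come from a real system):

```python
# Hypothetical measurements: mean execution time (ms) and calls per minute.
queries = {
    "photo_page":   {"ms": 12,  "per_min": 900},
    "account_page": {"ms": 340, "per_min": 40},
    "search":       {"ms": 95,  "per_min": 220},
}

# Rank by total time spent per minute: cost = mean time x frequency.
ranked = sorted(
    queries.items(),
    key=lambda kv: kv[1]["ms"] * kv[1]["per_min"],
    reverse=True,
)
for name, q in ranked:
    print(f"{name}: {q['ms'] * q['per_min'] / 1000:.1f} s/min")
```

Note how the fast-but-frequent and slow-but-rare queries can both lose to a middling query that is run often; that is the one worth splitting across tables or servers first.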