In a web environment where databases are connected to on the fly, each connection should take milliseconds when the network is healthy. However, when a database server crashes, the timeout for each connection takes seconds, and there are cases where connections can just hang (INNODB recovery, for example). Ever had an ssh session hang where even a kill (SIGHUP, etc.) does nothing? The same thing happens to mySQL client connections.
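Most client libraries let you at least bound that hang yourself. Here is a minimal sketch in Python, assuming PyMySQL and a hypothetical host db01.internal; the 2-second budgets are illustrative, not a recommendation:

    import pymysql

    # Bound every phase of the connection so a dead or hung server
    # fails in about 2 seconds instead of tying up an httpd worker.
    def get_conn(host):
        return pymysql.connect(
            host=host,
            user='app',            # hypothetical credentials
            password='secret',
            database='shard',
            connect_timeout=2,     # TCP connect + handshake budget
            read_timeout=2,        # per-query read budget
            write_timeout=2,
        )

    try:
        conn = get_conn('db01.internal')
    except pymysql.err.OperationalError:
        # No route to host, connection refused, or timed out:
        # fail fast and serve degraded content instead of hanging.
        conn = None

Even with that in place, the timeout still burns seconds per request, which is exactly what I want to avoid.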
I want the front ends to recover gracefully from a no-route-to-host or, more importantly, a hung-connection condition. I want to avoid that damn timeout altogether on stateless connections. Why? Because if you use up all your worker httpd threads, requests fail, things crash, bosses get mad, everyone wakes up, and you have a meeting about it later. So what are the ways around this?
Use a Load balancer
Use a memcache layer to keep DB state
Use IRON DNS:
For an internal network, all of the database host entries should fit in memory. Every resolve should take less than 2ms, and updates to DNS entries can propagate across your entire farm in just a few seconds. The only drawback is that if the DNS server fails to respond, your site is boned. Making DNS failure-proof is rather easy, which is another post entirely.
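As a sanity check against that budget, you can also cap the resolve itself, so a dead DNS server costs you milliseconds instead of the default multi-second retry cycle. A sketch using dnspython; the hostname and the 500ms cap are assumptions on my part:

    import time
    import dns.exception
    import dns.resolver  # dnspython 2.x

    resolver = dns.resolver.Resolver()
    resolver.lifetime = 0.5  # hard cap on the whole resolve

    start = time.monotonic()
    try:
        answer = resolver.resolve('db01.internal', 'A')
        ip = answer[0].address
    except (dns.resolver.NXDOMAIN, dns.exception.Timeout):
        ip = None  # DNS is down or the name is gone: trip the failure path now

    print(ip, '%.2f ms' % ((time.monotonic() - start) * 1000))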
For my solution, I am using DNS. This lets me recover a shard's servers onto different boxes without having to change code. It also lets my environment recover from blocked-I/O events that spike load on the front end and make requests slow.
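Here is a minimal sketch of the pattern, again in Python with PyMySQL; the shard names and the modulo placement scheme are hypothetical stand-ins for whatever your sharding logic is:

    import pymysql

    # Logical shard -> DNS name. Ops repoints a record when a box
    # dies or a shard moves; the application code never changes.
    SHARDS = {
        0: 'shard00.db.internal',  # hypothetical names
        1: 'shard01.db.internal',
    }

    def shard_conn(user_id):
        host = SHARDS[user_id % len(SHARDS)]
        # Short timeouts again: if the box behind the name is hung,
        # fail fast and let a DNS flip (seconds away) fix the routing.
        return pymysql.connect(host=host, user='app', password='secret',
                               database='shard', connect_timeout=2,
                               read_timeout=2, write_timeout=2)

Keep the TTL on those records down at a few seconds so new connections pick up a repointed name almost immediately.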
Here are some conditions where connections take too long:
- No route to host
- INNODB recovery
- Disk fails (disk fills up)
- Switch fails / flaps
- Plug-in storage engine crashes, yet mySQL is up
- DNS resolve fails
If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.
2 comments:
OMFG, He rox!
I like the idea. One other thing you could check is the number of long-running queries. Sometimes really simple SELECTs here hang forever, more and more pile up, etc. We have a non-stampede mechanism in our memcached, but it would be cool if the server would turn itself off in these situations...
One other idea: why don't you put a TCP load balancer on each DB machine, right in front of the database? Then you don't have additional SPOF issues, just your regular DB-failover mechanisms. You can even have it running on the default mySQL port.