Wednesday, April 08, 2009

Recover from No-Route to Host or Storage Engine Plugin Crashes

There is a finite set of cases where mySQL clients will hang on a connection, either for a short period of time (seconds) or indefinitely. Most notably, a no-route-to-host condition causes a timeout, which in most clients takes on the order of seconds to recover from.

In a web environment, if a database is connected to on the fly, each connection should take milliseconds when the network is healthy. However, when a database server crashes, the timeout for each connection takes seconds, and there are cases where connections can just hang (InnoDB recovery, for example). Have you ever had an ssh session hang where even a kill (SIGHUP, etc.) does not work? The same thing happens to mySQL client connections.

I want the front ends to recover gracefully from a no-route-to-host condition or, more notably, a hung connection. I want to avoid that damn timeout altogether on stateless connections. Why? Because if you use up all your worker httpd threads, requests fail, things crash, bosses get mad, everyone wakes up, and you have a meeting about it later. So what are ways around this?
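The common thread in the failure modes above is an unbounded connect. A minimal sketch of the client-side defense, using a raw TCP probe in Python (the host and port here are placeholders, and a real driver would speak the mySQL protocol after the handshake):

```python
import socket

def can_connect(host, port=3306, timeout=0.5):
    """Probe a database host with a hard connect timeout so a dead or
    unroutable box fails fast instead of tying up a worker thread."""
    try:
        # create_connection applies the timeout to the TCP handshake
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:  # covers timeout, connection refused, no route to host
        return False
```

Most real drivers expose the same knob (for example, the `connect_timeout` option in the mySQL client library), but the default is usually far too generous for a web front end.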

Use a Load balancer
This is cool and all, but it is expensive, since two or three are needed for HA purposes. In addition, it increases the complexity of managing servers, and most LBs are optimized for HTTP traffic, not raw TCP traffic.

Use a memcache layer to keep DB state

This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure, and a lot of time gets spent making it resilient. Not to mention that all clients across programming languages (bash scripts, Java, PHP, Python, etc.) need access to this memcache layer and have to handle the connection logic the same way. This is doable, but there is an easier way.

IRON DNS is a term I use for building an HA resolver. If a box fails a health check, a nagios event handler can tell IRON DNS to point the domain name's IP address at a routable interface that produces Connection Refused (errno 111) instead of a timeout (errno 110).
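The 111-versus-110 distinction is the whole trick: a refused connection fails in milliseconds, while a black-holed route burns the full timeout. A small sketch that classifies which failure a client would see (port and timeout values are illustrative):

```python
import errno
import socket
import time

def classify_failure(host, port=3306, timeout=3.0):
    """Return ('ok'|'refused'|'timeout', elapsed_seconds).

    A name repointed at a refusing interface fails with ECONNREFUSED
    (errno 111) almost instantly; an unreachable or black-holed box
    burns the whole timeout before surfacing ETIMEDOUT (errno 110).
    """
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return ("ok", time.monotonic() - start)
    except socket.timeout:
        return ("timeout", time.monotonic() - start)
    except OSError as exc:
        if exc.errno == errno.ECONNREFUSED:
            return ("refused", time.monotonic() - start)
        raise
```

The "refused" branch is the one IRON DNS manufactures on purpose: the worker thread gets its error immediately and moves on.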

For an internal network, all database entries should fit in memory. All resolves should take less than 2 ms, and updates to DNS entries can propagate in just a few seconds across your entire farm. The only drawback is that if the DNS server fails to respond, your site is boned. Making DNS failure-proof is rather easy, which is another post entirely.

For my solution, I am using DNS. This allows me to recover servers in a shard onto different boxes without having to change code. It also allows my environment to recover from blocked I/O events that can spike load on the front end and make requests slow.
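The no-code-change property comes from clients resolving the shard name fresh at connect time rather than caching an IP. A sketch of that lookup (the name and port are placeholders; real clients get this for free from their resolver library):

```python
import socket

def resolve_shard(name, port=3306):
    """Resolve a shard's name at connect time so an IRON DNS remap
    (name -> replacement box, or -> refusing interface) takes effect
    on the very next connection, with no client code change.

    getaddrinfo consults the resolver on each call (subject to any
    local caching daemon) and returns the first record's address.
    """
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]
```

Paired with a low TTL on the internal zone, every new connection across the farm follows the remap within seconds.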

Here are some conditions where connections take too long:
  • No Route to Host
  • InnoDB recovery
  • Disk Fails (disk fills up)
  • Switch Fails / Flaps
  • Plug-in storage engine crashes, yet mySQL is up
  • DNS Resolve fails
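All of these conditions reduce to the same question the health check must answer: can a fresh connection complete quickly? A minimal sketch of a nagios-style check whose CRITICAL result would fire the event handler that tells IRON DNS to repoint the name (host, port, and thresholds are hypothetical):

```python
import socket

OK, CRITICAL = 0, 2  # nagios plugin exit codes

def check_mysql(host, port=3306, timeout=1.0):
    """Return OK if a TCP connect to the database completes within the
    timeout, CRITICAL otherwise. A CRITICAL result is what triggers
    the event handler to repoint DNS at a refusing interface."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return OK
    except OSError:
        return CRITICAL
```

Note that a bare TCP check will not catch the case where mySQL accepts connections but a plug-in storage engine has crashed; for that, the check would need to run an actual query against a table in the affected engine.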

If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.


Anonymous said...

OMFG, He rox!

Jan said...

I like the idea. One other thing you could check is the number of long-running queries. Sometimes really simple SELECTs hang forever here, more and more pile up, etc. We have a non-stampede mechanism in our memcached, but it would be cool if the server would turn itself off in these situations...

One other idea: why don't you put one TCP load balancer on each db machine, right in front of the db? Then you don't have additional SPOF issues, just your regular DB-failover mechanisms. You can even have it running on the default mySQL port.