One of the main reasons a dev may want to connect on the fly is having too many front ends to hold persistent connections to the back ends. Since MySQL does not use libevent, every open connection holds a thread, which is much more costly. Threads == Memory.
But that's neither here nor there. The main purpose of this post is to talk about how to recover from failed connections that block Apache threads.
Common Failures:
No route to host
Flapping NIC
Locked tables
Recovering from a crash
...and more of the same.
My Environment:
I have a bunch of webservers (200+), each with 300 possible threads (60,000 possible connections to a single DB), behind a load balancer that uses least-connections to distribute load across all 200+ webservers.
Since PHP is used in a stateless mode (no guarantee that the same user will hit the same server), I have to have some way of telling all the other Apache procs on a given box that a DB server is down. I use stateless on-the-fly connections, so each Apache proc will test the connection.
Herein lies my most common problem: if a box on the backend dies, all Apache threads will block for a predefined timeout. At the default 60-second timeout, each blocked thread ties up an Apache slot for up to a minute, so one dead DB can pin all 300 threads on a webserver.
In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set
mysql.connect_timeout = 5   ; the default is 60
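If you can't touch php.ini directly, the same setting should also work at runtime (mysql.connect_timeout is PHP_INI_ALL, if memory serves), e.g.:
# Per-request override of the MySQL connect timeout.
ini_set('mysql.connect_timeout', '5');
echo ini_get('mysql.connect_timeout');   # sanity check: prints 5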
In my common_db class, when connecting to a database, the connect routine returns a database handle object. Timing the connect gives $delta_to_connect (microtime() around the connect call, sketched here with PEAR DB):
$start = microtime(true);
$dbh = DB::connect($dsn);    # or however your connect is done
$delta_to_connect = microtime(true) - $start;

if (PEAR::isError($dbh) && $delta_to_connect >= 5){
    #
    # mark ip as dead for 15 min
    #
    apc_store($ip, array('DEAD'), 900);
    return false;
}
return $dbh;
Now, PRIOR to calling the database connect code, I check to see if the IP is up:
$status = apc_fetch($ip);    # false on a cache miss
if (is_array($status) && $status[0] === 'DEAD'){
    return false;
}
... do connect ...
But there could be a variety of issues that cause false positives: network flaps, someone locking up the DB for a while, etc. So I allow roughly one request in a thousand (0.1%) to bypass the status check and try again:
if (is_array($status) && $status[0] === 'DEAD' && (mt_rand(1, 1000) != 1)){
    return false;
}
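Putting those two pieces together, here's a minimal sketch of the whole gate as one helper (the function name db_ip_is_usable is mine for illustration, not the actual common_db code):
# Hypothetical helper: APC dead-check plus the 0.1% retry lottery.
function db_ip_is_usable($ip) {
    $status = apc_fetch($ip);               # false on a cache miss
    if (is_array($status) && $status[0] === 'DEAD') {
        # let one request in a thousand through to re-test the box
        return (mt_rand(1, 1000) == 1);
    }
    return true;                            # no DEAD marker, go connect
}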
But if your app can't connect to the DB, aren't you down anyway?
Yes, although I don't have to restart all the httpds.
Also, each DB has a redundant pair, so when the connect code returns false, the app code will try the other set of servers.
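A rough sketch of that failover, assuming a hypothetical get_dbh($ip) wrapper around the connect-and-mark-dead routine above (the names here are illustrative, not the real app code):
# Try the primary, then fall back to its redundant pair.
function connect_with_failover($primary_ip, $secondary_ip) {
    foreach (array($primary_ip, $secondary_ip) as $ip) {
        if (!db_ip_is_usable($ip)) {
            continue;                       # marked DEAD in APC, skip
        }
        $dbh = get_dbh($ip);                # hypothetical connect wrapper
        if ($dbh !== false) {
            return $dbh;                    # first healthy box wins
        }
    }
    return false;                           # both sides are down
}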
I have a variety of methods to deal with these scenarios, but this is the quickest to implement.
2 comments:
You could always go distributed with something like memcached and then have a global list of servers that are up/down at any given time, too :)
I tried that first. One, because messing with the APC cache is scary.
Two, because I love memcache (it's sick, I know).
But then my pro-APC side had this:
1. APC is much faster; there's no network latency.
2. I didn't want a single source of truth. If a box is down, all the independent tests across the cluster will come to a consistent result on their own, and a false positive from one central source can be a killer.
3. If the memcache server(s) holding this info had a net bump, that could lock up the httpd servers, since every webserver would hit that single memcache instance on every function call that needs to talk to the DB. It would still work with some tricks to remove the SPOF, but then the SPOF just moves to the setter: if that is unable to set, then you're dead.
With some more fiddling I'm sure it would work, but I asked myself: what is the quickest way to stay up?
This is what I came up with.