Thursday, April 23, 2009

Restoring a backup of a Circular Replication config

So say you have two mySQL servers called A and B. A and B are in a circular replication ring meaning

A replicates from B
B replicates from A

In addition, log-slave-updates is on, so if one server of either A or B goes down you can recover all the data from either side.

Therefore, here are some steps.
Take a snapshot of a server in the ring, with your favorite backup tool (mine is ibbackup)

When either server A or B dies, restore from said backup. Here are the steps. Let us assume server A goes down, and backups are done from server B.

restore the backup of server B to server A

On server-A turn off log-slave-updates (you will see why on the next line)
Add replicate-same-server-id on server A to my.cnf

CHANGE MASTER TO to the position and file of the binary log that is reported from your tool on server-A.

IF the binary log did not get corrupted on server-B your cool, else flip the binary log to the next binary log from server-A on server-B.

Let server-A catch up
Stop server-A
Add log-slave-updates on server-A
remove replicate-same-server-id on server A
Start server-A

done.

Wednesday, April 08, 2009

Recover from No-Route to Host or Storage Engine Plugin Crashes

There are a finite set of cases where mySQL clients will hang on a connection-for a small period of time (seconds) or indefinitely. Most notably no-route to host causes a timeout to occur which in most clients are on the order of seconds to recovery.

In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too.

I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?

Use a Load balancer
This is cool and all, but is expensive since two or three are needed for HA purposes. In addition, this increases the complexity of managing servers, and most LBs are optimized for HTTPD traffic, not raw TCP traffic.

Use a memcache layer to keep DB state

This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure-and a lot of time is spent to make this resilient. Not to mention all clients across Programming languages (bash scripts, Java, PHP, Python, etc) need to have access to this memcache layer and logically handle the connection the same. This is doable, but there is an easier way.

Use IRON DNS:
IRON DNS is a term I use to build an HA Resolver. IF a box fails a health check, a nagios event handler can tell IRON DNS to set the domain names IP address to 127.0.0.1 or a routable interface that produces a Connection Refused (111 instead of 110).


For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.

For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.

    Here are some conditions where connections take to long:
  • No Route to Host
  • INNODB recovery
  • Disk Fails (disk fills up)
  • Switch Fails / Flaps
  • Plug-in storage engine crashes, yet mySQL is up
  • DNS Resolve fails


If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.