Tuesday, December 11, 2007

Distributed Locking Solving it with mySQL

At Flickr we use a Federated Architecture, where the WHOLE of the data is spread across many servers. This means that when you load a photo page the dynamic data generated is hitting a server that holds that data, while another photo page view may hit another server for content.

Now to do complex operations to control the flow of what operates on data, the complex operation must lock the data. Let's call this a global lock, which is used to coordinate many jobs. Now to make these operations fast, lets run many processes across many servers. Essentially we needed to solve a hard problem in parallel computing: distributed locking.

This took a bunch of dedicated time to get done properly. We solved the issue by using 3 database clusters.

First I used the coolest mysql feature out there; GET_LOCK and IS_FREE_LOCK. These two mysql function are the heart of the distributed lock.

Why use them both?

GET_LOCK will lock a string in MEMORY on mysql, and anything else trying to get a lock on that string will have to wait until TIMEOUT is reached.

Since the TIMEOUT is in seconds, I don't want to wait and slow down everything, so we call IS_FREE_LOCK first which will return in a fraction of a ms


How do we use it?

We lock an entire Account classification by setting the lock on the owners Cluster.
An account classification is defined as ACCOUNT_123_TASK - which is the string for the lock, this is hashed to the same SERVER where the account is located.

Once the lock is established check another cluster to ensure that we have not performed the action already-this ensures that the event that entered the system doesn't corrupt data if the message to process made it into the system more then once.

Finally operate on the data, if any of the operations fail ROLLBACK all changes, log and continue.

The cool thing about this is that the locks and data are all in the same place. So, if the lock is not achievable then the data is unaffected.

5 comments:

Anonymous said...

If you do:

IS_LOCK_FREE(lock)

GET_LOCK(lock)

... you have a race condition, I'm pretty sure?

I think you mean:

GET_LOCK(lock, 0)

which will return instantly if the lock is lost, and is atomic.

Dathan Pattishall said...

I should of tested that. I thought that GET_LOCK(lock,0) would block, until the lock is free, but it does not. I will change this now.

Anonymous said...

You mean it's IS_FREE_LOCK as you have a typo by calling it IS_LOCK_FREE.

Dathan Pattishall said...

ha your right

Unknown said...

IS_LOCK_FREE and GET_LOCK works fine in a cluster enviroment?
I want to know if the get_lock get a lock over all servers in the cluster or only in the server who perform the query.
Thanks