At Flickr we use a Federated Architecture, where the WHOLE of the data is spread across many servers. This means that when you load a photo page the dynamic data generated is hitting a server that holds that data, while another photo page view may hit another server for content.
Now to do complex operations to control the flow of what operates on data, the complex operation must lock the data. Let's call this a global lock, which is used to coordinate many jobs. Now to make these operations fast, lets run many processes across many servers. Essentially we needed to solve a hard problem in parallel computing: distributed locking.
This took a bunch of dedicated time to get done properly. We solved the issue by using 3 database clusters.
First I used the coolest mysql feature out there; GET_LOCK and IS_FREE_LOCK. These two mysql function are the heart of the distributed lock.
Why use them both?
GET_LOCK will lock a string in MEMORY on mysql, and anything else trying to get a lock on that string will have to wait until TIMEOUT is reached.
Since the TIMEOUT is in seconds, I don't want to wait and slow down everything, so we call IS_FREE_LOCK first which will return in a fraction of a ms
How do we use it?
We lock an entire Account classification by setting the lock on the owners Cluster.
An account classification is defined as ACCOUNT_123_TASK - which is the string for the lock, this is hashed to the same SERVER where the account is located.
Once the lock is established check another cluster to ensure that we have not performed the action already-this ensures that the event that entered the system doesn't corrupt data if the message to process made it into the system more then once.
Finally operate on the data, if any of the operations fail ROLLBACK all changes, log and continue.
The cool thing about this is that the locks and data are all in the same place. So, if the lock is not achievable then the data is unaffected.