Thursday, November 09, 2006

Unorthodox approach to database design Part 2:Friendster

Diagram of Friendster backend

Above is a diagram of the Friendster backend at the time I was there. The entire presentation can be found here at mysqluc.com. Due to all the turmoil at Friendster, even with the coolest and most competent VP of Engineering in the business Jeff Winner we could not push my intended design. This design is now used at Flickr with huge success.

What I wanted to do was introduce a system that allowed Friendster to partition the user data. This was the same system that I started to implement at AuctionWatch in 1999 and was not able to complete.

Each 64bit AMD server would house 500K distinct users and all their data. I provided a system to increase capacity, scale linearly on a fraction of the servers, at the fraction of the cost. There would be no need to cache any data, and everything could be fetched real time. In fact when I was discussing this with Jeff who loved the idea and saw the potential in walks Kent the CFO (now the 5th CEO) and sees 500K users on the white board. So, Kent goes hey what's a Sook user? We all chuckled and that became the project name, project 500k.

Now here was my argument. What Friendster was using was simple old replication. Replication is a batch oriented system, when SQL is applied on the master it's logged to a file, slaves then read that file (think of tailf) then takes the SQL that is written and applies that same SQL on a slave server. When the application runs out of read capacity, the solution is to add more slave servers. In this model there is an inherent limitation. Below is a list of some:

  1. There is a single point of failure on the master

  2. ALL write SQL operation done to the master must be replicated to the slaves

  3. When IO bandwidth is low replication lags, causing slave lag



Now many people to get around these issues build other masters to take over or come up with clever tricks in the application to get around these problems.

I wanted to solve the issue without a trick. What was very expensive for Friendster was the user generated data. There was a crap load of it, about 500GB of it at the time. Replicating all that data was the cause of my heart-ache. Since the application was very dynamic and required huge ranges, I/O bandwidth was cut on every range query causing replication lag.

Project S00k solved these issues.

    Here was the argument:
  • Split up user data


With a small amount of data web developers can do full table scans at a fast rate with a high concurrency if the data is small enough. So, bad code could go out and not take down the site.
On top of that, instead of replicating to an INNODB datafile that is 500GB that is very expensive to support by having multiple copies of, let’s only deal with a datafile or database size that is a few gigs in size.
Why is this better?

  • The data is faster to backup

  • The data is faster to recover

  • The data can fit into memory

  • The data is easier to manage


Splitting up the user data so that User A exists on one server while User B exists on another server, each server now holds a shard of the data in this federated model.


  • Provide High Availability


A problem at Friendster was if the master goes down, the site goes down. To solve this:

  • Take each S00K Shard and put it in a replication ring. This is now a backup of the Shard of data.

  • If one server out of the Shard pair goes down so what keep writing to the other.

  • If both go down only a small amount of users are affected, show them the outage notification while other users are still able to use the site.



  • Provide more write bandwidth


  • In single-master many-slave architecture write bandwidth is throttled, in federation you can write to all the masters and read from them.


  • Really use replication for what it's design is for


  • Replication is a batch system it’s IMHO designed for backup of the database. Each server of the shard replicates it’s data to it’s pair. To remove effects of replication lag, stick the shard viewer to a single server in the ring. All operations are done there. So, if there is replication lag, or replication is broken users would not notice. Read and write to the same place.


    But, because of the environment in Friendster getting this done was not possible. So, I worked on the same old system and we where never able to solve the root of the capacity issues.


    Next: Flickr 5 min talk about this design and we start to work on implementing it.

    Monday, November 06, 2006

    IO schedulers matter

    I've done a multitude of benchmarks using various 2.6 IO schedulers. Hands down the Deadline I/O scheduler is the best for INNODB traffic or RANDOM IO. I use to have all this benchmark information in excel worksheets, but lost it when I left Friendster.

    Here is how to figure out what IO scheduler your using in Linux 2.6

    dmesg |grep -i sched

    In most cases your probibly set up to use the cfq io scheduler. Change it to deadline in your PXE, lilo, or grub settings.


    For example:

    # For booting GNU/Hurd
    title GNU/Hurd
    root (hd0,0)
    kernel /boot/gnumach.gz root=hd0s1
    module /boot/serverboot.gz

    append to it
    kernel /boot/wtfe root=wtfe elevator=deadline

    Thanks Peter N. for the linux config info!