Thursday, November 09, 2006

Unorthodox approach to database design Part 2: Friendster

Diagram of Friendster backend

Above is a diagram of the Friendster backend at the time I was there. The entire presentation can be found here at mysqluc.com. Due to all the turmoil at Friendster, even with the coolest and most competent VP of Engineering in the business, Jeff Winner, we could not push my intended design through. This design is now used at Flickr with huge success.

What I wanted to do was introduce a system that allowed Friendster to partition the user data. This was the same system that I started to implement at AuctionWatch in 1999 but was not able to complete.

Each 64-bit AMD server would house 500K distinct users and all their data. I proposed a system to increase capacity and scale linearly on a fraction of the servers, at a fraction of the cost. There would be no need to cache any data; everything could be fetched in real time. In fact, while I was discussing this with Jeff, who loved the idea and saw the potential, in walks Kent the CFO (now the 5th CEO) and sees 500K users on the whiteboard. So Kent goes, "Hey, what's a Sook user?" We all chuckled, and that became the project name: Project 500K.

Now here was my argument. What Friendster was using was plain old replication. Replication is a batch-oriented system: when SQL is applied on the master, it's logged to a file; slaves then read that file (think of tail -f), take the SQL that was written, and apply that same SQL on the slave server. When the application runs out of read capacity, the solution is to add more slave servers. This model has some inherent limitations:

  1. There is a single point of failure on the master

  2. ALL write SQL operations done on the master must be replicated to the slaves

  3. When I/O bandwidth is low, replication lags, causing slave lag
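To make the batch nature concrete, here is a minimal, purely illustrative sketch of statement-based replication (the class and method names are mine, not MySQL's): the master appends every write to a log, and a slave replays that log at its own pace, so lag is simply the unread tail of the log.

```python
# Toy model of statement-based replication. The master logs every write;
# each slave tails that log (think tail -f) and re-applies the statements,
# trailing behind by whatever it hasn't read yet.

class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []          # append-only log of write statements

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))

class Slave:
    def __init__(self, master):
        self.master = master
        self.data = {}
        self.pos = 0              # how far into the binlog we've replayed

    def replicate(self, batch=1):
        # Apply the next `batch` logged statements from the master's binlog.
        for key, value in self.master.binlog[self.pos:self.pos + batch]:
            self.data[key] = value
            self.pos += 1

    def lag(self):
        # Statements written on the master but not yet applied here.
        return len(self.master.binlog) - self.pos

master = Master()
slave = Slave(master)
master.write("user:1", "Dathan")
master.write("user:2", "Jeff")
slave.replicate(batch=1)          # slave has only replayed one statement
print(slave.lag())                # -> 1
```

When range queries eat the slave's I/O budget, `replicate` effectively runs with a smaller batch, and the lag only grows.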



Now, to get around these issues, many people build other masters to take over or come up with clever tricks in the application.

I wanted to solve the issue without a trick. What was very expensive for Friendster was the user-generated data. There was a crapload of it, about 500GB at the time. Replicating all that data was the cause of my heartache. Since the application was very dynamic and required huge range queries, I/O bandwidth was consumed on every range query, causing replication lag.

Project S00K solved these issues.

    Here was the argument:
  • Split up user data


If the data is small enough, web developers can do full table scans at a fast rate with high concurrency. So, bad code could go out and not take down the site.
On top of that, instead of replicating to a 500GB InnoDB datafile that is very expensive to support (since you have to keep multiple copies of it), let's only deal with a datafile or database that is a few gigs in size.
Why is this better?

  • The data is faster to backup

  • The data is faster to recover

  • The data can fit into memory

  • The data is easier to manage


Split up the user data so that User A exists on one server while User B exists on another; each server now holds a shard of the data in this federated model.
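As a rough sketch of how such a federated lookup could work (all names here are hypothetical, not Friendster's or Flickr's actual code), a small global directory maps each user to the shard that owns all of that user's data:

```python
# Hypothetical federated user lookup: a tiny global directory maps each
# user id to the shard (server) that owns everything about that user.

SHARDS = {
    1: "db-shard-1.example.com",
    2: "db-shard-2.example.com",
}

USER_DIRECTORY = {}                 # user_id -> shard_id, the only global state

def assign_user(user_id):
    # Place new users on the least-populated shard.
    counts = {sid: 0 for sid in SHARDS}
    for sid in USER_DIRECTORY.values():
        counts[sid] += 1
    shard_id = min(counts, key=counts.get)
    USER_DIRECTORY[user_id] = shard_id
    return shard_id

def lookup_shard(user_id):
    # Every query for a user's data goes to exactly one small database.
    return SHARDS[USER_DIRECTORY[user_id]]

assign_user(42)
assign_user(43)
print(lookup_shard(42))   # -> db-shard-1.example.com
```

The point is that each shard stays a few gigs in size, so full scans, backups, and recovery stay cheap per server.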


  • Provide High Availability


A problem at Friendster was if the master goes down, the site goes down. To solve this:

  • Take each S00K shard and put it in a replication ring. This gives you a backup of that shard of data.

  • If one server in the shard pair goes down, so what? Keep writing to the other.

  • If both go down, only a small number of users are affected; show them the outage notification while other users are still able to use the site.
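The failover rule above can be sketched in a few lines (the helper names are hypothetical, for illustration only):

```python
# Each shard lives on a pair of servers in a replication ring. If one is
# down, keep using the other; only when both are down does that shard's
# small slice of users see the outage page, and everyone else is unaffected.

def pick_server(pair, is_up):
    """Return a live server from the shard pair, or None if both are down."""
    for server in pair:
        if is_up(server):
            return server
    return None   # outage, but only for the users on this shard

status = {"shard3-a": False, "shard3-b": True}   # one side of the pair is down
print(pick_server(("shard3-a", "shard3-b"), status.get))   # -> shard3-b
```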



  • Provide more write bandwidth


  • In a single-master, many-slave architecture, write bandwidth is throttled; in a federation you can write to all the masters and read from all of them.


  • Really use replication for what it's designed for


  • Replication is a batch system; IMHO it's designed for backing up the database. Each server in the shard replicates its data to its pair. To remove the effects of replication lag, stick the shard viewer to a single server in the ring and do all operations there. So, if there is replication lag, or replication is broken, users would not notice. Read and write to the same place.
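A minimal sketch of this stickiness, with hypothetical names: each shard has one active server that takes both reads and writes, and failover simply repoints the shard to its ring partner.

```python
# "Read and write to the same place": every shard sticks to one active
# server in its replication ring, so replication lag to its pair never
# shows up to users. The pair is a hot backup; failover just repoints.

PAIRS = {"shard1": ("shard1-a", "shard1-b")}
ACTIVE = {"shard1": "shard1-a"}    # the single server all traffic sticks to

def server_for(shard):
    # Reads and writes both go here, so lag on the other ring member
    # (or even broken replication) is invisible to users.
    return ACTIVE[shard]

def fail_over(shard):
    # Promote the ring partner if the active server dies.
    a, b = PAIRS[shard]
    ACTIVE[shard] = b if ACTIVE[shard] == a else a

print(server_for("shard1"))    # -> shard1-a
fail_over("shard1")
print(server_for("shard1"))    # -> shard1-b
```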


    But, because of the environment at Friendster, getting this done was not possible. So, I worked on the same old system, and we were never able to solve the root of the capacity issues.


    Next: at Flickr, a 5-minute talk about this design, and we start to work on implementing it.

    18 comments:

    Anonymous said...

    Interesting.

    I understand LiveJournal split their users into different clusters of machines, and that the biggest problem with this is bringing data together from the separate machines for the "friends" page (which contains entries from several users).

    How does your design cope with queries that need info from more than one shard?

    Tom

    veganloveburger said...

    LJ split their data by user, so that my user lives on a different partition than yours, but this scheme has partitioned it by type. Therefore, all users are in one partition, with messages, photos, and testimonials in another.

    Thus bringing data together isn't a problem unless you're displaying different types of data, in which case I think it's relatively cheap to issue the appropriate query for each type & then sort them all together on the application side.
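The application-side merge described here can be sketched like this (assuming, hypothetically, that each per-type result set comes back already sorted by timestamp):

```python
# Issue one query per data type (each living in its own partition), then
# combine the results in the application. heapq.merge is cheap when each
# per-type result set is already sorted; tuples sort by timestamp first.

import heapq

# Hypothetical per-partition query results: (timestamp, type, payload).
messages     = [(1001, "message", "hi"), (1005, "message", "lunch?")]
photos       = [(1002, "photo", "beach.jpg")]
testimonials = [(1004, "testimonial", "great guy")]

timeline = list(heapq.merge(messages, photos, testimonials))
print([t[0] for t in timeline])   # -> [1001, 1002, 1004, 1005]
```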

    chris larson said...

    For our solution it made much more sense to use replication. Over 80% of our queries were READs, and we wrote our statistics (page views, etc.) to a totally different machine. So replicating data that isn't updated often is a solid solution.

    Chris - my two cents

    Dathan said...

    @ chris:

    What happens when your data consumes 800 GB of space?

    Would you replicate your data to many slaves?

    1TB slaves cost a lot of money, and at best you can get maybe 1% of the data in memory.

    If the application hardly writes and will always stay small, then replication is good enough.

    mojo jojo said...

    Interesting read.

    I've done some sharding of data across machines, but it was in a search engine and the storage engine was designed from scratch around it, so it was pretty easy to implement. And it could scale on cheap machines with small disks to huge capacities. This is maybe the best way to scale with cost in mind.

    I'm now thinking of splitting MySQL data across machines for another project, and one question comes to mind:

    Splitting by one table (users in this case) is obvious. But what about sharding on more than one axis? Suppose I have another huge table that I want to split, and it has to join with the users table. Off the top of my head, this sounds like it would make queries much much more complicated and make a whole mess out of the design. But I haven't really given it much thought yet.

    Have you done anything like that? What are your thoughts?

    Anonymous said...

    Very interesting, but one question.
    Suppose I have 10 MySQL backends and I split my users across them. A year later, we have to add 10 more backends; how can I redistribute the old data from the previous 10 backends across the new 20?

    Louis

    Mark said...

    Hmm... Quite an interesting read.

    jacksonh said...

    Very interesting post. I have to admit I am a little confused as to how you gather data for a whole bunch of different users, though, like on a user's friends page.

    Anonymous said...

    "Splitting by one table (users in this case) is obvious. But what about sharding on more than one axis? Suppose I have another huge table that I want to split, and it has to join with the users table."

    My guess is that this is why denormalization is mentioned prominently as one of the features of sharding -- you'd be merging those giant tables into one ginormous table, and then sharding it. You would have some considerable data redundancy in such a system, and you'd probably have to move relationship checking to code, not relations.

    n said...

    RE:"Splitting by one table (users in this case) is obvious. But what about sharding on more than one axis? Suppose I have another huge table that I want to split, and it has to join with the users table."

    What you are speaking to isn't necessarily a shard. Where you will get some real performance increase is when you split a table with many columns into two: one with the data that you query often, and the other with the data that you don't. This will reduce the overall row size, making it faster to scan, sort, join, or whatever. Along with selectively denormalizing your data, this can make a big difference.


    UK Contemporary Interior Design said...

    Interesting post. The biggest problem with LJ is the friends page, isn't it?



    Anonymous said...

    Still no response for the issue that requires queries across multiple shards?

    BTW, nice article, but your failure to utilize commas correctly forced me to read many sentences more than once to infer proper meaning. I suggest either getting an editor or reading this book:

    http://www.amazon.com/Eats-Shoots-Leaves-Tolerance-Punctuation/dp/1592400876




    Priya said...

    Nice post dude, keep it up.
