mySQL DBA, Architecture, Dev, Scale, HA, Code : October 2011

Tuesday, October 25, 2011

Handling the Hockey Stick Growth

The term hockey stick describes the growth curve of an app that suddenly goes viral. Take a look at the graph to the left. There is modest growth, then the app goes viral and takes off. The curve looks like a hockey stick, hence the term.

This article briefly touches on how to handle that sudden growth at the lowest possible cost, using a site that I helped build: schoolFeed.com, a social network that reconnects classmates for free.

The main features of schoolFeed.com are reconnecting classmates, ensuring that each classmate is well connected, a feed to keep classmates in touch with one another, and the interests that classmates share. Additionally, there is a photo experience to share with your online yearbook, with more features to come.

To handle the growth, enable rapid feature development, keep the site up without waking me up, and keep it all cheap, some structure needs to be put in place.

Suggestion #1: Keep the system architecture simple

The architecture consists of PHP on the front end, Memcache to front database queries, a database on the back end, Gearman as a queue service to handle offline processing in parallel, and finally SendGrid to handle mail.
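To make "Memcache to front database queries" concrete, here is a minimal cache-aside sketch. It is written in Python rather than the site's PHP, and `TinyCache`, `cached_query`, and the key name are hypothetical stand-ins for illustration, not the site's actual code. A real deployment would talk to a Memcache client instead of an in-process dictionary.

```python
import time


class TinyCache:
    """In-process stand-in for a Memcache client (an assumption for this sketch)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds=60):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_query(cache, key, run_query, ttl_seconds=60):
    """Return the cached result for key, hitting the database only on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()  # the expensive database call
        cache.set(key, value, ttl_seconds)
    return value


if __name__ == "__main__":
    cache = TinyCache()
    # Hypothetical query function; a real one would run SQL.
    classmates = cached_query(cache, "classmates:school:42", lambda: ["alice", "bob"])
    print(classmates)
```

The second call with the same key is served from the cache, which is what keeps read load off the database during a traffic spike.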

Suggestion #2: Keep the development environment simple

The development environment did not start off too abstracted. A simple MVC model is used, where the Model fronts the PDO database object structure. The Controller is the service layer, which follows the Front Controller design pattern, plus the PHP entry points that handle the model inputs. The View is in Smarty, because keeping the presentation layer separate from the business logic is pivotal. This View is also separated enough that Smarty can be replaced, or the strings internationalized, in the future. Finally, jQuery is used because it makes life so much simpler when supporting IE.

Suggestion #3: Monitor everything

I use Nagios (Icinga) for alerting, Ganglia for trending, and a custom stat system backed by MySQL for reporting on code interrupts, click-through rates, feature adoption, K-Factor, DAU per feature, MAU, WAU, Facebook Platform health, site response time, site API response time, and email send rate.
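The post does not say how the custom stat system computes its numbers, so the following is only a sketch using the common textbook definitions: K-Factor as invites per user multiplied by the invite conversion rate, and DAU/MAU as a stickiness ratio. The function names and sample figures are made up for illustration.

```python
def k_factor(invites_sent, inviting_users, signups_from_invites):
    """Viral coefficient: invites per inviting user times invite conversion rate."""
    if invites_sent == 0 or inviting_users == 0:
        return 0.0
    invites_per_user = invites_sent / inviting_users
    conversion_rate = signups_from_invites / invites_sent
    return invites_per_user * conversion_rate


def stickiness(dau, mau):
    """Share of monthly active users who are active on a given day."""
    return dau / mau if mau else 0.0


if __name__ == "__main__":
    # Hypothetical sample figures, not schoolFeed.com data.
    print(k_factor(invites_sent=5000, inviting_users=1000, signups_from_invites=1500))
    print(stickiness(dau=100_000, mau=400_000))
```

A K-Factor above 1 means each user brings in more than one new user on average, which is the condition that produces the hockey stick.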

Suggestion #4: Design every layer to be distributed.

If I run out of Apache threads, I add more www servers. If my Memcache eviction rate is too high, I add more Memcache servers. If I need more database transactions per second, I add more database servers. Each layer is controlled from a config file, which enables rapid deployment of servers to handle spikes in traffic. Since the database connection logic is controlled by the application, I drop a definition in an array and new traffic starts hitting a new database server. If an existing database server is overloaded and I need to move data off of it, I take an xtrabackup of the server, replicate it to a new server, set the pointer for a percentage of that traffic to the new server, and clean up the old data on the original server. Or I can migrate individual entities. An entity is a user, school, interest, URL, Facebook ID, etc.
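The "definition in an array" idea can be sketched as follows. This is a Python illustration under stated assumptions, not the site's PHP code: the server names, hosts, weights, and the `pick_server` helper are all hypothetical, and the hash-plus-weight scheme is just one way to send a percentage of entities to each server.

```python
import zlib

# Hypothetical server definitions; adding a server means adding an entry here.
DB_SERVERS = [
    {"name": "db1", "host": "10.0.0.11", "weight": 50},
    {"name": "db2", "host": "10.0.0.12", "weight": 50},
]

# Entities that were migrated individually are pinned to a named server.
ENTITY_OVERRIDES = {"school:42": "db2"}


def pick_server(entity_key, servers=DB_SERVERS, overrides=ENTITY_OVERRIDES):
    """Map an entity key such as 'user:1' to one server definition."""
    pinned = overrides.get(entity_key)
    if pinned is not None:
        for server in servers:
            if server["name"] == pinned:
                return server
        raise KeyError("override points at unknown server: " + pinned)

    total = sum(server["weight"] for server in servers)
    if total <= 0:
        raise ValueError("no server has a positive weight")

    # crc32 is stable across processes, so every web server agrees on the mapping.
    bucket = zlib.crc32(entity_key.encode("utf-8")) % total
    for server in servers:
        if bucket < server["weight"]:
            return server
        bucket -= server["weight"]
    raise RuntimeError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    print(pick_server("user:1")["name"])
    print(pick_server("school:42")["name"])
```

One caveat with this particular scheme: changing the weights remaps existing keys, so the data copy described above has to finish before the config change goes live.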

Suggestion #5: Don't optimize too soon.

The goal is to make each feature super fast, but building a super-abstracted layer to support 1000s of devs is unnecessary when you only have 10s of devs :). Please don't interpret this as me advocating being sloppy. I'm saying it's cool to allow your team to interact with SQL and write their own :). Additionally, building custom servers to handle specific tasks, or changing languages to get a specific feature, is really not necessary in the beginning. The focus should be on supporting the product and ensuring that features do not take more than 200ms to generate or weeks to build. That is what enables the hockey stick. In the early stage of a hockey stick, technology is rarely the cause of the growth. The cause is building what your users may want, and when you're wrong, throwing that stuff away and building what they actually want. A helpful technique is to build features so that the feature, or parts of it, can be turned off with a config change. This will save you a ton of headaches, since you can disable a feature without taking the site down, and it lets you push code out quickly and watch whether it is being adopted before you optimize it.
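Turning a feature, or part of a feature, off with a config change could look something like the sketch below. It is in Python for illustration, and the flag names, the nested layout, and the `feature_enabled` helper are assumptions, not the site's real config.

```python
# Hypothetical flag config; in production this would live in a deployed config file.
FEATURE_FLAGS = {
    "photo_yearbook": {"enabled": True, "parts": {"upload": True, "tagging": False}},
    "interests": {"enabled": False, "parts": {}},
}


def feature_enabled(name, part=None, flags=FEATURE_FLAGS):
    """Return True only if the feature (and the optional part) is switched on."""
    feature = flags.get(name)
    if feature is None or not feature.get("enabled", False):
        return False  # unknown or disabled features are treated as off
    if part is None:
        return True
    return bool(feature.get("parts", {}).get(part, False))


if __name__ == "__main__":
    if feature_enabled("photo_yearbook", "upload"):
        print("render upload button")
    if not feature_enabled("photo_yearbook", "tagging"):
        print("tagging hidden")
```

Defaulting unknown flags to off means a half-deployed feature stays hidden until someone explicitly enables it.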

Suggestion #6: Plan for things to break and set up procedures to handle outages

Things will break. The goal is to hide this fact from users, or to inconvenience them as little as possible. Schedule maintenance windows to fix the heavy stuff. Have a playbook to handle outages; if the play does not exist, write the play down. Finally, automate common tasks. Remember that making sure no user ever experiences an outage costs a lot of money. Redundancy is expensive. Multiple redundancy in multiple datacenters is even more expensive.

I hope these steps help you in your future projects. I have had the pleasure of handling multiple hockey sticks, and following a basic set of rules and suggestions has helped me each time. The end goal really is to give your users a great experience, build a clean environment for your devs with your devs' input, and improve the product rapidly.

Some stats: with 3-5 web servers, 2 job boxes, and 2 database servers, we are able to handle well over 100K DAU.

Thursday, October 20, 2011

Facebook should launch FBCloud and compete directly with Amazon, Google and others

This is another "Facebook should do X" post by someone outside of Facebook, but it’s a moneymaker that Facebook has not tried and probably has the best chance of succeeding at (not like deals har har - jab, jab). Some of the best DevOps people work at Facebook. Facebook knows scale. Facebook knows system management. Facebook built the most advanced data-center on the planet. This stuff is great because it shows that they can do it, but what’s the motivation for the app developer to deploy in a Facebook Cloud? Simply put: latency. This is the real issue for me. It is really a selfish desire to have my app move as fast as Facebook's apps while using Facebook data; a seamless integration, if you will. For Facebook, it’s good because they can help me make my users happier while making tons of cash. If my app is in the same data-center as the data, my app can move faster, thus giving my users a better experience.

Here is an example. When doing a graph call for Facebook friends, the backend systems can do it in milliseconds, yet the JSON reaches the caller in the 100ms time frame over the WAN from my servers in EC2-west1c to Facebook servers in Oregon. If I'm in the data-center that holds the data (Oregon), that call speeds up 10 to 20 times, since the 100ms R(t) turns into 5-10ms.
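The arithmetic behind that claim can be checked with a few lines. The 100ms and 5-10ms figures come from the example above; the number of sequential graph calls per page is a made-up assumption used only to show how the saving compounds.

```python
WAN_CALL_MS = 100          # round trip from EC2 to Facebook, per the example
LOCAL_CALL_MS = (5, 10)    # estimated round trip inside the same data-center


def speedup(before_ms, after_ms):
    """How many times faster a call becomes."""
    return before_ms / after_ms


def page_latency_ms(sequential_calls, per_call_ms):
    """Total time spent waiting on graph calls made one after another."""
    return sequential_calls * per_call_ms


if __name__ == "__main__":
    best, worst = LOCAL_CALL_MS
    print(speedup(WAN_CALL_MS, worst), speedup(WAN_CALL_MS, best))
    # Hypothetical page that makes 4 graph calls in sequence.
    print(page_latency_ms(4, WAN_CALL_MS), page_latency_ms(4, worst))
```

With several sequential calls per page, the gap between WAN and local latency is the difference between a page that feels sluggish and one that feels instant.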

Additionally, Facebook houses some of the most advanced tech that lots of people around the web use, such as Memcache. Facebook could manage that for you. In fact, they have petabytes of memory for their own app, with automatic key management between DCs (wow). They could offer Facebook-hosted MySQL with Flashcache for high-density IOPS. Each Facebook DB server has solid state disks; use those to buffer IOPS for developers subscribed to FBCloud. With their tools to automatically migrate data to another server, building new instances would be a snap without having to use a SAN, although you could use SSDs to buffer SAN writes/reads for easier management with great R(t). Facebook stats on HBase, Facebook Varnish, a fast CDN: everything that they do could be offered as a commercial product. I've seen their tools. They are better than enterprise quality, and nearly all of them have an API. Facebook culture is platform focused. I assume that if you don't build an API for your tool, you're mocked.

How would Facebook make money? Charge for CPU resources, just like Amazon. Charge for IOPS, charge for managed Memcache size, charge for data size. Charge for BCP. With this added to Facebook's platform, Facebook could make money on the front end from ads, on currency, and finally, indirectly, on the API, while giving the end user an entire platform guaranteed to be fast and redundant across multiple data-centers.

I have an entire vision that would make a ton of cash, but more importantly would make my users, and me as a developer, happier.

PS: This post went out too fast, with grammar and spelling mistakes. They should be fixed now. My apologies.