Tuesday, October 25, 2011

Handling the Hockey Stick Growth

The term hockey stick is used to describe the effect of an app that suddenly goes viral. Take a look at the graph to the left. There is modest growth and suddenly, the app goes viral and takes off. It looks like a hockey stick, ala the term.

This article is briefly going to touch on the points of how to handle the sudden growth at the lowest cost possible with a site that I helped build: schoolFeed.com-a social network that reconnects classmates for Free.

The main features of schoolfeed.com is reconnecting classmates, ensuring that each classmate is well connected, a feed to keep classmates in touch with one another and interests that each classmate share. Additionally there is a photo experience to share with your online yearbook and more features to come.

To handle the growth, enable rapid feature development, keep the site up without waking me up, and keeping it cheap means a set of structure needs to be put in place.

Suggestion #1: Keep the system architecture simple
The architecture consists of PHP on the front end, Memcache to front database queries, a database on the backend, a queue service-Gearman to handle offline processing in parallel; finally sendgrid to handle mail.
Suggestion #2: Keep the development environment simple
The development environment did not start off too abstracted. A simple MVC model is used where the Model fronts the PDO database objects structure. The Controller is the service layer which is a Front Controller design pattern, and the php entry points to handle the model inputs. The View is in smarty because keeping the presentation layer separate from the business logic is pivotal. Additionally this View is separated enough to replace smarty and or internationalize the strings in the future. Also JQUERY is used to make life so much simpler when supporting IE.
Suggestion #3: Monitor everything
I use Nagios for alerting (Icinga), Ganglia for Trending, and a custom stat system backed by mySQL for reporting on code interrupts, click through rates, feature adoption, K-Factor,  DAU per feature, MAU, WAU, Facebook Platform Health, site response time, site api response time, email send rate.
Suggestion #4: Design every layer to be distributed.
If I run out of apache threads, I add more www servers. If my memcache eviction rate is to high, I add more memcache servers. If I need more database transactions per second I add more database servers and each layer is controlled from a config file enabling rapid deployment of servers to handle spikes in traffic. Since the database connection logic is controlled by the application, I drop a definition in an array and new traffic starts hitting a new database server. If the existing database server is loaded to much and I need to move data off of it. I take a xtrabackup of the server replicate it to a new server, set the pointer for a % of that traffic to the new server and clear up the old data on the original server. Or I can migrate individual entities. An entity is a user/school/interest/url/facebook id/etc.

Suggestion #5: Don't optimize to soon.
The goal is to make each feature super fast, but building a super abstracted layer to support 1000s of devs is only necessary when you have 10s of devs :). Please don't interprete this as me advocating being sloppy-I'm saying its cool to allow your team to interact with SQL and write their own :). Additionally building custom servers to handle specific tasks, changing languages to get a specific feature is really not necessary in the beginning. Supporting the product and ensuring the features do not take more then 200ms to generate or weeks to build said feature. This should be the focus to enable the hockey stick. In the early stage of a hockey stick; technology rarely is the cause for the growth-its building what your users may want and when your wrong throw that stuff away and actually build what they want. A helpful tool is to build features in a way where the feature or parts of the feature can be turned it off with a config change. This will save you a ton of headache without having to take the site down, while enabling pushing code out quickly and watching to see if its adopting prior to optimizing.

Suggestion #6: Plan for things to break and set up procedures to handle outages
Things will break. The goal is to hide this fact from users or inconvenience them as little as possible. Schedule maintenance windows to fix the heavy stuff. Have a playbook to handle outages, if the play does not exist-write the play down. Finally automate common tasks. Remember if you don't want any user experiencing an outage-that costs a lot of money. Redundancy is expensive. Multiple Redundancy in multiple datacenters is even more expensive.
I hope these steps help you in your projects in the future. I have had the pleasure of handling multiple hockey sticks and following a basic rule/suggestion set has helped me each time. The end goal really is to give a great experience for your users, build a clean environment for your devs with your devs input and improve the product rapidly.

Some stats: 3-5 web servers, 2 job boxes 2 database servers we are able to handle well over 100K DAU.


Mark said...

Which database are you using that allows you to just drop in a new database and have it 'just work'?

Dathan Vance Pattishall said...

Xtradb with logic in the application layer to handle balancing and migration of data. Really nice shard structure with cool logic for fan out data pulls.