Tuesday, November 22, 2011

Using live code interrupts to produce stats, which in turn improve code

How do you know that your code is fast? Is it fast for your test cases, or is it fast for every case? When changes are made, how does that affect your customers? How do you know, over a period of time, whether the system is getting faster or slower?

The same stat system that is used to track new installs, viral clicks, impressions, rates, funnels, page flows, gauges, counters, etc. also lets me know how fast code blocks are performing. How is this done?

Since a Front Controller design pattern is used for my AJAX calls, I am able to wrap all calls in time deltas to produce a centralized stat on how long every service call takes. So, for each code change that is pushed, I can see whether that change slowed something down or broke something altogether.
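As a rough illustration of the idea (not the production code behind this post), a front controller that times every call might look like the Python sketch below. The names `handlers`, `timings`, `register` and `dispatch` are hypothetical, and the in-memory list stands in for whatever stat store is actually used.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical in-memory sink; a real system would send these to the stat store.
timings: List[Tuple[str, float]] = []

# Hypothetical registry mapping service names to their handler functions.
handlers: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a handler under a service name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        handlers[name] = fn
        return fn
    return decorator


def dispatch(name: str, **params: Any) -> Any:
    """Single entry point: every service call passes through here,
    so one timer covers all of them."""
    start = time.perf_counter()
    try:
        return handlers[name](**params)
    finally:
        # Record the time delta even if the handler raised, so broken
        # calls still show up in the stats.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append((name, elapsed_ms))


@register("echo")
def echo(message: str) -> str:
    """Trivial example service."""
    return message
```

Because the timer lives in the single dispatch point, individual handlers need no instrumentation of their own, and a newly added service call is measured automatically.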

Here is the setup. Each service call response time is placed into one of five buckets: less than 200 ms, 201 ms to 500 ms, 501 ms to 1 second, 1.001 seconds to 2 seconds, and greater than 2 seconds. Anything over 500 ms is bad.
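The bucketing itself is simple to sketch. The Python below is a minimal example under a few assumptions: the bucket labels and function names are made up for illustration, and the upper bounds are treated as inclusive (so exactly 200 ms lands in the first bucket), which the description above does not specify.

```python
from collections import Counter
from typing import Iterable

# Upper bounds in milliseconds, taken from the bucket list above.
# Labels are hypothetical; bounds are assumed to be inclusive.
BUCKETS = [
    (200, "<=200ms"),
    (500, "201-500ms"),
    (1000, "501ms-1s"),
    (2000, "1s-2s"),
]
OVERFLOW = ">2s"
SLOW_THRESHOLD_MS = 500  # "Anything over 500 ms is bad."


def bucket_for(elapsed_ms: float) -> str:
    """Return the label of the first bucket whose upper bound fits."""
    for upper, label in BUCKETS:
        if elapsed_ms <= upper:
            return label
    return OVERFLOW


def histogram(samples: Iterable[float]) -> Counter:
    """Count how many response times fall into each bucket."""
    return Counter(bucket_for(ms) for ms in samples)


def slow_fraction(samples: Iterable[float]) -> float:
    """Share of calls slower than the 500 ms threshold."""
    values = list(samples)
    if not values:
        return 0.0
    return sum(1 for ms in values if ms > SLOW_THRESHOLD_MS) / len(values)
```

Storing only per-bucket counts keeps the stat cheap to record, and a rise in the upper buckets is still easy to spot on a graph.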

On the 13th of November I see that service calls in the 2 second+ bucket are on the rise. The next step is to look at independent data sources to determine whether this is a system issue or a code issue. This is the tricky part, because code issues can cause system issues. I use ganglia to separate the system stats from the code interrupt stats and interpret the results.

The system stats I look at to help me drill down are the ones that encapsulate the problem I am investigating. For instance, service calls are a product of CPU and memory usage on the www tier. These calls use network resources to talk to databases, memcache and queues. So I ask myself: what are some high-level system metrics that can help me figure out whether there is a system issue? Apache has a lot of high-level stats, like busy workers and requests per second. Busy workers are a product of memory, CPU and network resources. This is a good stat.

Here busy workers are seen to be increasing, but not all that much. So what is the slowdown? Let's look at another ganglia stat: requests per second.

Wow, requests per second are going up, while busy workers (above) are growing only slightly, with hiccups. So what I need to look at is the data tier, since the www servers are scaling well (higher request rates with the same www busy rate).

Unfortunately, the graphs for that tier are plagued with huge spikes caused by long-term issues in RRD, so I can't show them. Long story short, the increase in requests came from a sudden spike of new and returning users. That caused more concurrency on the backend and exposed memcache evictions, because more users were active. So I added more memcache servers and scheduled a rebalance of database data onto new servers; this should lower average service times without code changes.

In summary, realtime code insights are just as valuable as knowing the number of installs or clicks from emails. You can do all sorts of things, like drawing custom dots for code deployments so they can be correlated with response times. Having this data is invaluable for keeping your site fast at very little cost, while giving you knowledge of the entire system.


Anonymous said...

You should check out NewRelic. You can get similar data to ganglia, and much more. You get production-level profiling that can tell you how much time was spent in individual methods, querying the db, memcache, and external calls. It also has client-side JS you can hook up to report the average page response time, which it can break down by country, and all kinds of other cool things.

There is a stripped down free version, but the pay version is definitely worth checking out.


Streeter said...

What stats system are you guys using for the top graphs? The bottom one is ganglia but I'm not familiar with the top.

Dathan Vance Pattishall said...

Vinay wrote the graphing software, which is based on jquery.flot; it's very custom.

Code interrupts are just tags written to MySQL, concatenated for every 5-minute block.