Tuesday, November 22, 2011

Using live code interrupts to produce stats which in turn improves code

How do you know that your code is fast? Is it fast for your test cases, or is it fast for every case? When changes are made, how does that affect your customers? How do you know, over a period of time, whether the system is getting faster or slower?

The same stat system that is used to track new installs, viral clicks, impressions, rates, funnels, page flows, gauges, counters, etc. also lets me know how fast code blocks are performing. How is this done?

Since a Front Controller design pattern is used for my AJAX calls, I am able to wrap all calls in time deltas to produce a centralized stat on how long every service call takes. So, for each code change that is pushed, I can see whether that change slowed something down or broke something altogether.
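To make the idea concrete, here is a minimal sketch in Python of timing every call at a single entry point. It is not the production code: the names `HANDLERS`, `TIMINGS`, `service` and `dispatch` are made up for illustration, and the in-memory list stands in for whatever stat system actually receives the numbers.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry mapping service names to handler functions.
HANDLERS: Dict[str, Callable[..., Any]] = {}

# Hypothetical stat sink: (service name, elapsed milliseconds) pairs.
# A real setup would forward these to the stat system instead.
TIMINGS: List[Tuple[str, float]] = []


def service(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a handler under a service name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        HANDLERS[name] = fn
        return fn
    return register


def dispatch(name: str, **params: Any) -> Any:
    """Single entry point: every service call goes through here and is timed."""
    start = time.perf_counter()
    try:
        return HANDLERS[name](**params)
    finally:
        # Record the time delta even when the handler raises, so broken
        # calls show up in the stats as well as slow ones.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        TIMINGS.append((name, elapsed_ms))


@service("echo")
def echo(message: str) -> str:
    """Trivial example handler."""
    return message
```

Because the timing lives in `dispatch`, no individual handler has to know it is being measured, which is what makes the stat centralized.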

Here is the setup. Each service call response time is placed into a bucket: less than 200 ms, 201 ms to 500 ms, 501 ms to 1 second, 1.001 seconds to 2 seconds, and greater than 2 seconds. Anything over 500 ms is bad.
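The bucketing itself can be sketched in a few lines of Python. The labels and function names below are illustrative, and treating each upper bound as inclusive (so exactly 200 ms lands in the first bucket) is an assumption.

```python
from collections import Counter
from typing import Iterable

# Upper bounds in milliseconds, treated as inclusive (an assumption).
BUCKETS = [
    (200, "<=200ms"),
    (500, "201-500ms"),
    (1000, "501ms-1s"),
    (2000, "1s-2s"),
]
OVERFLOW = ">2s"

# Anything over 500 ms counts as bad.
BAD_THRESHOLD_MS = 500


def bucket_for(elapsed_ms: float) -> str:
    """Return the label of the bucket a response time falls into."""
    for upper, label in BUCKETS:
        if elapsed_ms <= upper:
            return label
    return OVERFLOW


def is_bad(elapsed_ms: float) -> bool:
    """True when a response time exceeds the 500 ms threshold."""
    return elapsed_ms > BAD_THRESHOLD_MS


def histogram(samples_ms: Iterable[float]) -> Counter:
    """Count how many response times fall into each bucket."""
    return Counter(bucket_for(ms) for ms in samples_ms)
```

Counting per bucket rather than averaging is what makes a rise in the slowest bucket visible even when most calls are still fast.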


On the 13th of November I see that service calls in the 2 second+ bucket are on the rise. The next step is to look at independent data sources to determine whether this is a system issue or a code issue. This is the tricky part, because code issues can cause system issues. I use Ganglia to separate the system stats from the code interrupt stats and interpret the results.
 
The system stats I look at to help me drill down to the issue are the ones that encapsulate the problem I am investigating. For instance, service calls are a product of CPU and memory usage on the www tier. These calls use network resources to talk to databases, memcache and queues. So I ask myself: what are some high-level system metrics that would help me figure out whether there is a system issue? Apache has a lot of high-level stats, like busy workers and requests per second. Busy workers are a product of memory, CPU and network resources. This is a good stat.




Here busy workers are seen to be increasing, but not by all that much. So what is causing the slowdown? Let's look at another Ganglia stat: requests per second.




Wow, requests per second are going up, while busy workers (above) are growing only slightly, with hiccups. So what I need to look at is the data tier, since the wwws are scaling well (higher request rates with the same www busy rate).
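That comparison can be written down as a rough check. The sketch below is only one way to express "higher request rates with the same busy rate": the function names and the 10% tolerance are assumptions, and the inputs would be readings taken off the busy workers and requests per second graphs.

```python
from typing import Tuple

# A reading is (busy_workers, requests_per_second) at one point in time.
Reading = Tuple[float, float]


def busy_per_request_rate(busy_workers: float, requests_per_second: float) -> float:
    """Busy workers per unit of request rate; lower means load is absorbed more cheaply."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return busy_workers / requests_per_second


def www_tier_scaling_well(before: Reading, after: Reading, tolerance: float = 0.10) -> bool:
    """True if busy workers grew no faster than the request rate, within a tolerance."""
    ratio_before = busy_per_request_rate(*before)
    ratio_after = busy_per_request_rate(*after)
    return ratio_after <= ratio_before * (1.0 + tolerance)
```

If this check passes while the slow buckets keep growing, the www tier is unlikely to be the bottleneck, which points the investigation at the data tier.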

Unfortunately the graphs for that are plagued with huge spikes caused by long-term issues in RRD, so they are not showable. Long story short, the increase in requests comes from a sudden spike of new and returning users. That causes more concurrency on the backend and, with more active users, exposes memcache evictions. So I added more memcache servers and scheduled a rebalance of database data onto new servers; this should lower service times on average without code changes.
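Since the graphs are not showable, here is a sketch of how an eviction rate could be computed directly from memcached. It is not necessarily how the evictions were spotted in this case. It assumes you have captured the text output of the memcached `stats` command twice, and it uses the standard `uptime` and `evictions` counters; the function names are made up.

```python
from typing import Dict


def parse_memcached_stats(raw: str) -> Dict[str, str]:
    """Parse 'STAT <name> <value>' lines from memcached `stats` output."""
    stats: Dict[str, str] = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def evictions_per_second(earlier_raw: str, later_raw: str) -> float:
    """Eviction rate between two `stats` snapshots from the same server."""
    earlier = parse_memcached_stats(earlier_raw)
    later = parse_memcached_stats(later_raw)
    # `uptime` is in seconds, so the difference is the time between snapshots.
    elapsed = int(later["uptime"]) - int(earlier["uptime"])
    if elapsed <= 0:
        raise ValueError("snapshots must be in order and from the same server run")
    return (int(later["evictions"]) - int(earlier["evictions"])) / elapsed
```

A rate that climbs as active users climb suggests the cache is too small for the working set, which is the situation that adding memcache servers addresses.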

In summary, realtime code insights are just as valuable as knowing the number of installs or clicks from emails. You can do all sorts of things, like drawing custom dots for code deployments so they can be correlated with response times. Having this data is invaluable for keeping your site fast at very little cost, while giving you knowledge of the entire system.
