Friday, September 19, 2008

How do you know when you need more memcache servers?

Let me start off with a disclaimer: I do not use memcache to scale, I use it to reduce latency. I firmly believe that the database layer should be able to handle the requests, while memcache keeps frequent requests returning in a consistent time frame, i.e. it reduces I/O spikes.

Capacity planning is key to making sure your site can serve its users' requests. If the site is slow or down, that is lost revenue in any revenue model used to monetize your product.


How to determine when you need to add more memcache servers.

The stats I look at are system stats and memcache stats.

Memcache is memory and network heavy. CPU usage is normally very low; if the CPU starts maxing out, that is probably due to some sort of network driver issue, heavy context switching, or large values stored in memcache.

So on the system side I look at vmstat:


[root@memcached1 ~]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- ---system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 0  0    208 207400  41452  50552    0    0    10   198     0     0  2  9 87  2
 0  1    208 207408  41452  50552    0    0     0    81 15617 17671  3 13 84  0
 0  0    208 207352  41452  50552    0    0     0    56 15508 17514  3 13 84  0
 1  0    208 207248  41452  50620    0    0     0   310 15295 16762  3 12 84  0
 0  0    208 207248  41452  50620    0    0     0    31 15512 17167  2 13 84  0
 0  0    208 207256  41452  50620    0    0     0     3 15925 18214  3 14 84  0
 0  0    208 207264  41452  50620    0    0     0     0 15456 16923  3 13 85  0
 0  0    208 207264  41452  50620    0    0     0   213 15782 17604  3 13 84  0
 0  0    208 207264  41452  50620    0    0     0    40 15860 18036  2 13 84  0
 2  0    208 207272  41452  50620    0    0     0   214 15926 18248  3 14 84  0
 0  0    208 207288  41452  50620    0    0     0    77 15781 17617  3 13 84  0


This server is dedicated to memcache. The context switching is huge due to all of the constant requests, but we are talking about modern-day CPUs, which can context switch like crazy. The thing that is bugging me is that processes are starting to show up in the run queue (the r column), not at an alarming rate, but it is still an indication of a possible issue.

This is something that is graphed in Ganglia. If the average run queue starts increasing, there is a problem.
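
As a rough illustration of watching the run queue, here is a minimal sketch that averages the r column from captured vmstat output. The function name vmstatAverageRunQueue is hypothetical, and the sketch assumes the 16-column layout shown above, where r is the first field of every data row. The sample text reuses three rows from the output above.

```php
<?php
// Hypothetical helper: average the run queue ("r") column of vmstat output.
// Assumes the 16-column layout shown in the post, with "r" as the first field.
function vmstatAverageRunQueue(string $output): ?float
{
    $samples = [];
    foreach (preg_split('/\R/', trim($output)) as $line) {
        $fields = preg_split('/\s+/', trim($line));
        // Data rows start with a number; this skips both header rows.
        if (count($fields) < 16 || !ctype_digit($fields[0])) {
            continue;
        }
        $samples[] = (int) $fields[0];
    }
    // Return null when no data rows were found.
    return $samples ? array_sum($samples) / count($samples) : null;
}

$sample = <<<TXT
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 208 207400 41452 50552 0 0 10 198 0 0 2 9 87 2
1 0 208 207248 41452 50620 0 0 0 310 15295 16762 3 12 84 0
2 0 208 207272 41452 50620 0 0 0 214 15926 18248 3 14 84 0
TXT;

$averageRunQueue = vmstatAverageRunQueue($sample);
printf("average run queue: %.2f\n", $averageRunQueue);
```

Keep in mind that the first row vmstat prints is an average since boot, so a real check may want to drop it before averaging.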


Next, the stats from memcache.


/**
 * @desc get extended status from all servers
 */
public function CacheGetStats(){

    // Array keys are quoted strings; bare words would be looked up as constants.
    if ($GLOBALS['cfg']['disable_feature_memcache']){
        return true;
    }

    return $this->memcache_obj->getExtendedStats();
}



I have a class called Cache, which is a wrapper around the memcache class calls. Cal Henderson would kill me if I were using classes at Flickr. Don't get me wrong, I agree with Cal 100%, but the environment I am in now requires classes, so I have to use them. The reason why we don't like classes is for another post.

So the output.

[pid] => 17696
[uptime] => 2748911
[time] => 1221850214
[version] => 1.2.2
[pointer_size] => 64
[rusage_user] => 135944.231335
[rusage_system] => 420733.419798
[curr_items] => 6012187
[total_items] => 2362145406
[bytes] => 4737438938
[curr_connections] => 654
[total_connections] => 4128179078
[connection_structures] => 7293
[cmd_get] => 12681552588
[cmd_set] => 2362145408
[get_hits] => 9880855733
[get_misses] => 2800696855
[evictions] => 0
[bytes_read] => 2564412782739
[bytes_written] => 12893067371405
[limit_maxbytes] => 5242880000
[threads] => 4



Notice that this server has a good hit rate (get_hits / cmd_get is about 78%) and no evictions. Looking at one server is not good enough, though; look at them all. More memcache servers means more memory to store data for your application, but the CRC32 hash that the PHP memcache client uses does not spread keys very evenly, and some keys may be requested more than others.



[pid] => 13956
[uptime] => 4228079
[time] => 1221850213
[version] => 1.2.2
[pointer_size] => 64
[rusage_user] => 268369.193681
[rusage_system] => 711491.537845
[curr_items] => 5219411
[total_items] => 3686853272
[bytes] => 4751658935
[curr_connections] => 675
[total_connections] => 4154000955
[connection_structures] => 9981
[cmd_get] => 19489963453
[cmd_set] => 3686853275
[get_hits] => 15062084538
[get_misses] => 4427878915
[evictions] => 11210410
[bytes_read] => 3908139025173
[bytes_written] => 10744393525089
[limit_maxbytes] => 5242880000
[threads] => 4



Take a look at this server. The evictions are high, indicating that memcache needs to make room for new objects. This is not good; it's an indication that the LRU is evicting objects before they reach their expire time. Additionally, the gets are much greater than the hits: get_misses is over 4.4 billion. This is an indication that memcache is not working as well as it can.
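
To make these numbers easier to compare across servers, here is a minimal sketch that derives a hit rate, an eviction rate, and memory fill from one server's entry in the getExtendedStats() result. The function name memcacheServerSummary and the choice of evictions per second of uptime as the metric are my assumptions, not something from the wrapper class above. The input values are copied from the second server's output.

```php
<?php
// Hypothetical helper: summarize one server's entry from getExtendedStats().
function memcacheServerSummary(array $stats): array
{
    $gets   = (float) $stats['cmd_get'];
    $uptime = (float) $stats['uptime'];
    $limit  = (float) $stats['limit_maxbytes'];

    return [
        // Share of gets that were answered from the cache.
        'hit_rate'          => $gets > 0 ? $stats['get_hits'] / $gets : 0.0,
        // Evictions averaged over the whole uptime (an assumed metric).
        'evictions_per_sec' => $uptime > 0 ? $stats['evictions'] / $uptime : 0.0,
        // How full the cache is relative to its configured limit.
        'memory_used'       => $limit > 0 ? $stats['bytes'] / $limit : 0.0,
    ];
}

// Values taken from the second server shown above.
$summary = memcacheServerSummary([
    'uptime'         => 4228079,
    'cmd_get'        => 19489963453,
    'get_hits'       => 15062084538,
    'evictions'      => 11210410,
    'bytes'          => 4751658935,
    'limit_maxbytes' => 5242880000,
]);

printf(
    "hit rate %.1f%%, %.2f evictions/sec, %.1f%% of memory used\n",
    $summary['hit_rate'] * 100,
    $summary['evictions_per_sec'],
    $summary['memory_used'] * 100
);
```

Averaging over uptime hides recent changes; graphing the difference between two samples, the way the run queue is graphed, would show the current eviction rate instead.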

But one server is not an indication that there is a problem. Look at the system as a whole to determine whether a problem exists. My rule of thumb is that if 30-40% of the servers have a high eviction rate, it's time to add 30-40% more servers or memory.
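
The rule of thumb could be sketched roughly as follows. Everything named here is an assumption for illustration: the function name memcachePoolNeedsCapacity, the threshold of 1 eviction per second that stands in for "high" (the rule itself does not define a number), and the host:port keys in the sample data. The two entries reuse the uptime and eviction counts from the two servers shown above.

```php
<?php
// Hypothetical helper applying the rule of thumb to the whole pool.
// $maxEvictionsPerSec is an assumed definition of a "high" eviction rate.
function memcachePoolNeedsCapacity(
    array $allStats,
    float $maxEvictionsPerSec = 1.0,
    float $poolFraction = 0.3
): array {
    $seen = 0;
    $evicting = 0;
    foreach ($allStats as $server => $stats) {
        // Skip entries that carry no stats, e.g. an unreachable server.
        if (!is_array($stats) || empty($stats['uptime'])) {
            continue;
        }
        $seen++;
        if ($stats['evictions'] / $stats['uptime'] > $maxEvictionsPerSec) {
            $evicting++;
        }
    }
    $fraction = $seen > 0 ? $evicting / $seen : 0.0;

    return [
        'fraction_evicting' => $fraction,
        'add_capacity'      => $fraction >= $poolFraction,
    ];
}

// Same shape as getExtendedStats(): one entry per server (names are made up).
$pool = [
    'memcached1:11211' => ['uptime' => 2748911, 'evictions' => 0],
    'memcached2:11211' => ['uptime' => 4228079, 'evictions' => 11210410],
    'memcached3:11211' => false,
];

$verdict = memcachePoolNeedsCapacity($pool);
printf(
    "%.0f%% of servers evicting heavily, add capacity: %s\n",
    $verdict['fraction_evicting'] * 100,
    $verdict['add_capacity'] ? 'yes' : 'no'
);
```

With only two reachable servers the fraction is not very meaningful; the check is meant for a pool large enough that 30-40% is a real trend rather than one hot server.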

Now, a lot of this can be tuned by changing the slab size, but as I learned from John Allspaw: don't make a plan based on a possible gain, make a plan based on the current usage. Then if the possible gain works out, you're golden.

How do you base your stats on adding more memcache servers?

3 comments:

Anonymous said...

So, what's wrong with classes? Are you talking about the old days of PHP4 or are you saying classes are bad in PHP5 too?

Anonymous said...

Great article, thanks!

But as a PHP programmer I just can't ignore one thing: your misuse of PHP constants.

$GLOBALS[cfg][disable_feature_memcache] - this makes PHP search for two constants, cfg and disable_feature_memcache, and throws two notices, which are probably ignored. First, these should be written as strings (e.g. $GLOBALS['cfg']['disable_feature_memcache']) because "it is the right way" :) and secondly, it removes the overhead of PHP searching for something that does not exist. This way you will surely get fewer of the CPU spikes you talk about from your PHP code.

Oh, and yes, I will be looking forward to reading about your opinion why objects/classes in PHP are bad.

Good luck!

Dathan Pattishall said...

@sk
Yeah, I'm lazy with constants in this fashion. I'll fix that up.

OO PHP programming is hard to trace. If a framework is not established, it's very easy for multiple coders to create a huge spaghetti mess of methods that do nested inheritance, side-effect public vars, and create unnecessary overhead.

I measured that each instance of PEAR DB.php has a 4K overhead, which is huge and directly attributable to the object!

Now, I've seen some really good OOP PHP classes, and it's actually been a pleasure working in that environment.

The easiest method I have found for coding with multiple developers is procedural coding in PHP.

file_name.inc
function names are

file_name_Function(vars){}

So, anytime you reference a function you know where to debug it immediately.