Capacity planning is key to making sure your site can serve requests to users. If the site is slow or down, that is lost revenue, whatever revenue model you use to monetize your product.
So how do you determine when you need to add more memcache servers?
The stats I look at are system stats and memcache stats.
Memcache is memory- and network-heavy. CPU usage stays very low, and if the CPU starts maxing out, that is probably due to some sort of network driver issue, heavy context switching, or large values stored in memcache.
So on the system side I look at vmstat:
[root@memcached1 ~]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 0  0    208 207400  41452  50552    0    0    10   198     0     0  2  9 87  2
 0  1    208 207408  41452  50552    0    0     0    81 15617 17671  3 13 84  0
 0  0    208 207352  41452  50552    0    0     0    56 15508 17514  3 13 84  0
 1  0    208 207248  41452  50620    0    0     0   310 15295 16762  3 12 84  0
 0  0    208 207248  41452  50620    0    0     0    31 15512 17167  2 13 84  0
 0  0    208 207256  41452  50620    0    0     0     3 15925 18214  3 14 84  0
 0  0    208 207264  41452  50620    0    0     0     0 15456 16923  3 13 85  0
 0  0    208 207264  41452  50620    0    0     0   213 15782 17604  3 13 84  0
 0  0    208 207264  41452  50620    0    0     0    40 15860 18036  2 13 84  0
 2  0    208 207272  41452  50620    0    0     0   214 15926 18248  3 14 84  0
 0  0    208 207288  41452  50620    0    0     0    77 15781 17617  3 13 84  0
This server is dedicated to memcache. The context switching (the cs column) is huge due to all of the constant requests, but we are talking about modern CPUs, which can context switch like crazy. The thing that is bugging me is that processes are starting to show up in the run queue (the r column). It is not at an alarming rate, but it is still an indication of a possible issue.
This is something that is graphed in Ganglia. If the average run queue starts increasing, there is a problem.
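If you don't have graphs handy, the run queue average can be pulled straight out of captured vmstat output. The following is only a rough sketch: the function name `averageRunQueue` and the `$skipFirst` parameter are made up for illustration, and it assumes the 16-column vmstat layout shown above. It skips the first data row by default because vmstat's first row reports averages since boot rather than a live sample.

```php
<?php
// Hypothetical helper: average the "r" (run queue) column of vmstat output.
// Assumes the 16-column layout shown in the sample above.
function averageRunQueue(string $vmstatOutput, bool $skipFirst = true): float
{
    $runQueue = [];
    foreach (preg_split('/\R/', trim($vmstatOutput)) as $line) {
        $fields = preg_split('/\s+/', trim($line));
        // Skip the two header rows: only keep rows of 16 fields that start with a number.
        if (count($fields) < 16 || !ctype_digit($fields[0])) {
            continue;
        }
        $runQueue[] = (int) $fields[0];
    }
    if ($skipFirst) {
        // The first row is an average since boot, not a live sample.
        array_shift($runQueue);
    }
    return $runQueue ? array_sum($runQueue) / count($runQueue) : 0.0;
}

// A few rows copied from the vmstat sample above.
$sample = <<<'TXT'
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 208 207400 41452 50552 0 0 10 198 0 0 2 9 87 2
0 1 208 207408 41452 50552 0 0 0 81 15617 17671 3 13 84 0
1 0 208 207248 41452 50620 0 0 0 310 15295 16762 3 12 84 0
2 0 208 207272 41452 50620 0 0 0 214 15926 18248 3 14 84 0
TXT;

printf("average run queue: %.2f\n", averageRunQueue($sample));
```

A single average over a short window says little on its own. What matters is whether the number trends upward over days or weeks, which is why graphing it is the better long-term answer.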
Next, the stats from memcache.
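/**
 * @desc get extended stats from all servers
 */
public function CacheGetStats() {
    // Feature flag: skip the stats call when memcache is disabled.
    if (!empty($GLOBALS['cfg']['disable_feature_memcache'])) {
        return true;
    }
    // Returns one stats array per server, keyed by "host:port".
    return $this->memcache_obj->getExtendedStats();
}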
I have a class called Cache, which is a wrapper around the memcache class calls. Cal Henderson would kill me if I were using classes at Flickr. Don't get me wrong, I agree with Cal 100%, but the environment I am in now requires classes, so I have to use them. Why we don't like classes is a topic for another post.
So, the output:
[pid] => 17696
[uptime] => 2748911
[time] => 1221850214
[version] => 1.2.2
[pointer_size] => 64
[rusage_user] => 135944.231335
[rusage_system] => 420733.419798
[curr_items] => 6012187
[total_items] => 2362145406
[bytes] => 4737438938
[curr_connections] => 654
[total_connections] => 4128179078
[connection_structures] => 7293
[cmd_get] => 12681552588
[cmd_set] => 2362145408
[get_hits] => 9880855733
[get_misses] => 2800696855
[evictions] => 0
[bytes_read] => 2564412782739
[bytes_written] => 12893067371405
[limit_maxbytes] => 5242880000
[threads] => 4
Notice that this server has a good hit rate (get_hits divided by cmd_get is about 78%) and no evictions. But looking at one server is not good enough; look at them all. More memcache servers means more memory to store data for your application, but that memory does not fill evenly: the CRC32 hash that the PHP memcache client uses does not spread keys very evenly, and some keys are requested more than others.
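To get a feel for how evenly keys land on servers, you can count them per server for a sample of your own keys. This is a simplified sketch, not the client's actual algorithm: it assumes a plain "CRC32 modulo server count" scheme, and `pickServer`, `keyDistribution`, and the `user:N` key format are all hypothetical.

```php
<?php
// Assumption: a simplified stand-in for the client's server selection.
// The real PHP memcache client may post-process the CRC32 value differently.
function pickServer(string $key, int $serverCount): int
{
    // Mask to a non-negative value so the modulo is safe on 32-bit builds too.
    return (crc32($key) & 0x7fffffff) % $serverCount;
}

// Count how many of the given keys map to each server index.
function keyDistribution(array $keys, int $serverCount): array
{
    $counts = array_fill(0, $serverCount, 0);
    foreach ($keys as $key) {
        $counts[pickServer($key, $serverCount)]++;
    }
    return $counts;
}

// Hypothetical key names; substitute a sample of real keys from your application.
$keys = [];
for ($i = 0; $i < 10000; $i++) {
    $keys[] = "user:$i";
}

print_r(keyDistribution($keys, 4));
```

Even a perfectly even key count would not guarantee even load, since value sizes and request frequency differ per key. That is why the per-server stats are still the thing to watch.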
[pid] => 13956
[uptime] => 4228079
[time] => 1221850213
[version] => 1.2.2
[pointer_size] => 64
[rusage_user] => 268369.193681
[rusage_system] => 711491.537845
[curr_items] => 5219411
[total_items] => 3686853272
[bytes] => 4751658935
[curr_connections] => 675
[total_connections] => 4154000955
[connection_structures] => 9981
[cmd_get] => 19489963453
[cmd_set] => 3686853275
[get_hits] => 15062084538
[get_misses] => 4427878915
[evictions] => 11210410
[bytes_read] => 3908139025173
[bytes_written] => 10744393525089
[limit_maxbytes] => 5242880000
[threads] => 4
Take a look at this server. The evictions are high, indicating that memcache has to make room for new objects. This is not good: it's an indication that the LRU is evicting objects before their expire time. Additionally, the gets exceed the hits by more than 4.4 billion (see get_misses). This is an indication that memcache is not working as well as it can.
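The raw counters in the stats output are easier to compare across servers once they are turned into ratios. Here is a minimal sketch, assuming a stats array shaped like the output above; `summarizeStats` and its result keys are names I made up. Note that dividing by uptime gives a lifetime average, so a recent burst of evictions would be smoothed out.

```php
<?php
// Sketch: derive a few capacity ratios from one server's stats array.
// summarizeStats() and its output keys are hypothetical names.
function summarizeStats(array $s): array
{
    // Stats may arrive as numeric strings, so cast before dividing.
    $gets   = (float) $s['cmd_get'];
    $uptime = (float) $s['uptime'];
    $limit  = (float) $s['limit_maxbytes'];

    return [
        // Share of gets that were served from cache.
        'hit_rate'          => $gets > 0 ? (float) $s['get_hits'] / $gets : 0.0,
        // Lifetime average of evictions per second of uptime.
        'evictions_per_sec' => $uptime > 0 ? (float) $s['evictions'] / $uptime : 0.0,
        // How full the cache is relative to its configured memory limit.
        'memory_fill'       => $limit > 0 ? (float) $s['bytes'] / $limit : 0.0,
    ];
}

// Values copied from the second server's output above.
$server2 = [
    'uptime'         => '4228079',
    'cmd_get'        => '19489963453',
    'get_hits'       => '15062084538',
    'evictions'      => '11210410',
    'bytes'          => '4751658935',
    'limit_maxbytes' => '5242880000',
];

foreach (summarizeStats($server2) as $name => $value) {
    printf("%-18s %.4f\n", $name, $value);
}
```

For trending, it is better to sample these counters at intervals and graph the deltas than to rely on lifetime averages.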
But one server is not an indication that there is a problem. You have to look at the system as a whole to determine whether a problem exists. My rule of thumb: if 30-40% of the servers have a high eviction rate, it's time to add 30-40% more servers or memory.
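The rule of thumb can be sketched against the array that getExtendedStats() returns (one entry per "host:port", or false for a server that did not answer). The function names, the hostnames, and especially the threshold for what counts as a "high" eviction rate are assumptions for illustration; pick a threshold that fits your own traffic.

```php
<?php
// Sketch of the rule of thumb: what share of servers is evicting heavily?
// The default of 1.0 evictions/sec as "high" is an arbitrary placeholder.
function shareOfEvictingServers(array $extendedStats, float $maxEvictionsPerSec = 1.0): float
{
    $seen = 0;
    $evicting = 0;
    foreach ($extendedStats as $server => $s) {
        // A server that is down shows up as false; skip it.
        if (!is_array($s) || empty($s['uptime'])) {
            continue;
        }
        $seen++;
        if ((float) $s['evictions'] / (float) $s['uptime'] > $maxEvictionsPerSec) {
            $evicting++;
        }
    }
    return $seen > 0 ? $evicting / $seen : 0.0;
}

// True when at least $shareThreshold of the responding servers evict heavily.
function needsMoreCapacity(array $extendedStats, float $shareThreshold = 0.3): bool
{
    return shareOfEvictingServers($extendedStats) >= $shareThreshold;
}

// Hypothetical hostnames; uptime and evictions copied from the two outputs above.
$stats = [
    'memcached1:11211' => ['uptime' => '2748911', 'evictions' => '0'],
    'memcached2:11211' => ['uptime' => '4228079', 'evictions' => '11210410'],
    'memcached3:11211' => false, // did not respond
];

var_dump(needsMoreCapacity($stats));
```

Servers that do not respond are left out of the percentage here, which is a design choice: a dead server is a different problem from a full one.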
Now, a lot of this can be tuned by changing the slab size, but as I learned from John Allspaw: don't make a plan based on a possible gain, make a plan based on current usage. Then if the possible gain works out, you're golden.
What stats do you base your decision to add more memcache servers on?

