Thursday, September 30, 2010

MongoDB: The Definitive Guide by Kristina Chodorow and Michael Dirolf


The kind folks at O'Reilly sent me a fantastic book about MongoDB. It was a great read, since it's suited to people who do operations, development, and performance tuning (me). I've been using Cassandra for quite some time now (months, lol), and the thing that has irritated me about it is the documentation. Cassandra's documentation sucks; it's hard to get up to speed on the internals. This MongoDB book is written by people who are among the most active participants in MongoDB's development, and the knowledge shows. What I like is that it starts with how to quickly get MongoDB up and running and add, get, and update data in the DB, then progresses to more advanced topics such as GridFS and the MongoDB drivers. Personally, I would like to see more elaboration on these advanced topics: the motivation for using them, what the win is, and how they fit into the "Fast by Default" mantra. Each step is well organized and detailed, with nice graphics that illustrate the document store or the flow of data from a systems view.

The documentation on mongodb.org has the same sort of clarity as this book. I don't see this transparency in the documentation for other NoSQL systems, which is rather frustrating because it makes the learning curve much steeper. I'm so impressed with the info and the test results around the web that I'm moving to add MongoDB to my environment. Does this mean I'll get rid of my current Cassandra deployment? Probably not, since it's working great for my needs right now.


Overall, great book, great info, intelligently presented with a straightforward explanation of how MongoDB works.

Friday, September 03, 2010

Cassandra and Ganglia

cassandra_tpstats_row_read_stage_completed

I finally got some time to do some house cleaning. One of my nagging low-hanging-fruit jobs was to stop using jconsole as my monitor, so I created a Ganglia script to produce the graph above. It shows the total row read stages completed in the last hour for all the Cassandra servers, as a gauge. In essence, I am graphing the delta between Ganglia script runs.
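To make the "delta between runs" idea concrete, here is a minimal sketch in Python (the real script is Perl, and this is not taken from it). The function name, the JSON state file, and the choice to report 0 on the first run or after a counter reset are all assumptions made for illustration.

```python
import json
import os


def delta_since_last_run(name, current, state_path):
    """Return how much a counter grew since the previous run, then remember it.

    Hypothetical helper: the state file layout is an assumption, not the
    format used by the original Perl script.
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)

    previous = state.get(name)

    # Store the current value so the next run can diff against it.
    state[name] = current
    with open(state_path, "w") as fh:
        json.dump(state, fh)

    # First run, or the counter went backwards (e.g. the server restarted):
    # there is no meaningful delta, so report 0 (an assumed policy).
    if previous is None or current < previous:
        return 0
    return current - previous
```

Each run would call this once per cumulative counter and hand the result to Ganglia, so the graph shows activity per interval rather than an ever-growing total.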

How I have it set up is:

All the data that JMX exposes to produce tpstats and cfstats is graphed via Ganglia. The naming pattern for each graph is as follows:

cass_{stat_class}_{key}

stat_class - tpc, tpp, or tpa, meaning completed, pending, and active respectively
key - the name of the stat, for instance message deserialization
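As a rough illustration of the naming pattern above, the sketch below builds a metric name from a state and a key. It is Python rather than the Perl of the real script, and the way the key is normalised (lowercased, with spaces and hyphens turned into underscores) is an assumption, as is the function name.

```python
# Mapping from thread pool state to the stat_class prefixes described above.
STAT_CLASSES = {"completed": "tpc", "pending": "tpp", "active": "tpa"}


def metric_name(state, key):
    """Build a cass_{stat_class}_{key} metric name (hypothetical helper)."""
    stat_class = STAT_CLASSES[state]
    # Assumption: keys are lowercased and joined with underscores so the
    # name is safe to use as a Ganglia metric name.
    slug = "_".join(key.lower().replace("-", " ").split())
    return f"cass_{stat_class}_{slug}"
```

For example, `metric_name("pending", "message deserialization")` would give `cass_tpp_message_deserialization`.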

For column family stats I graph the keyspace stats as well as the specific column family stats exposed by cfstats. For instance below:

Cassandra cfstats with ganglia

If you're interested in the scripts, I'll send them to you or put them up on code.google.com. They're written in object-oriented Perl and take the same packaging approach as the Maatkit toolkit for MySQL by Xaprb and crew (all the "classes" live in the same file as the application).

GmetricDelegate is the parent package
GmetricCassandra extends GmetricDelegate, overrides getData, and defines which stats are absolute values versus gauges.

Following the same pattern, I also have
GmetricInnoDB
GmetricMySQL

and so on.
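The parent/child layout described above could look roughly like this. It is a Python sketch of the idea, not a port of the Perl code: only the names GmetricDelegate, GmetricCassandra, and getData (written get_data here) come from the post. The classify method, the MODULES lookup, the stat names, the sample values, and the choice of which stat counts as absolute are all assumptions.

```python
class GmetricDelegate:
    """Parent class: shared logic lives here, subclasses supply the data."""

    # Stats reported as-is. Anything not listed is treated as a gauge,
    # i.e. reported as the delta between runs (an assumed convention).
    absolute_stats = frozenset()

    def get_data(self):
        raise NotImplementedError("subclasses must override get_data")

    def classify(self):
        """Tag every stat from get_data as 'absolute' or 'gauge'."""
        result = {}
        for name, value in self.get_data().items():
            kind = "absolute" if name in self.absolute_stats else "gauge"
            result[name] = (kind, value)
        return result


class GmetricCassandra(GmetricDelegate):
    # Assumption: pending counts are point-in-time values, so report them raw.
    absolute_stats = frozenset({"row_read_stage_pending"})

    def get_data(self):
        # Hard-coded sample values stand in for stats read over JMX.
        return {"row_read_stage_completed": 1200, "row_read_stage_pending": 3}


# Hypothetical lookup table, so a module can be chosen by name at startup.
MODULES = {"GmetricCassandra": GmetricCassandra}


def load_module(name):
    return MODULES[name]()
```

Adding another data source then only means writing one more subclass with its own get_data and registering it.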

Then, on each server, I run:

/usr/bin/perl -w /home/scripts/ganglia_gmetric.pl --module=GmetricCassandra

This then talks to Ganglia through gmetric to report the stats.
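For anyone who has not used gmetric, reporting a single stat boils down to one command-line call per metric. The Python sketch below only builds that command; it does not run it. The --name, --value, --type, and --units options are standard gmetric flags, but the function name, the default type and units, and the metric shown are assumptions rather than what the Perl script actually sends.

```python
def gmetric_command(name, value, metric_type="uint32", units="count"):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),   # gmetric takes the value as a string argument
        "--type", metric_type,   # e.g. uint32, float, double, string
        "--units", units,
    ]
```

On a host with Ganglia installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to publish the value.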

Update: I uploaded an alpha version to http://code.google.com/p/gangliastats/. Be warned: the comments are sparse. I'll have another check-in with documentation soon.