Tuesday, March 23, 2010

Cassandra is my NoSQL solution but..

In the past few months, I have tested many NoSQL solutions: Redis, MongoDB, and HBase. Yet Cassandra is the column-store DB I picked because of its speed (on writes), its reliability, and a built-in feature set that makes it multi-datacenter aware. The other personal draw of Cassandra is that it is written in Java. I like reading and writing Java more than C++, although in the end it really does not matter for me personally.

Let us talk about why I am introducing Cassandra into my infrastructure and some of the drawbacks I have noticed so far.

Why it is being introduced:
We have a feature that records every single click for 50 million monthly active users (in real time), and storing this in MySQL is just a waste of semi-good hardware for data that is only looked at for the past 24 hours. Over the course of a couple of months, more than 3 billion rows accumulated, which translated into a 3.5 TB distributed InnoDB datafile. Purging/archiving this data just sucks.

Introducing user clicks into Cassandra was rather easy. I researched various sites, asked my Digg buddy, and figured out the rest. Within two days I was up and running with Cassandra and had a solid understanding of Column Families and SuperColumns.

Developing with a Cassandra Data layer:

Now that the infrastructure is up, I needed to add a data model to /opt/cassandra/conf/storage.conf. The SQL that drove this functionality consisted of two main statements:
Add a click
Get a range of clicks
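
As a rough sketch of what those two statements amount to (a hypothetical schema modeled in Python with sqlite3; the post does not show the real MySQL table, so the column names here are illustrative):

```python
import sqlite3

# Hypothetical clicks table; the real MySQL schema is not shown in the post.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE clicks (
    clicker    TEXT,
    page_owner TEXT,
    object_id  INTEGER,
    ts         INTEGER
)""")

# 1) Add a click
db.execute(
    "INSERT INTO clicks (clicker, page_owner, object_id, ts) VALUES (?, ?, ?, ?)",
    ("A", "B", 42, 1269302400))

# 2) Get a range of clicks for one clicker/page-owner pair (last 10 by time)
rows = db.execute(
    "SELECT object_id, ts FROM clicks "
    "WHERE clicker = ? AND page_owner = ? ORDER BY ts DESC LIMIT 10",
    ("A", "B")).fetchall()
```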

For these operations, MySQL rarely takes more than 0.001 seconds (1 ms).

Cassandra is rather fast for writes, but reads are another story. Getting a range of clicks (the get_slice call shown below) takes .02 seconds (20 ms).

What does this mean? MySQL is a hell of a lot faster! Is it because of my CF design? No.
For instance, take this slug (keys separated by a delimiter to make a distinct key name):

{$clicker}_{$pageowner} => [ $object_id_clicked0 => ($value, $ts), ... $object_id_clickedN => ($value, $ts) ]

The slug says that clicker A clicked on page B, and what is stored are columns of distinct clicks. Remember, this is a column-store DB.

{$clicker}_{$pageowner} == the KEY; each COLUMN is named $object_id_clicked, with $value as its value and the $timestamp as the free extra value.
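
As a mental model of that layout (plain Python, not the Thrift API; the names are illustrative), the column family is a hash of row keys, each holding named columns carrying a value and a timestamp:

```python
# Sketch of the column-family layout described above; names are illustrative.
cf = {}  # row key -> {column name -> (value, timestamp)}

def add_click(clicker, pageowner, object_id, value, ts):
    key = f"{clicker}_{pageowner}"  # the slug: delimiter-joined row key
    cf.setdefault(key, {})[f"object_id_clicked{object_id}"] = (value, ts)

add_click("A", "B", 0, "click", 1269302400)
add_click("A", "B", 1, "click", 1269302401)
# Row "A_B" now holds one column per distinct object clicked.
```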

Here is the PHP code:

$columnPath = $this->getColumnPath($objectid);
// Write one click: value is today's timestamp bucket, ZERO = fire-and-forget.
$this->getCassandraConnect()->insert(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $columnPath, $this->today_ts, microtime(true), ConsistencyLevel::ZERO);

Now I want a list of items clicked.

// Read the row's columns (the clicks) at consistency level ONE.
$data = $this->getCassandraConnect()->get_slice(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $this->getColumnParent(), $this->getSlicePredicate(), ConsistencyLevel::ONE);

This says: give me the last N logged clicks that clicker A made for page owner B. The row lookup is a hash lookup (O(1)), but the columns are kept as a sorted list (O(n log n) to sort), and the last N elements are returned.
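
In other words (a plain-Python sketch of the semantics, not of Cassandra's internals), the row is found by key in constant time, the columns are ordered by name, and the slice takes the last N:

```python
# Model: row key -> {column name -> value}; the slice returns the last N
# columns in sorted column-name order.
cf = {"A_B": {"obj3": "v3", "obj1": "v1", "obj2": "v2"}}

def get_slice_last_n(row_key, n):
    row = cf[row_key]              # O(1) hash lookup of the row
    ordered = sorted(row.items())  # columns ordered by name
    return ordered[-n:]            # last N elements

print(get_slice_last_n("A_B", 2))  # [('obj2', 'v2'), ('obj3', 'v3')]
```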

Why is Cassandra sooo slow on reads? Is it because my memory config is not big enough?
No. 7 GB is allocated for data that fits in memory (for now).

Is it because my disk is saturated?

avg-cpu: %user %nice %system %iowait %steal %idle
12.21 0.00 2.85 0.48 0.00 84.47

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04
sda1 0.00 0.00 0.00 0.00 0.00 0.00 20.83 0.00 7.74 6.40 0.00
sda2 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04
dm-0 0.00 0.00 0.24 20.63 22.25 165.03 8.97 0.72 34.27 2.41 5.04
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 5.28 0.22 0.00

No, it's not.

My hunch is that the slowdown is a combination of the small read-thread pool and the inherent sorting, which there doesn't seem to be a way to turn off. Without digging into the code I couldn't say for sure (I will be able to once I get an understanding of the code layout).

In conclusion, I like Cassandra: it's very fast on writes, slow on reads (for my taste, but fast enough), and what takes 2 lines of SQL takes 250 lines of PHP code interfacing with the Cassandra Thrift suite. I am going to use it anyway because it is good enough and I love its built-in HA. Additionally, the performance is improving very quickly.


Unknown said...


Feel free to join us in the #cassandra irc channel on freenode for some help in getting better performance out of Cassandra. I think you can probably redesign your column families in such a way that you can use the ordered partitioner and still get the data you need.

James (jbathgate)

Anonymous said...

Have you tried partitioning with MySQL, like partitioning by date? Selects whose range falls within one partition won't touch the other partitions. You can set InnoDB to file-per-table, in which case every partition will be its own file. (Each partition is effectively another table, because partitioning is performed at the MySQL level, not in the storage engine.) Dropping a partition is a quite fast operation.
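
A rough sketch of that idea (partitions modeled as per-date buckets in plain Python; real MySQL partitioning is declared in DDL, which is not shown here):

```python
# Model MySQL date partitioning: one bucket (think: one file) per day.
partitions = {
    "2010-03-21": [("A", 1), ("B", 2)],
    "2010-03-22": [("C", 3)],
    "2010-03-23": [("D", 4)],
}

def range_select(start, end):
    # Partition pruning: only buckets inside the date range are touched.
    return [row for day, rows in partitions.items()
            if start <= day <= end for row in rows]

def drop_partition(day):
    # Dropping a partition is cheap: the whole bucket goes at once,
    # instead of a slow row-by-row DELETE.
    del partitions[day]

print(range_select("2010-03-22", "2010-03-23"))  # [('C', 3), ('D', 4)]
drop_partition("2010-03-21")
```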

Anonymous said...

You should also look at MySQL column DBs, which may be a good fit for you (e.g. Calpont/InfiniDB, among others). Best of both worlds for some use cases.

Mark Callaghan said...

Nice post.

I am not sure you should trust all of the anon advice you get here (InfiniDB for OLTP?). However, TokuDB would have been good for your workload.

What is the max IOPs rate you can drive using Cassandra? I can get ~20k from InnoDB, ~40k from PBXT and ~80k from MyISAM for a read-only workload. I ask because one server that provides 10k IOPs is cheaper to run than 10 that do 100. I have yet to find results for HBase and Cassandra.

Anonymous said...

Interesting post, but I'm not sure the goal is incredibly fast reads; rather, it is consistent read times as the data grows and the infrastructure scales.

Unknown said...

Since you're using supercolumns, you need to be aware of CASSANDRA-598. If you're inserting 10,000 subcolumns but only asking for the last 10, all 10,000 will need to be deserialized until that ticket is closed. If you instead used a simple CF, or fewer subcolumns, get_slice would be much faster.
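
The cost difference can be sketched like this (plain Python with JSON standing in for the on-disk serialization; the numbers and names are illustrative, not Cassandra internals):

```python
import json

# Illustrative sketch of CASSANDRA-598: a supercolumn is stored as one blob,
# so reading the last 10 subcolumns still decodes all 10,000 of them.
subcolumns = {f"c{i:05d}": i for i in range(10_000)}
blob = json.dumps(subcolumns)  # stand-in for the serialized supercolumn

def last_n_from_supercolumn(blob, n):
    everything = json.loads(blob)          # must decode ALL subcolumns first
    return sorted(everything.items())[-n:]

def last_n_from_simple_cf(columns, n):
    return sorted(columns.items())[-n:]    # slice directly; no full decode

# Same answer either way, but the supercolumn path pays for 10,000 decodes.
assert last_n_from_supercolumn(blob, 10) == last_n_from_simple_cf(subcolumns, 10)
```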

Dathan Pattishall said...

On #cassandra all the time. jellis helped me get rid of a crash bug

@anon about partitions - yes, I thought about it and use it in other places, but since this is a high-write, high-concurrency-throughput service, blocking for a few hundred ms is not ideal.

Calculating the raw disk IOPs, it seems to be on par with MyISAM (since it's append-only). I'm still building benchmarks to get a good test that cuts the mustard for a good profile of this service. I'll update this soon.

@drift - this is not a super column

So maybe the get_slice bug that you mentioned also affects regular ColumnTypes?

Dathan Pattishall said...

@mark sorry, you said read-only; my assumption was write-only. For read-only I do not have a good test yet.

Gustavo Niemeyer said...

I don't think Cassandra reads are as bad as you make them look in the post. For a very thorough benchmark, check out the following whitepaper by Yahoo! Research:


Note that the whitepaper is based on an old version of Cassandra, and there have been specific improvements in the read area in recent releases, which means those figures are even less significant now.

Kapil Thangavelu said...

You neglected to mention which version of Cassandra you're using. Cassandra 0.6 (currently in beta) has significantly improved read performance via better caches.

Dathan Pattishall said...

@Kapil Thangavelu using the latest 0.5

@Gustavo Niemeyer - if you look at the graph, Cassandra reads are in the 20-60 ms range, the same as I see.

Kapil Thangavelu said...

Thanks for the version info, Dathan. If you've got the inclination, it might be worthwhile to test with 0.6, which adds a row cache as per

Depending on your workload that might be of significant help; then again, some workloads don't get much from a cache.