Friday, March 26, 2010

Some kernel tweaks to aid Cassandra under a high concurrency environment

For the past couple of weeks I have been trouble shooting some Cassandra issues where data would not make it to Cassandra.

Graph of various tracked Exceptions


The image above graphs all the exceptions that are produced from Cassandra. The two big lines are

Transport Exceptions (te) - meaning that Cassandra could not answer the request think of this as MAX Connection errors in mySQL.


Unavailable Exceptions (ue) - meaning that Cassandra could answer the request but the "storage engine" cannot do anything with it because its busy doing something like communicating with other nodes or maintenance like a node cleanup.


So how did I get the graph to drop to 0? After looking at the error logs, I saw that Cassandra was getting flooded with SYN Requests and the kernel thought that it was a SYN Flood and did this

possible SYN flooding on port 9160. Sending cookies.


To stop this the puppet profile was changed to have

sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.ipv4.tcp_syncookies=0



Next looking into the Cassandra log which I defined to exist in /var/log/cassandra/system.log


WARN [TCP Selector Manager] 2010-03-26 02:46:31,619 TcpConnectionHandler.java (line 53) Exception was generated at : 03/26/2010 02:
Too many open files
java.io.IOException: Too many open files


Then noticed that
ulimit -n == 1024

thus I changed
/etc/security/limits.conf so that It's at a server setting by adding this:

* - nofile 8000


Now my Transport Exceptions and Unavailable Exceptions are gone and data is being written to it consistently.

There are many other ways of doing the same thing, I could have modified my init script or did some other stuff but I choose this way. Default Distros set kernel and limits fields too low: settings for desktop levels.

Tuesday, March 23, 2010

Cassandra is my NoSQL solution but..

In the past few months, I have tested many NoSQL solutions. Redis, MongoDB, HBase yet Cassandra is the Column Store DB I picked because of its speed (on writes), reliability, built in feature set that makes it multi-datacenter aware. The one other personal reward for Cassandra is it is written in Java. I like reading and writing in Java more than C++ although it really does not matter for me personally in the end.

Let us talk about the reason why I am introducing Cassandra into my infrastructure and some of its drawbacks I have noticed so far.

Why it is being introduced:
We have a feature where we record every single click for 50 million Monthly Active Users (real-time) and storing this in mySQL is just waste of semi-good hardware for data that is only looked at for the past 24 hours. Over the course of some time (couple of months) more than 3 billion rows accumulated, which translated into a 3.5 TB distributed INNODB datafile. So purging/archiving this data just sucks.

Now introducing user clicks into Cassandra was rather easy. I researched various sites, asked my Digg buddy and then figured out the rest. Within two days I was up and running with Cassandra and had a great understanding about Column Families and SuperColumns.

Developing with a Cassandra Data layer:

Now that the infrastructure is up, I needed to add a data model to /opt/cassandra/conf/storage.conf. The SQL that drove this functionality consisted of two main SQL statements.
Add click
Get a Range of Clicks

For these operations, mySQL rarely takes 0.001 seconds (1 ms).

Cassandra for writes is rather fast, but for reads, Getting Range Clicks
i.e. using
recv_get_slice

it takes .02 seconds (20 ms).

What does this mean? MySQL is a hell of a lot faster! Is it because of my CF design? No.
for instance, take this slug (keys separated by a delimiter to make a distinct key name).

{$clicker}_{$pageowner} => [ $object_id_clicked0 => $value,$ts, ... $object_id_clicked =>N $value,$ts ]


The slug says that clicker A, clicked on page B and what is stored are columns of distinct clicks - remember this is a column store DB.

So
{$clicker}_{$pageowner} == the KEY for the COLUMN $object_id_clicked with the value being a $value and the free extra value the $timestamp.

Here is the php code

$columnPath = $this->getColumnPath($objectid);
$this->getCassandraConnect()->insert(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $columnPath, $this->today_ts, microtime(true), ConsistencyLevel::ZERO);



Now I want a list of items clicked.

$data = $this->getCassandraConnect()->get_slice(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $this->getColumnParent(), $this->getSlicePredicate(), ConsistencyLevel::ONE);


This says give me the last N logged clicks that the clicker A made for page owner B. This is a hash lookup (Big-O(1)) but a sorted list of columns (O(nlogn)) and return the last N elements.

Why is Cassandra sooo slow on reads. Is it because my memory config is not enough?
No. 7 GB of data is allocated for data that fits in memory ( for now ).


Is it because my disk is saturated?

avg-cpu: %user %nice %system %iowait %steal %idle
12.21 0.00 2.85 0.48 0.00 84.47

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04
sda1 0.00 0.00 0.00 0.00 0.00 0.00 20.83 0.00 7.74 6.40 0.00
sda2 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04
dm-0 0.00 0.00 0.24 20.63 22.25 165.03 8.97 0.72 34.27 2.41 5.04
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 5.28 0.22 0.00



No its not.

My hunch is the slowdown is a combo of the low thread read pool and in the inherent sorting and there doesn't seem to be a way to turn it off, so without digging in the code I couldn't say (I will be able to once I get an understanding of the code layout).


In conclusion, I like Cassandra, it’s very fast in writes, slow (for my taste but fast enough) in reads and what takes 2 lines of SQL takes 250 lines of PHP code interfacing with the Cassandra.thrift suite. I am going to use it still because it is good enough and I love the built in HA of it. Additionally the performance is improving very quickly.

Wednesday, March 10, 2010

Talking at the University of Utah

Giving a talk at the University of Utah on everything from scaling, clustering, mysql, mysql internals, noSQL (Cassandra) to how to manage all this stuff. If you are there at University I'm bringing some Swag!

Also I will upload the slides and put them here.