Thursday, July 08, 2010

Upgrading Cassandra 0.5.1 to 0.6.3

Every month or so, a node randomly dies:

EQX root@cass01:/opt/cassandra/bin# ./nodeprobe -host localhost -port 8181 ring
Address   Status   Load       Range                                 Ring
                              facebook_1301003235_1301003235
          Down     15.77 GB   9ZehBzpHHwnxiPJU                      |<--|
          Up       7.59 GB    facebook_100000471858343_1514390063   |   |
          Up       4.59 GB    facebook_100000846936312              |   |
          Up       12.94 GB   facebook_1301003235_1301003235        |-->|

Trying to get info from the host, the reads time out: "Read timed out".

Running lsof -p on the Java process, I see that it is holding open a bunch of sockets, so my assumption is that the node itself is hanging on something internal.

Looking at /var/log/cassandra/system.log, I see that the last rotation happened on Jun 8th, over a month ago, and nothing new has been written to the log since. Yet the node only died today, so this looks like a bug to me.
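A log that silently stops growing is easy to catch automatically. Here is a minimal sketch of such a check; it is not something from my setup. The function name log_is_stale and the 60-minute threshold are made up for illustration, and it relies on the -mmin test that GNU and BSD find both provide.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the file has not been
# modified within the last N minutes, i.e. the log looks stale.
log_is_stale() {
    log_file=$1
    max_age_min=${2:-60}   # assumed threshold, tune to taste
    # A missing log is treated as stale too.
    [ -f "$log_file" ] || return 0
    # find prints the path only when the mtime is older than max_age_min minutes.
    [ -n "$(find "$log_file" -mmin +"$max_age_min" -print)" ]
}
```

Used from cron or a monitoring script, it could look like: log_is_stale /var/log/cassandra/system.log 60 && echo "system.log has gone quiet"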

Since Cassandra does not tell me what the problem is, I assume there is a bug in this version. Searching the Cassandra JIRA bug database, I see that a lot of stuff has been fixed as well as added since then, so I might as well upgrade.

Before upgrading I wanted to do some research to see if anyone else has done this. To my surprise, there doesn't seem to be any blog post about upgrading from 0.5 to 0.6.3.

I know it's rather easy, but there is some new stuff in 0.6.3 that is turned on by default, so let's see what changes in the conf:

diff /opt/cassandra/conf /opt/apache-cassandra-0.6.3/conf

I see that in storage-conf.xml there are some new XML attributes for the ColumnFamily tag, such as RowsCached, and new tags called HintedHandoffEnabled, Authenticator, DiskAccessMode, and RowWarningThresholdInMB.
In addition, I noticed that a lot of XML tags are missing. A rolling upgrade is just not possible, which is mentioned in NEWS.txt.
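To see at a glance which tags came and went between the two configs, something like the following rough sketch can help. The helper names xml_tags, tags_removed and tags_added are hypothetical, and this is a crude grep-based scan rather than a real XML parser, so treat its output as a hint, not as the truth.

```shell
#!/bin/sh
# Hypothetical helper: print the unique element names found in an XML file.
# Limitations: it ignores attributes (so it will not notice a new attribute
# such as RowsCached) and it also picks up tags inside XML comments.
xml_tags() {
    grep -o '<[A-Za-z][A-Za-z0-9_]*' "$1" | tr -d '<' | sort -u
}

# Run comm on the sorted tag lists of two files; $1 is the comm column flag.
compare_tags() {
    old_tags=$(mktemp)
    new_tags=$(mktemp)
    xml_tags "$2" > "$old_tags"
    xml_tags "$3" > "$new_tags"
    comm "$1" "$old_tags" "$new_tags"
    rm -f "$old_tags" "$new_tags"
}

# Tags present in the old config but absent from the new one.
tags_removed() { compare_tags -23 "$1" "$2"; }

# Tags present in the new config but absent from the old one.
tags_added() { compare_tags -13 "$1" "$2"; }
```

With the paths from this post, and assuming both directories contain a storage-conf.xml, that would be: tags_removed /opt/cassandra/conf/storage-conf.xml /opt/apache-cassandra-0.6.3/conf/storage-conf.xml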

So in my application I set $GLOBALS['cfg']['disable_nosql_feature'] = 1; to disable the NoSQL feature. I have about 40 toggles like this to play with; they are a very helpful way to enable or disable code dynamically without breaking your site.

Now it's time for an upgrade without the service running:


  1. Shut down Cassandra: dsh -g cassandra "pkill java" # same thing as stop-server

  2. rpm -e cassandra-0.5.1

  3. rpm -ivh cassandra-0.6.3.rpm

  4. /opt/cassandra/bin/cassandra

Done. In case you are wondering what cassandra-0.6.3.rpm is: it's an rpm I created that includes my storage-conf.xml.

After Upgrading:

WARNING: ./nodeprobe is obsolete, use ./nodetool instead
Address   Status   Load       Range                                 Ring
                              facebook_1301003235_1301003235
          Up       11.75 GB   9ZehBzpHHwnxiPJU                      |<--|
          Up       3.04 GB    facebook_100000471858343_1514390063   |   |
          Up       2.33 GB    facebook_100000846936312              |   |
          Up       4.4 GB     facebook_1301003235_1301003235        |-->|

Now all that is left to do is change my ganglia and nagios scripts to use nodetool instead of nodeprobe.
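For the nagios side, a check along these lines would do. This is only a sketch under assumptions, not my actual script: the function names count_status and check_ring are invented here, and it simply reads the ring output on stdin and looks for "Up" and "Down" in the Status column. The return values follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical helper: count ring rows that contain the given status word
# ("Up" or "Down") as a whole whitespace-delimited field.
count_status() {
    awk -v s="$1" '
        { for (i = 1; i <= NF; i++) if ($i == s) { n++; break } }
        END { print n + 0 }'
}

# Hypothetical Nagios-style check: reads `nodetool ring` output on stdin.
check_ring() {
    ring_output=$(cat)
    down=$(printf '%s\n' "$ring_output" | count_status Down)
    up=$(printf '%s\n' "$ring_output" | count_status Up)
    if [ "$down" -gt 0 ]; then
        echo "CRITICAL: $down node(s) down, $up up"
        return 2
    fi
    if [ "$up" -eq 0 ]; then
        # No rows at all usually means the tool could not reach the node.
        echo "UNKNOWN: no ring rows found"
        return 3
    fi
    echo "OK: $up node(s) up"
    return 0
}
```

Assuming nodetool takes the same -host and -port options that nodeprobe did, the check would be wired up as: /opt/cassandra/bin/nodetool -host localhost -port 8181 ring | check_ring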
