Thursday, July 08, 2010

Upgrading Cassandra 0.5.1 to 0.6.3

Every month or so, a node randomly dies:

EQX root@cass01:/opt/cassandra/bin# ./nodeprobe -host localhost -port 8181 ring
Address Status Load Range Ring
facebook_1301003235_1301003235
10.129.28.22 Down 15.77 GB 9ZehBzpHHwnxiPJU |<--|
10.129.28.23 Up 7.59 GB facebook_100000471858343_1514390063 | |
10.129.28.14 Up 4.59 GB facebook_100000846936312 | |
10.129.28.20 Up 12.94 GB facebook_1301003235_1301003235 |-->|


Trying to get info from the host, the reads time out:
java.net.SocketTimeoutException: Read timed out


Running lsof -p on the Java process, I see that it is holding open a bunch of sockets. So my assumption is that the node itself is hanging on something internal.

Looking at /var/log/cassandra/system.log, I see that the last rotation happened on June 8th, over a month ago, and nothing new is being written to the log. The thing is, the node only died today. So this seems like a bug to me.
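A log that silently stops growing is something a monitoring script could catch long before the node falls over. Here is a minimal sketch of that idea in Python. The helper name log_is_stale and the one-hour threshold are my own assumptions for illustration, not anything Cassandra ships; only the log path comes from the setup above.

```python
import os
import time

def log_is_stale(path, max_age_seconds, now=None):
    """Return True if `path` was last modified more than max_age_seconds ago.

    `now` can be passed in (seconds since the epoch) to make the check testable.
    """
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > max_age_seconds

# Path taken from the post; the one-hour threshold is an arbitrary assumption.
LOG_PATH = "/var/log/cassandra/system.log"
if os.path.exists(LOG_PATH) and log_is_stale(LOG_PATH, 3600):
    print("WARNING: %s has not been written to in over an hour" % LOG_PATH)
```

A quiet cluster may legitimately log very little, so the threshold would need tuning before this is wired into any alerting.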


Since Cassandra does not tell me what the problem is, I assume there is a bug in this version. Searching the Cassandra JIRA bug database, I see that a lot of stuff has been fixed as well as added, so I might as well upgrade.

Before upgrading I wanted to do some research to see if anyone else has done it. To my surprise, there doesn't seem to be any blog post about upgrading from 0.5 to 0.6.3.

I know it's rather easy, but there is some new stuff in 0.6.3 that is turned on by default. So let's see what changes in the conf:

diff /opt/cassandra/conf /opt/apache-cassandra-0.6.3/conf

I see that in storage-conf.xml there are some new XML attributes for the ColumnFamily tag, such as RowsCached, and new tags called HintedHandoffEnabled, Authenticator, DiskAccessMode, and RowWarningThresholdInMB.
In addition, I noticed that a lot of XML tags are missing. A rolling upgrade is just not possible, as mentioned in NEWS.txt.
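A plain diff of two XML files gets noisy with comments and whitespace, so it can help to compare only the tag and attribute names. The following is a rough sketch in Python using the standard library's ElementTree. The function names are hypothetical, and it assumes both conf directories contain a file named storage-conf.xml.

```python
import os
import xml.etree.ElementTree as ET

def tags_and_attrs(root):
    """Collect every element name and every 'Tag@attribute' pair under root."""
    names = set()
    for elem in root.iter():
        names.add(elem.tag)
        for attr in elem.attrib:
            names.add("%s@%s" % (elem.tag, attr))
    return names

def compare_configs(old_root, new_root):
    """Return (added, removed): names only in the new config, names only in the old."""
    old, new = tags_and_attrs(old_root), tags_and_attrs(new_root)
    return sorted(new - old), sorted(old - new)

# Directories come from the diff command above; the file name is an assumption.
OLD = "/opt/cassandra/conf/storage-conf.xml"
NEW = "/opt/apache-cassandra-0.6.3/conf/storage-conf.xml"
if os.path.exists(OLD) and os.path.exists(NEW):
    added, removed = compare_configs(ET.parse(OLD).getroot(), ET.parse(NEW).getroot())
    print("Only in new config:", ", ".join(added))
    print("Only in old config:", ", ".join(removed))
```

This only reports which names appear or disappear. Changed default values would still need the regular diff.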

Thus in my application I set $GLOBALS['cfg']['disable_nosql_feature'] = 1; to turn off the Cassandra-backed features. I have about 40 toggles like this to play with; it's a very helpful way to enable or disable code dynamically without breaking your site.


Now it's time for the upgrade, with the service not running:

Steps:

  1. Shut down Cassandra: dsh -g cassandra "pkill java" # same thing as stop-server

  2. rpm -e cassandra-0.5.1

  3. rpm -ivh cassandra-0.6.3.rpm

  4. /opt/cassandra/bin/cassandra



Done. In case you are wondering what the hell cassandra-0.6.3.rpm is: it's an RPM I created that contains my storage-conf.xml, log4j.properties, and cassandra.in.sh.

After Upgrading:


***************************************************************
WARNING: ./nodeprobe is obsolete, use ./nodetool instead
***************************************************************
Address Status Load Range Ring
facebook_1301003235_1301003235
10.129.28.22 Up 11.75 GB 9ZehBzpHHwnxiPJU |<--|
10.129.28.23 Up 3.04 GB facebook_100000471858343_1514390063 | |
10.129.28.14 Up 2.33 GB facebook_100000846936312 | |
10.129.28.20 Up 4.4 GB facebook_1301003235_1301003235 |-->|



Now what is left to do is change my ganglia and nagios scripts to use nodetool instead of nodeprobe.
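For the nagios side, a check mostly needs to read the ring output and complain when a node is not Up. Below is a minimal sketch of that parsing in Python. The function names and the regular expression are my own assumptions based on the output format shown in this post, and the sample text is the pre-upgrade ring output from above. The exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# One ring row: address, status, load value, load unit, token (format assumed from the post).
ROW = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s+(Up|Down)\s+([\d.]+)\s+(\w+)\s+(\S+)")

def parse_ring(text):
    """Return one dict per node row; header and decoration lines are skipped."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            address, status, load, unit, token = m.groups()
            nodes.append({"address": address, "status": status,
                          "load": "%s %s" % (load, unit), "token": token})
    return nodes

def check_ring(text):
    """Return (nagios_exit_code, message) for the given ring output."""
    nodes = parse_ring(text)
    if not nodes:
        return 3, "UNKNOWN: no nodes found in ring output"
    down = [n["address"] for n in nodes if n["status"] != "Up"]
    if down:
        return 2, "CRITICAL: down nodes: " + ", ".join(down)
    return 0, "OK: %d nodes up" % len(nodes)

SAMPLE = """Address Status Load Range Ring
facebook_1301003235_1301003235
10.129.28.22 Down 15.77 GB 9ZehBzpHHwnxiPJU |<--|
10.129.28.23 Up 7.59 GB facebook_100000471858343_1514390063 | |
10.129.28.14 Up 4.59 GB facebook_100000846936312 | |
10.129.28.20 Up 12.94 GB facebook_1301003235_1301003235 |-->|
"""

code, message = check_ring(SAMPLE)
print(message)  # prints "CRITICAL: down nodes: 10.129.28.22"
```

A real plugin would feed the live ring output from nodetool into check_ring and exit with the returned code. I am assuming nodetool takes the same -host and -port arguments as the nodeprobe call at the top of this post.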
