Friday, June 25, 2010

dsh and TABLE CHECKSUM

So running through some various tasks, I'm finally on the section of work where I can resurrect a script that finds inconsistent data between master-master pairs.

Let's get a quick summary to find our problems to target the script at.


dsh -ef 20 -w dbfacebook9b,dbfacebook9a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE RollingUserLog\""
dbfacebook9b: *************************** 1. row ***************************
dbfacebook9b: Table: FacebookShard.RollingUserLog
dbfacebook9b: Checksum: 538386033
dbfacebook9a: *************************** 1. row ***************************
dbfacebook9a: Table: FacebookShard.RollingUserLog
dbfacebook9a: Checksum: 538386033


not bad.

CHECKSUM TABLE uses a ACCUM algorithm to determine the table checksum. This gives a good fast snapshot in a shared lock mode so writes do not stop while doing a checksum (this is in the context of INNODB). RollingUserLog gets nearly 700 writes per sec per box (No alerts).

But what if there is a problem

# dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE SimulatedProblem\""
dbfacebook38b: *************************** 1. row ***************************
dbfacebook38b: Table: FacebookShard.SimulatedProblem
dbfacebook38b: Checksum: 660032421
dbfacebook38a: *************************** 1. row ***************************
dbfacebook38a: Table: FacebookShard.SimulatedProblem
dbfacebook38a: Checksum: 2533654621


Well let's see if there is a row count mismatch


# dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"SELECT COUNT(*) FROM SimulatedProblem\""
dbfacebook38b: *************************** 1. row ***************************
dbfacebook38b: COUNT(*): 64358
dbfacebook38a: *************************** 1. row ***************************
dbfacebook38a: COUNT(*): 64358



Ah so the data is inconsistent, we are not missing rows but 1 or more rows have different values. Now time to find it.

To resurrect my script now that I know what table is messed up:

Algorithm is this:

open connection to both servers - fork
compare the data by scanning the table via a index walk
crc32/md5/hash the rows
compare columns
if (dbfacebook38b.hash != dbfacebook38a.hash)
mark row and record primary key to track position

print report

but wait why polish up my script when I can use a formal one that does the job great!

@see

mk-table-checksum
mk-table-sync

# fixes the issue

mk-table-sync h=dbfacebook38a,u=$USER,p=$PASS,D=FacebookShard,t=SimulatedProblem h=dbfacebook38b --execute --no-check-slave


maatkit is awesome. Nearly every problem that I run into and built a script for has been formalized. I'm going to muster up some time and contribute to this toolkit like add my binary log rotate which makes sure that the slaves are caught up to the binary file that is being purged.

1 comment:

Streeter said...

I told Fino about mk-table-checksum so many times... he always just did it by hand.