Let's get a quick summary to find our problems to target the script at.
dsh -ef 20 -w dbfacebook9b,dbfacebook9a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE RollingUserLog\""
dbfacebook9b: *************************** 1. row ***************************
dbfacebook9b: Table: FacebookShard.RollingUserLog
dbfacebook9b: Checksum: 538386033
dbfacebook9a: *************************** 1. row ***************************
dbfacebook9a: Table: FacebookShard.RollingUserLog
dbfacebook9a: Checksum: 538386033
not bad.
CHECKSUM TABLE uses a ACCUM algorithm to determine the table checksum. This gives a good fast snapshot in a shared lock mode so writes do not stop while doing a checksum (this is in the context of INNODB). RollingUserLog gets nearly 700 writes per sec per box (No alerts).
But what if there is a problem
# dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE SimulatedProblem\""
dbfacebook38b: *************************** 1. row ***************************
dbfacebook38b: Table: FacebookShard.SimulatedProblem
dbfacebook38b: Checksum: 660032421
dbfacebook38a: *************************** 1. row ***************************
dbfacebook38a: Table: FacebookShard.SimulatedProblem
dbfacebook38a: Checksum: 2533654621
Well let's see if there is a row count mismatch
# dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"SELECT COUNT(*) FROM SimulatedProblem\""
dbfacebook38b: *************************** 1. row ***************************
dbfacebook38b: COUNT(*): 64358
dbfacebook38a: *************************** 1. row ***************************
dbfacebook38a: COUNT(*): 64358
Ah so the data is inconsistent, we are not missing rows but 1 or more rows have different values. Now time to find it.
To resurrect my script now that I know what table is messed up:
Algorithm is this:
open connection to both servers - fork
compare the data by scanning the table via a index walk
crc32/md5/hash the rows
compare columns
if (dbfacebook38b.hash != dbfacebook38a.hash)
mark row and record primary key to track position
print report
but wait why polish up my script when I can use a formal one that does the job great!
@see
mk-table-checksum
mk-table-sync
# fixes the issue
mk-table-sync h=dbfacebook38a,u=$USER,p=$PASS,D=FacebookShard,t=SimulatedProblem h=dbfacebook38b --execute --no-check-slave
maatkit is awesome. Nearly every problem that I run into and built a script for has been formalized. I'm going to muster up some time and contribute to this toolkit like add my binary log rotate which makes sure that the slaves are caught up to the binary file that is being purged.
1 comment:
I told Fino about mk-table-checksum so many times... he always just did it by hand.
Post a Comment