Monday, August 14, 2006

Quick php script to figure out bad utf8 characters.

When I wrote this post about invalid utf8 characters I needed a way to convert the mysql message into a real identifier for me to take a look at. Below is a quick and dirty script to figure out the bad rows.

I wrote the script below in literally 2 mins. It's really basic and basically just CROSS JOINs two versions of the table one in utf8 the other in latin1 and reports back which string column is not correct. I haven't cleaned the script up, it's ugly and for my purposes will only exist for a short period of time.

test.$TABLE = is the original table latin1 for instance
- use mysqlimport this forces data to not get converted to latin1 from utf8 by setting the character set to binary (i.e. do no convert)

$DB.$TABLE = is the new utf8 table.



#!/usr/bin/php -q


$dbh = mysql_connect('localhost','root','',1);
$DB = 'your db';
$TABLE = 'your table';

if (!$dbh) {
die ("Failed to connect\n");
}


if (!mysql_select_db($DB,$dbh)) {
die ("Failed to changed dbs: " . mysql_error());
}

$query = "SELECT * FROM $TABLE LIMIT 1";
$result = mysql_query($query,$dbh);

$fields = mysql_num_fields($result);

print "Number of fields : $fields\n";

$names = array();

for($i = 0; $i < $fields; $i++) {
$type = mysql_field_type($result, $i);
$name = mysql_field_name($result, $i);

if ($type == 'string') {
$names[] = $name;
}
}
mysql_free_result($result);

foreach($names as $name) {

$query = "SELECT a1.id FROM $DB.$TABLE a1, test.$TABLE a2 WHERE a1.id=a2.id AND BINARY(a1.$name) != BINARY(a2.name)";

$result = mysql_query($query,$dbh);
while($row = mysql_fetch_assoc($result)) {
print "Invalid UTF8 Chars for column $name: ROWID: $row[id]\n";
}

mysql_free_result($result);


}
?>

1 comment:

Anonymous said...

Just a quick note you no longer need to pass the "-q" option. PHP CLI doesn't send HTTP headers...