Monday, July 31, 2006

Rebuilding an 4.0 data into 4.1 data

In 4.0 the default character set is latin1. When converting to 4.1 this relationship is preserved. In many cases as in the case at flickr.com we store utf8 data in latin1 columns. So, think of the situation as latin1 being the container for utf8. This screws up a lot of thing especially when Java is used. We have to hack the driver to tell it even though the server is returning the data is latin1 it's really UTF8.


So, what is a good way to convert to true utf8 character set with a unicode collation.

Rebuild the data! This is the way that I'm doing it to also shrink the innodb datafile. There is a lot of free space.

Now remember we have utf8 data in a latin1 container column, so we need to make sure that we get a dump of the raw data. MySQL will double pad the data with default options converting the latin1 charset to utf8, which is garbage.

mysqldump -hHOST -uroot -p DB TABLE --no-create-info --compatible=mysql323

This statement will dump the raw data, the option that tells mysqldump to do so is mysql323 compatible.


My personal preference is to

mysqldump -hHOST -uroot -p DB TABLE -T /tmp/

What this does is SELECT * INTO OUTFILE '/tmp/TABLE.txt'

This to will get raw data.

Since all of my tables are INNODB, dumping it out sorted by the PRIMARY index is good for re-import via LOAD DATA.

Next we dump the table structure and remove all occurances of latin1 on each of the column definitions.

Finally reimport the data

Start the slave.


So here is a good procedure.


  1. Stop the Slave

  2. Stop MySQL

  3. Put in your my.cnf skip-slave-start

  4. Start mySQL (we do the above steps to make sure everything is flushed to disk just to be sure)

  5. Next check that the dump point has enough space

  6. Dump the data via the method described above for each table

  7. Dump the table definition

  8. Stop mySQL

  9. Destroy the INNODB datafile and log files rm /var/lib/mysql/ib*

  10. Remove Database directory

  11. Edit my.cnf and define default startup for the server


    [mysqld]
    character-set-server=utf8
    collation-server=utf8_unicode_ci
    character-set-server=utf8
    collation-server=utf8_unicode_ci
    #
    # on every connection we tell the server that transfer data as utf8 so it doesn't translate data comming in for us. SET NAMES is equivlent to
    # SET character_set_client = utf8; SET character_set_connection = utf8;SET character_set_results = utf8
    #
    init-connect="SET NAMES = utf8"




  12. Start mySQL

  13. CREATE DATABASE DATABASENAME DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci

  14. Remove all references to latin1 from the column definition of each table

  15. At the end of each table definition add ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

  16. Add the tables to the database

  17. IMPORT THE DATA




Let me know if you have any questions.

5 comments:

Sal said...

only 1 question: what's "colliation"?

Dathan Pattishall said...

that's a bad spelling mistake. I didn't think anyone was reading this yet.

Sal said...

ah ha! thank you.

(i came here via cal's linklog- i assumed he was a mate of yours)

Frank said...

Hey so is Flickr currently migrating to 4.1? Just curious. Why not 5.x?

Nice blog BTW
Frank

Dathan Pattishall said...

We migrated to 4.1 a while ago but now I want to rebuild the data into pure utf8, since I have to reduce the size of the datafile.

I didn't want to use 5.0 because 4.1 is nice and stable and we don't need the features in 5.0.