Thursday, August 03, 2006

Calculating utf8 sizes for varchars

The UTF-8 spec says that a character can take up to 4 bytes, but MySQL's utf8 character set currently supports only up to 3 bytes per character. So if your application allows 255 characters to be inserted into a field, then in utf8 land (i.e. in a utf8 column) those 255 characters can take up to 765 bytes (255 × 3).
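As a quick sanity check on that arithmetic, here is a minimal Python sketch. The function name `max_bytes` and the constant are my own, made up for illustration; the 3-byte ceiling is the MySQL limit described above.

```python
# Hypothetical helper: worst-case storage for a character column.
# Assumes MySQL's utf8 uses at most 3 bytes per character, as stated above.
MYSQL_UTF8_MAX_BYTES_PER_CHAR = 3


def max_bytes(char_limit, bytes_per_char=MYSQL_UTF8_MAX_BYTES_PER_CHAR):
    """Return the largest number of bytes char_limit characters can occupy."""
    return char_limit * bytes_per_char


print(max_bytes(255))     # worst case in a 3-byte utf8 column
print(max_bytes(255, 4))  # worst case if full 4-byte UTF-8 were allowed
```

The first call gives 765, matching the figure above; the second shows what the same field would need under the full 4-byte spec.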

Here is a breakdown from dev.mysql.com:


  • Basic Latin letters, digits, and punctuation signs use one byte.

  • Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

  • Korean, Chinese, and Japanese ideographs use three-byte sequences.
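If you want to see those sizes for yourself, a small Python sketch like the one below encodes one sample character from each group and counts the bytes. The sample characters and the helper name `utf8_len` are my own picks for illustration, not something from the MySQL docs.

```python
def utf8_len(text):
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One sample per group from the breakdown above (escapes used so the
# source file's own encoding does not matter).
samples = [
    ("Basic Latin 'A'", "\u0041"),
    ("Latin n with tilde", "\u00f1"),
    ("Cyrillic Ya", "\u042f"),
    ("Greek alpha", "\u03b1"),
    ("Hebrew alef", "\u05d0"),
    ("Chinese ideograph", "\u4e2d"),
    ("Korean syllable", "\ud55c"),
]

for label, ch in samples:
    print(label, utf8_len(ch))

# A 255-character value made only of three-byte characters hits the worst case.
print(utf8_len("\u4e2d" * 255))
```

The last line is the worst case for a 255-character field: every character takes three bytes.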
