Re: what is the step to convert a db from latin1 to UTF8
Posted by: Rick James
Date: October 14, 2009 10:34PM
You should not have to use iconv. MySQL, php, asp, etc, etc each receive/send some bytes. The problem is that the bytes do not declare whether they are utf8 encodings for some characters, or latin1 encodings, or whatever.
You have an unknown number of problems.
The client talking to MySQL should declare "my bytes are utf8-encoded"; this is done by "SET NAMES utf8" right after connecting.
Any bytes in the database may be 'wrong'. Let's investigate them by doing these:
SHOW CREATE TABLE tbl \G -- need to see how it is declared.
SELECT x, hex(x), length(x), char_length(x) FROM tbl
to see what is in field 'x'. Please do this for some rows with short values for x that have accent characters.
If you have Western European accent characters encoded in latin1:
length(x) will be the same as char_length(x) and hex(x) will show one hex pair for each accented character.
If you have Western European accent characters encoded in utf8:
length(x) > char_length(x) (1 extra per accented letter)
For Chinese in utf8:
length(x) will be about 3 times char_length(x)
If an accented letter takes 4 bytes, you have suffered from "double encoding".
The above hex stuff is independent of how the row is declared. Now for some conversion suggestions:
If the declared encoding matches apparent encoding, there is nothing to change (here).
If there is a mismatch, consider this 2-step (assuming utf8 bytes in a latin1 VARCHAR (44) column):
ALTER TABLE tbl MODIFY COLUMN x VARBINARY(44);
ALTER TABLE tbl MODIFY COLUMN x VARCHAR(44) CHARACTER SET utf8;