Re: what is the step to convert a db from latin1 to UTF8
Posted by: Rick James
Date: October 17, 2009 07:01PM

Here's the model, as I understand it...

* The bytes in a char or text field are assumed to be encoded according to the character set associated with the column. (This association may be inherited from the system/database/table; use SHOW CREATE TABLE to find out the encoding.)

* You declare the bytes coming from (and going back to) your application by using SET NAMES or setting these three VARIABLES:
| character_set_client     | utf8
| character_set_connection | utf8
| character_set_results    | utf8

* When data flows from point A to point B (eg, client to server), MySQL looks at the encodings of A and B (see bullet items above) to decide whether a conversion is needed. If it is needed, it applies the appropriate conversion (without you needing to call iconv).

The typical problem is that people fail to do SET NAMES, and the bytes get converted (or fail to get converted) as they are inserted into the table.

Sometimes, you are lucky -- bytes go in incorrectly, then come out incorrectly, but the "two wrongs make a right". This really confuses diagnosing things.

Sometimes, you can get a "double conversion". Example: The column is declared utf8. The client is latin1, but the bytes in the client are really utf8. Inserting a 2-byte e-grave (for example) turns into 4 bytes. As long as the reading client is also latin1 (according to character_set_results), but treats the bytes as utf8, nothing seems broken.

I would like to get my head wrapped around your problem and be able to say "do X and Y", but I am not there. I spend the last day trying to unravel some Hebrew characters; they seem to be taking 5 bytes in the table, when they should take only 2.

Could you give me some actual data to experiment with...
SHOW CREATE TABLE tbl \G
Find some data with accents, and dump it thus:
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM tbl WHERE ... LIMIT 10;
If possible, tell me what you are expecting 'x' to have in those cases. The declared datatype and the HEX give me something to play with.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: what is the step to convert a db from latin1 to UTF8
3045
October 17, 2009 07:01PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.