Re: Converting latin1 charset to utf8
Posted by: Rick James
Date: March 12, 2009 10:04AM

Partial answers...

Do you have different encodings in the same field now? If so, that will be a programming problem to extract the values, figure out what they are, and convert them.

Table setting: the table's character set is just a default for columns that do not declare their own. Changing the "table settings" by itself is barking up the wrong tree.

The trick of going through BLOB is for converting data from old MySQL versions (which did not handle charsets) to new ones. It only works if you know that all the values are in a single encoding.
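For reference, the BLOB round trip usually looks something like the sketch below. The names tbl and col are hypothetical and the column types are assumed. The point is that neither step converts anything: the bytes stay as they are and only the declared charset changes, so this fits only when the stored bytes are already valid utf8.

```sql
-- Hypothetical names: tbl, col.
-- Assumes col is currently TEXT CHARACTER SET latin1 and that every
-- stored value is really a sequence of utf8 bytes.

-- Step 1: drop the charset label; the stored bytes are left untouched.
ALTER TABLE tbl MODIFY col BLOB;

-- Step 2: label the same bytes as utf8; again, no conversion is performed.
ALTER TABLE tbl MODIFY col TEXT CHARACTER SET utf8;
```

Try it on a copy of the table first; if the values are in mixed encodings, this will leave some of them mislabeled.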

I have not tried this, but I suspect it would work:
1. SET NAMES utf8; (not strictly necessary, but it declares which encoding your _client_ is using);
2. Select a value from a column declared latin1;
3. Immediately insert into a column declared utf8.
Why? MySQL converts between character sets whenever you insert or extract data. You need exactly one conversion, and it must be "lossless". I think these steps will achieve that. (Note: every latin1 character can be converted losslessly to utf8, but not vice versa.)
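A minimal sketch of those three steps, done in one statement on the server. The names old_tbl, new_tbl, id and col are hypothetical, as are the column definitions; like the steps themselves, this is untested against your data.

```sql
-- Hypothetical names: old_tbl (col declared latin1), new_tbl, id, col.
SET NAMES utf8;  -- declares the client encoding; not needed for the copy itself

CREATE TABLE new_tbl (
    id  INT NOT NULL PRIMARY KEY,
    col VARCHAR(255) CHARACTER SET utf8
);

-- The source and target columns declare different charsets, so the
-- server converts latin1 -> utf8 once while copying each value.
INSERT INTO new_tbl (id, col)
SELECT id, col
FROM old_tbl;
```

Because the copy happens entirely inside the server, the values never pass through the client, which avoids a second conversion on the way.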

Suggestion: Do this in the old table, and in the new:
SELECT LENGTH(col), CHAR_LENGTH(col) FROM...
* For latin1, those two numbers should be the same. (If not, something else is going on; I suggest getting HEX(col) for further analysis.)
* For utf8: with Western languages, each accented character will count as 2 for LENGTH and 1 for CHAR_LENGTH. For Asian languages, LENGTH will be about 3 * CHAR_LENGTH (possibly a little less). Again, if you get something else, we need to dig further.
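One way to run that check on the new utf8 table is sketched below. The names new_tbl, id and col are hypothetical. The WHERE clause keeps only rows containing at least one multi-byte character, which are the interesting ones.

```sql
-- Hypothetical names: new_tbl, id, col.
SELECT id,
       LENGTH(col)      AS byte_len,   -- length in bytes
       CHAR_LENGTH(col) AS char_len,   -- length in characters
       HEX(col)         AS raw_bytes   -- for closer inspection
FROM new_tbl
WHERE LENGTH(col) <> CHAR_LENGTH(col)
LIMIT 20;
```

Running the same query against the old latin1 table should return no rows at all; if it returns any, the column is probably not what it is declared to be.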

In the future, be sure to do
SET NAMES utf8
in the client, immediately after connecting. (This assumes the characters you receive from outside are utf8-encoded.) Otherwise you may end up with double-encoded data, which is a mess to unravel.
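To show what double-encoding looks like at the byte level, here is a small illustration that needs no table. It takes the utf8 bytes for 'é' (C3 A9), pretends they are latin1 characters, and converts them to utf8 a second time, which is roughly what happens when utf8 input arrives over a connection that claims to be latin1.

```sql
-- 'é' is the two bytes C3 A9 in utf8.
-- Reading those bytes as latin1 gives two characters, and converting
-- those to utf8 yields four bytes instead of two.
SELECT HEX(CONVERT(CONVERT(UNHEX('C3A9') USING latin1) USING utf8)) AS double_encoded;
-- C383C2A9
```

If HEX(col) on real data shows patterns like C383C2xx where you expected a single accented character, double-encoding is a likely culprit.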




