Re: mysqldump re-encoding utf8 multi-byte characters - Why does latin1 solve the problem?
Posted by: Ely Peddler
Date: January 10, 2011 02:11AM

I would guess it's a standard version as it's the one on my hosting provider's server.

The charsets are in an odd location but in the end they didn't seem to make the difference as I explained I thought they were but that was me mis-reading the output.

The data appears to be fine in the database, when I access it via php I get utf8 encoded text output as I would expect, and if I use mysql to select text I see the correct characters decoded from the utf8. So the database is sending out properly encoded utf8 it seems.

The issue appears to be that mysqldump is taking that utf8 and treating it as if it's latin1 and is re-encoding it. Passing --default-character-set=latin1 to mysqldump definitely fixes the problem.

SHOW COLLATION gives the following

+----------------------+----------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci | big5 | 1 | Yes | Yes | 1 |
| big5_bin | big5 | 84 | | Yes | 1 |
| dec8_swedish_ci | dec8 | 3 | Yes | Yes | 1 |
| dec8_bin | dec8 | 69 | | Yes | 1 |
| cp850_general_ci | cp850 | 4 | Yes | Yes | 1 |
| cp850_bin | cp850 | 80 | | Yes | 1 |
| hp8_english_ci | hp8 | 6 | Yes | Yes | 1 |
| hp8_bin | hp8 | 72 | | Yes | 1 |
| koi8r_general_ci | koi8r | 7 | Yes | Yes | 1 |
| koi8r_bin | koi8r | 74 | | Yes | 1 |
| latin1_german1_ci | latin1 | 5 | | Yes | 1 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | Yes | 1 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | Yes | 1 |
| latin1_general_cs | latin1 | 49 | | Yes | 1 |
| latin1_spanish_ci | latin1 | 94 | | Yes | 1 |
| latin2_czech_cs | latin2 | 2 | | Yes | 4 |
| latin2_general_ci | latin2 | 9 | Yes | Yes | 1 |
| latin2_hungarian_ci | latin2 | 21 | | Yes | 1 |
| latin2_croatian_ci | latin2 | 27 | | Yes | 1 |
| latin2_bin | latin2 | 77 | | Yes | 1 |
| swe7_swedish_ci | swe7 | 10 | Yes | Yes | 1 |
| swe7_bin | swe7 | 82 | | Yes | 1 |
| ascii_general_ci | ascii | 11 | Yes | Yes | 1 |
| ascii_bin | ascii | 65 | | Yes | 1 |
| ujis_japanese_ci | ujis | 12 | Yes | Yes | 1 |
| ujis_bin | ujis | 91 | | Yes | 1 |
| sjis_japanese_ci | sjis | 13 | Yes | Yes | 1 |
| sjis_bin | sjis | 88 | | Yes | 1 |
| hebrew_general_ci | hebrew | 16 | Yes | Yes | 1 |
| hebrew_bin | hebrew | 71 | | Yes | 1 |
| tis620_thai_ci | tis620 | 18 | Yes | Yes | 4 |
| tis620_bin | tis620 | 89 | | Yes | 1 |
| euckr_korean_ci | euckr | 19 | Yes | Yes | 1 |
| euckr_bin | euckr | 85 | | Yes | 1 |
| koi8u_general_ci | koi8u | 22 | Yes | Yes | 1 |
| koi8u_bin | koi8u | 75 | | Yes | 1 |
| gb2312_chinese_ci | gb2312 | 24 | Yes | Yes | 1 |
| gb2312_bin | gb2312 | 86 | | Yes | 1 |
| greek_general_ci | greek | 25 | Yes | Yes | 1 |
| greek_bin | greek | 70 | | Yes | 1 |
| cp1250_general_ci | cp1250 | 26 | Yes | Yes | 1 |
| cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
| cp1250_croatian_ci | cp1250 | 44 | | Yes | 1 |
| cp1250_bin | cp1250 | 66 | | Yes | 1 |
| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 |
| gbk_bin | gbk | 87 | | Yes | 1 |
| latin5_turkish_ci | latin5 | 30 | Yes | Yes | 1 |
| latin5_bin | latin5 | 78 | | Yes | 1 |
| armscii8_general_ci | armscii8 | 32 | Yes | Yes | 1 |
| armscii8_bin | armscii8 | 64 | | Yes | 1 |
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | | Yes | 1 |
| utf8_unicode_ci | utf8 | 192 | | Yes | 8 |
| utf8_icelandic_ci | utf8 | 193 | | Yes | 8 |
| utf8_latvian_ci | utf8 | 194 | | Yes | 8 |
| utf8_romanian_ci | utf8 | 195 | | Yes | 8 |
| utf8_slovenian_ci | utf8 | 196 | | Yes | 8 |
| utf8_polish_ci | utf8 | 197 | | Yes | 8 |
| utf8_estonian_ci | utf8 | 198 | | Yes | 8 |
| utf8_spanish_ci | utf8 | 199 | | Yes | 8 |
| utf8_swedish_ci | utf8 | 200 | | Yes | 8 |
| utf8_turkish_ci | utf8 | 201 | | Yes | 8 |
| utf8_czech_ci | utf8 | 202 | | Yes | 8 |
| utf8_danish_ci | utf8 | 203 | | Yes | 8 |
| utf8_lithuanian_ci | utf8 | 204 | | Yes | 8 |
| utf8_slovak_ci | utf8 | 205 | | Yes | 8 |
| utf8_spanish2_ci | utf8 | 206 | | Yes | 8 |
| utf8_roman_ci | utf8 | 207 | | Yes | 8 |
| utf8_persian_ci | utf8 | 208 | | Yes | 8 |
| utf8_esperanto_ci | utf8 | 209 | | Yes | 8 |
| utf8_hungarian_ci | utf8 | 210 | | Yes | 8 |
| ucs2_general_ci | ucs2 | 35 | Yes | Yes | 1 |
| ucs2_bin | ucs2 | 90 | | Yes | 1 |
| ucs2_unicode_ci | ucs2 | 128 | | Yes | 8 |
| ucs2_icelandic_ci | ucs2 | 129 | | Yes | 8 |
| ucs2_latvian_ci | ucs2 | 130 | | Yes | 8 |
| ucs2_romanian_ci | ucs2 | 131 | | Yes | 8 |
| ucs2_slovenian_ci | ucs2 | 132 | | Yes | 8 |
| ucs2_polish_ci | ucs2 | 133 | | Yes | 8 |
| ucs2_estonian_ci | ucs2 | 134 | | Yes | 8 |
| ucs2_spanish_ci | ucs2 | 135 | | Yes | 8 |
| ucs2_swedish_ci | ucs2 | 136 | | Yes | 8 |
| ucs2_turkish_ci | ucs2 | 137 | | Yes | 8 |
| ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
| ucs2_danish_ci | ucs2 | 139 | | Yes | 8 |
| ucs2_lithuanian_ci | ucs2 | 140 | | Yes | 8 |
| ucs2_slovak_ci | ucs2 | 141 | | Yes | 8 |
| ucs2_spanish2_ci | ucs2 | 142 | | Yes | 8 |
| ucs2_roman_ci | ucs2 | 143 | | Yes | 8 |
| ucs2_persian_ci | ucs2 | 144 | | Yes | 8 |
| ucs2_esperanto_ci | ucs2 | 145 | | Yes | 8 |
| ucs2_hungarian_ci | ucs2 | 146 | | Yes | 8 |
| cp866_general_ci | cp866 | 36 | Yes | Yes | 1 |
| cp866_bin | cp866 | 68 | | Yes | 1 |
| keybcs2_general_ci | keybcs2 | 37 | Yes | Yes | 1 |
| keybcs2_bin | keybcs2 | 73 | | Yes | 1 |
| macce_general_ci | macce | 38 | Yes | Yes | 1 |
| macce_bin | macce | 43 | | Yes | 1 |
| macroman_general_ci | macroman | 39 | Yes | Yes | 1 |
| macroman_bin | macroman | 53 | | Yes | 1 |
| cp852_general_ci | cp852 | 40 | Yes | Yes | 1 |
| cp852_bin | cp852 | 81 | | Yes | 1 |
| latin7_estonian_cs | latin7 | 20 | | Yes | 1 |
| latin7_general_ci | latin7 | 41 | Yes | Yes | 1 |
| latin7_general_cs | latin7 | 42 | | Yes | 1 |
| latin7_bin | latin7 | 79 | | Yes | 1 |
| cp1251_bulgarian_ci | cp1251 | 14 | | Yes | 1 |
| cp1251_ukrainian_ci | cp1251 | 23 | | Yes | 1 |
| cp1251_bin | cp1251 | 50 | | Yes | 1 |
| cp1251_general_ci | cp1251 | 51 | Yes | Yes | 1 |
| cp1251_general_cs | cp1251 | 52 | | Yes | 1 |
| cp1256_general_ci | cp1256 | 57 | Yes | Yes | 1 |
| cp1256_bin | cp1256 | 67 | | Yes | 1 |
| cp1257_lithuanian_ci | cp1257 | 29 | | Yes | 1 |
| cp1257_bin | cp1257 | 58 | | Yes | 1 |
| cp1257_general_ci | cp1257 | 59 | Yes | Yes | 1 |
| binary | binary | 63 | Yes | Yes | 1 |
| geostd8_general_ci | geostd8 | 92 | Yes | Yes | 1 |
| geostd8_bin | geostd8 | 93 | | Yes | 1 |
| cp932_japanese_ci | cp932 | 95 | Yes | Yes | 1 |
| cp932_bin | cp932 | 96 | | Yes | 1 |
| eucjpms_japanese_ci | eucjpms | 97 | Yes | Yes | 1 |
| eucjpms_bin | eucjpms | 98 | | Yes | 1 |
+----------------------+----------+-----+---------+----------+---------+

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: mysqldump re-encoding utf8 multi-byte characters - Why does latin1 solve the problem?
3200
January 10, 2011 02:11AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.