Re: Croatian characters not surviving LOAD DATA INFILE from UTF-8 .txt file
Posted by: Paul Pikowsky
Date: August 23, 2011 10:38AM

Many thanks for your prompt reply.

Quote

Kupreškić <-- This has the letters encoded according to some character set (probably utf8 on this forum)
Kupre%C5%A1ki%C4%87 <-- This is "html encoding". It is understood only by browsers.

I am not concerned about 'Kupre%C5%A1ki%C4%87' and other similar characters, these are surviving intact from the LOAD DATA INFILE command. They represent a URL, a link, and so need to use these characters.

Quote

You have CHARACTER SET latin2 on the table.
What arguments did you use when doing the LOAD DATA? That makes a big difference.

Here is the command copy and pasted from the command line:

LOAD DATA LOCAL INFILE '/tmp/XXXXX.txt' INTO TABLE xxxxx FIELDS TERMINATED BY '**';

Quote

Are the values in the table correct? That is hard to test, SELECT HEX may help. See this for further discussion:
http://mysql.rjweb.org/doc.php/charcoll

I've seen this reference from searches I did at this site, could you direct me to a particular part? Where, for example, do I apply "SELECT HEX"?

Quote

Kupre??kiÄ? <-- This is very likely due to incorrect settings either during loading of the data, or during fetching the data. Again see my web page.

I think 5.5.12 has a utf8_croatian_ci; it might be better in the long run to use utf8 instead of latin2 -- this would let you use character sets from all languages, and might avoid the need for different columns having different charsets.
SHOW COLLATION;

Kupre??kiÄ? This kind of result is what is causing the problem.

Here are the results of SHOW COLLATION:

mysql> SHOW COLLATION;
+-----------------------+----------+-----+---------+----------+---------+
| Collation             | Charset  | Id  | Default | Compiled | Sortlen |
+-----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci       | big5     |   1 | Yes     | Yes      |       1 |
| big5_bin              | big5     |  84 |         | Yes      |       1 |
| dec8_swedish_ci       | dec8     |   3 | Yes     | Yes      |       1 |
| dec8_bin              | dec8     |  69 |         | Yes      |       1 |
| cp850_general_ci      | cp850    |   4 | Yes     | Yes      |       1 |
| cp850_bin             | cp850    |  80 |         | Yes      |       1 |
| hp8_english_ci        | hp8      |   6 | Yes     | Yes      |       1 |
| hp8_bin               | hp8      |  72 |         | Yes      |       1 |
| koi8r_general_ci      | koi8r    |   7 | Yes     | Yes      |       1 |
| koi8r_bin             | koi8r    |  74 |         | Yes      |       1 |
| latin1_german1_ci     | latin1   |   5 |         | Yes      |       1 |
| latin1_swedish_ci     | latin1   |   8 | Yes     | Yes      |       1 |
| latin1_danish_ci      | latin1   |  15 |         | Yes      |       1 |
| latin1_german2_ci     | latin1   |  31 |         | Yes      |       2 |
| latin1_bin            | latin1   |  47 |         | Yes      |       1 |
| latin1_general_ci     | latin1   |  48 |         | Yes      |       1 |
| latin1_general_cs     | latin1   |  49 |         | Yes      |       1 |
| latin1_spanish_ci     | latin1   |  94 |         | Yes      |       1 |
| latin2_czech_cs       | latin2   |   2 |         | Yes      |       4 |
| latin2_general_ci     | latin2   |   9 | Yes     | Yes      |       1 |
| latin2_hungarian_ci   | latin2   |  21 |         | Yes      |       1 |
| latin2_croatian_ci    | latin2   |  27 |         | Yes      |       1 |
| latin2_bin            | latin2   |  77 |         | Yes      |       1 |
| swe7_swedish_ci       | swe7     |  10 | Yes     | Yes      |       1 |
| swe7_bin              | swe7     |  82 |         | Yes      |       1 |
| ascii_general_ci      | ascii    |  11 | Yes     | Yes      |       1 |
| ascii_bin             | ascii    |  65 |         | Yes      |       1 |
| ujis_japanese_ci      | ujis     |  12 | Yes     | Yes      |       1 |
| ujis_bin              | ujis     |  91 |         | Yes      |       1 |
| sjis_japanese_ci      | sjis     |  13 | Yes     | Yes      |       1 |
| sjis_bin              | sjis     |  88 |         | Yes      |       1 |
| hebrew_general_ci     | hebrew   |  16 | Yes     | Yes      |       1 |
| hebrew_bin            | hebrew   |  71 |         | Yes      |       1 |
| tis620_thai_ci        | tis620   |  18 | Yes     | Yes      |       4 |
| tis620_bin            | tis620   |  89 |         | Yes      |       1 |
| euckr_korean_ci       | euckr    |  19 | Yes     | Yes      |       1 |
| euckr_bin             | euckr    |  85 |         | Yes      |       1 |
| koi8u_general_ci      | koi8u    |  22 | Yes     | Yes      |       1 |
| koi8u_bin             | koi8u    |  75 |         | Yes      |       1 |
| gb2312_chinese_ci     | gb2312   |  24 | Yes     | Yes      |       1 |
| gb2312_bin            | gb2312   |  86 |         | Yes      |       1 |
| greek_general_ci      | greek    |  25 | Yes     | Yes      |       1 |
| greek_bin             | greek    |  70 |         | Yes      |       1 |
| cp1250_general_ci     | cp1250   |  26 | Yes     | Yes      |       1 |
| cp1250_czech_cs       | cp1250   |  34 |         | Yes      |       2 |
| cp1250_croatian_ci    | cp1250   |  44 |         | Yes      |       1 |
| cp1250_bin            | cp1250   |  66 |         | Yes      |       1 |
| cp1250_polish_ci      | cp1250   |  99 |         | Yes      |       1 |
| gbk_chinese_ci        | gbk      |  28 | Yes     | Yes      |       1 |
| gbk_bin               | gbk      |  87 |         | Yes      |       1 |
| latin5_turkish_ci     | latin5   |  30 | Yes     | Yes      |       1 |
| latin5_bin            | latin5   |  78 |         | Yes      |       1 |
| armscii8_general_ci   | armscii8 |  32 | Yes     | Yes      |       1 |
| armscii8_bin          | armscii8 |  64 |         | Yes      |       1 |
| utf8_general_ci       | utf8     |  33 | Yes     | Yes      |       1 |
| utf8_bin              | utf8     |  83 |         | Yes      |       1 |
| utf8_unicode_ci       | utf8     | 192 |         | Yes      |       8 |
| utf8_icelandic_ci     | utf8     | 193 |         | Yes      |       8 |
| utf8_latvian_ci       | utf8     | 194 |         | Yes      |       8 |
| utf8_romanian_ci      | utf8     | 195 |         | Yes      |       8 |
| utf8_slovenian_ci     | utf8     | 196 |         | Yes      |       8 |
| utf8_polish_ci        | utf8     | 197 |         | Yes      |       8 |
| utf8_estonian_ci      | utf8     | 198 |         | Yes      |       8 |
| utf8_spanish_ci       | utf8     | 199 |         | Yes      |       8 |
| utf8_swedish_ci       | utf8     | 200 |         | Yes      |       8 |
| utf8_turkish_ci       | utf8     | 201 |         | Yes      |       8 |
| utf8_czech_ci         | utf8     | 202 |         | Yes      |       8 |
| utf8_danish_ci        | utf8     | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci    | utf8     | 204 |         | Yes      |       8 |
| utf8_slovak_ci        | utf8     | 205 |         | Yes      |       8 |
| utf8_spanish2_ci      | utf8     | 206 |         | Yes      |       8 |
| utf8_roman_ci         | utf8     | 207 |         | Yes      |       8 |
| utf8_persian_ci       | utf8     | 208 |         | Yes      |       8 |
| utf8_esperanto_ci     | utf8     | 209 |         | Yes      |       8 |
| utf8_hungarian_ci     | utf8     | 210 |         | Yes      |       8 |
| utf8_sinhala_ci       | utf8     | 211 |         | Yes      |       8 |
| ucs2_general_ci       | ucs2     |  35 | Yes     | Yes      |       1 |
| ucs2_bin              | ucs2     |  90 |         | Yes      |       1 |
| ucs2_unicode_ci       | ucs2     | 128 |         | Yes      |       8 |
| ucs2_icelandic_ci     | ucs2     | 129 |         | Yes      |       8 |
| ucs2_latvian_ci       | ucs2     | 130 |         | Yes      |       8 |
| ucs2_romanian_ci      | ucs2     | 131 |         | Yes      |       8 |
| ucs2_slovenian_ci     | ucs2     | 132 |         | Yes      |       8 |
| ucs2_polish_ci        | ucs2     | 133 |         | Yes      |       8 |
| ucs2_estonian_ci      | ucs2     | 134 |         | Yes      |       8 |
| ucs2_spanish_ci       | ucs2     | 135 |         | Yes      |       8 |
| ucs2_swedish_ci       | ucs2     | 136 |         | Yes      |       8 |
| ucs2_turkish_ci       | ucs2     | 137 |         | Yes      |       8 |
| ucs2_czech_ci         | ucs2     | 138 |         | Yes      |       8 |
| ucs2_danish_ci        | ucs2     | 139 |         | Yes      |       8 |
| ucs2_lithuanian_ci    | ucs2     | 140 |         | Yes      |       8 |
| ucs2_slovak_ci        | ucs2     | 141 |         | Yes      |       8 |
| ucs2_spanish2_ci      | ucs2     | 142 |         | Yes      |       8 |
| ucs2_roman_ci         | ucs2     | 143 |         | Yes      |       8 |
| ucs2_persian_ci       | ucs2     | 144 |         | Yes      |       8 |
| ucs2_esperanto_ci     | ucs2     | 145 |         | Yes      |       8 |
| ucs2_hungarian_ci     | ucs2     | 146 |         | Yes      |       8 |
| ucs2_sinhala_ci       | ucs2     | 147 |         | Yes      |       8 |
| cp866_general_ci      | cp866    |  36 | Yes     | Yes      |       1 |
| cp866_bin             | cp866    |  68 |         | Yes      |       1 |
| keybcs2_general_ci    | keybcs2  |  37 | Yes     | Yes      |       1 |
| keybcs2_bin           | keybcs2  |  73 |         | Yes      |       1 |
| macce_general_ci      | macce    |  38 | Yes     | Yes      |       1 |
| macce_bin             | macce    |  43 |         | Yes      |       1 |
| macroman_general_ci   | macroman |  39 | Yes     | Yes      |       1 |
| macroman_bin          | macroman |  53 |         | Yes      |       1 |
| cp852_general_ci      | cp852    |  40 | Yes     | Yes      |       1 |
| cp852_bin             | cp852    |  81 |         | Yes      |       1 |
| latin7_estonian_cs    | latin7   |  20 |         | Yes      |       1 |
| latin7_general_ci     | latin7   |  41 | Yes     | Yes      |       1 |
| latin7_general_cs     | latin7   |  42 |         | Yes      |       1 |
| latin7_bin            | latin7   |  79 |         | Yes      |       1 |
| utf8mb4_general_ci    | utf8mb4  |  45 | Yes     | Yes      |       1 |
| utf8mb4_bin           | utf8mb4  |  46 |         | Yes      |       1 |
| utf8mb4_unicode_ci    | utf8mb4  | 224 |         | Yes      |       8 |
| utf8mb4_icelandic_ci  | utf8mb4  | 225 |         | Yes      |       8 |
| utf8mb4_latvian_ci    | utf8mb4  | 226 |         | Yes      |       8 |
| utf8mb4_romanian_ci   | utf8mb4  | 227 |         | Yes      |       8 |
| utf8mb4_slovenian_ci  | utf8mb4  | 228 |         | Yes      |       8 |
| utf8mb4_polish_ci     | utf8mb4  | 229 |         | Yes      |       8 |
| utf8mb4_estonian_ci   | utf8mb4  | 230 |         | Yes      |       8 |
| utf8mb4_spanish_ci    | utf8mb4  | 231 |         | Yes      |       8 |
| utf8mb4_swedish_ci    | utf8mb4  | 232 |         | Yes      |       8 |
| utf8mb4_turkish_ci    | utf8mb4  | 233 |         | Yes      |       8 |
| utf8mb4_czech_ci      | utf8mb4  | 234 |         | Yes      |       8 |
| utf8mb4_danish_ci     | utf8mb4  | 235 |         | Yes      |       8 |
| utf8mb4_lithuanian_ci | utf8mb4  | 236 |         | Yes      |       8 |
| utf8mb4_slovak_ci     | utf8mb4  | 237 |         | Yes      |       8 |
| utf8mb4_spanish2_ci   | utf8mb4  | 238 |         | Yes      |       8 |
| utf8mb4_roman_ci      | utf8mb4  | 239 |         | Yes      |       8 |
| utf8mb4_persian_ci    | utf8mb4  | 240 |         | Yes      |       8 |
| utf8mb4_esperanto_ci  | utf8mb4  | 241 |         | Yes      |       8 |
| utf8mb4_hungarian_ci  | utf8mb4  | 242 |         | Yes      |       8 |
| utf8mb4_sinhala_ci    | utf8mb4  | 243 |         | Yes      |       8 |
| cp1251_bulgarian_ci   | cp1251   |  14 |         | Yes      |       1 |
| cp1251_ukrainian_ci   | cp1251   |  23 |         | Yes      |       1 |
| cp1251_bin            | cp1251   |  50 |         | Yes      |       1 |
| cp1251_general_ci     | cp1251   |  51 | Yes     | Yes      |       1 |
| cp1251_general_cs     | cp1251   |  52 |         | Yes      |       1 |
| utf16_general_ci      | utf16    |  54 | Yes     | Yes      |       1 |
| utf16_bin             | utf16    |  55 |         | Yes      |       1 |
| utf16_unicode_ci      | utf16    | 101 |         | Yes      |       8 |
| utf16_icelandic_ci    | utf16    | 102 |         | Yes      |       8 |
| utf16_latvian_ci      | utf16    | 103 |         | Yes      |       8 |
| utf16_romanian_ci     | utf16    | 104 |         | Yes      |       8 |
| utf16_slovenian_ci    | utf16    | 105 |         | Yes      |       8 |
| utf16_polish_ci       | utf16    | 106 |         | Yes      |       8 |
| utf16_estonian_ci     | utf16    | 107 |         | Yes      |       8 |
| utf16_spanish_ci      | utf16    | 108 |         | Yes      |       8 |
| utf16_swedish_ci      | utf16    | 109 |         | Yes      |       8 |
| utf16_turkish_ci      | utf16    | 110 |         | Yes      |       8 |
| utf16_czech_ci        | utf16    | 111 |         | Yes      |       8 |
| utf16_danish_ci       | utf16    | 112 |         | Yes      |       8 |
| utf16_lithuanian_ci   | utf16    | 113 |         | Yes      |       8 |
| utf16_slovak_ci       | utf16    | 114 |         | Yes      |       8 |
| utf16_spanish2_ci     | utf16    | 115 |         | Yes      |       8 |
| utf16_roman_ci        | utf16    | 116 |         | Yes      |       8 |
| utf16_persian_ci      | utf16    | 117 |         | Yes      |       8 |
| utf16_esperanto_ci    | utf16    | 118 |         | Yes      |       8 |
| utf16_hungarian_ci    | utf16    | 119 |         | Yes      |       8 |
| utf16_sinhala_ci      | utf16    | 120 |         | Yes      |       8 |
| cp1256_general_ci     | cp1256   |  57 | Yes     | Yes      |       1 |
| cp1256_bin            | cp1256   |  67 |         | Yes      |       1 |
| cp1257_lithuanian_ci  | cp1257   |  29 |         | Yes      |       1 |
| cp1257_bin            | cp1257   |  58 |         | Yes      |       1 |
| cp1257_general_ci     | cp1257   |  59 | Yes     | Yes      |       1 |
| utf32_general_ci      | utf32    |  60 | Yes     | Yes      |       1 |
| utf32_bin             | utf32    |  61 |         | Yes      |       1 |
| utf32_unicode_ci      | utf32    | 160 |         | Yes      |       8 |
| utf32_icelandic_ci    | utf32    | 161 |         | Yes      |       8 |
| utf32_latvian_ci      | utf32    | 162 |         | Yes      |       8 |
| utf32_romanian_ci     | utf32    | 163 |         | Yes      |       8 |
| utf32_slovenian_ci    | utf32    | 164 |         | Yes      |       8 |
| utf32_polish_ci       | utf32    | 165 |         | Yes      |       8 |
| utf32_estonian_ci     | utf32    | 166 |         | Yes      |       8 |
| utf32_spanish_ci      | utf32    | 167 |         | Yes      |       8 |
| utf32_swedish_ci      | utf32    | 168 |         | Yes      |       8 |
| utf32_turkish_ci      | utf32    | 169 |         | Yes      |       8 |
| utf32_czech_ci        | utf32    | 170 |         | Yes      |       8 |
| utf32_danish_ci       | utf32    | 171 |         | Yes      |       8 |
| utf32_lithuanian_ci   | utf32    | 172 |         | Yes      |       8 |
| utf32_slovak_ci       | utf32    | 173 |         | Yes      |       8 |
| utf32_spanish2_ci     | utf32    | 174 |         | Yes      |       8 |
| utf32_roman_ci        | utf32    | 175 |         | Yes      |       8 |
| utf32_persian_ci      | utf32    | 176 |         | Yes      |       8 |
| utf32_esperanto_ci    | utf32    | 177 |         | Yes      |       8 |
| utf32_hungarian_ci    | utf32    | 178 |         | Yes      |       8 |
| utf32_sinhala_ci      | utf32    | 179 |         | Yes      |       8 |
| binary                | binary   |  63 | Yes     | Yes      |       1 |
| geostd8_general_ci    | geostd8  |  92 | Yes     | Yes      |       1 |
| geostd8_bin           | geostd8  |  93 |         | Yes      |       1 |
| cp932_japanese_ci     | cp932    |  95 | Yes     | Yes      |       1 |
| cp932_bin             | cp932    |  96 |         | Yes      |       1 |
| eucjpms_japanese_ci   | eucjpms  |  97 | Yes     | Yes      |       1 |
| eucjpms_bin           | eucjpms  |  98 |         | Yes      |       1 |
+-----------------------+----------+-----+---------+----------+---------+
195 rows in set (0.00 sec)

I don't see utf8_croatian_ci in there. Can I add it in some way?

Quote

"%C5%A1" -- This smells a lot like it came from a utf8 encoding (hex C5A1), not latin2. Perhaps your data is really utf8? Another argument for switching from latin1/2 to utf8.

Again, these characters are not a problem and have survived from the original, they are copies of links.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: Croatian characters not surviving LOAD DATA INFILE from UTF-8 .txt file
4101
August 23, 2011 10:38AM
2637
August 23, 2011 10:58AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.