Re: load data infile ... with UTF-8 characters
Posted by: Bruce Malmat
Date: September 23, 2015 11:49AM

And the problem is back! I really can't understand this.
HISTORY:
Tables are declared UTF-8.

The LOAD DATA INFILE code originally declared CHARACTER SET 'utf8'
UTF8 chars loaded fine ... for a while.
Then it started to fail. Lots of perfectly proper UTF8 chars were loading as junk.
Probably converted (wrongly!) to 2-byte UTF16?

I changed the code to CHARACTER SET 'latin', based on ten-year-old advice on a forum. It worked properly again!

That lasted a week. Failure once again. Changed back to CHARACTER SET 'utf8' as before.
But I prefixed the load with SET NAMES 'utf8';
Like this:
SET NAMES 'utf8';
LOAD DATA INFILE 'C:\\Users\\joe\\Desktop\\test\\suffix.mod.tab'
IGNORE
INTO TABLE catalog.s1_suffix
CHARACTER SET 'utf8'
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
(sg_ID, id_stamp, @var_descrip, @var_prices, @var_suffix, @var_suffix_ind)
set
description = trim(@var_descrip),
prices = trim(@var_prices),
suffix = trim(@var_suffix),
suffix_ind = trim(@var_suffix_ind);
That worked for about a week. Now it fails.

One specific case (of many) has left- and right-quotes, as in this text that appears in a one-line/one-row load file:
With "Belgica 2001" Interational

The text above shows plain/dumb ASCII quotes because I can't do them right in this forum text-box.
But the actual load file has proper UTF8 left/right quotes. Here's its hex dump:

hex: 57 69 74 68 20 e2 80 9c 42 65 6c 67 69 63 61 20
hex: 32 30 30 31 e2 80 9d 20 49 6e 74 65 72 6e

char: With e2 80 9c Belgica
char: 2001 e2 80 9d Intern...
The e2 80 9c and e2 80 9d codes are the correct left- and right-quote encodings.

Why does this work ... and then not?

Why does it work when the offending quotes are NOT in the first row of the flat-file?
(I added another row into the load file. That row had only pure ASCII. The one with the quotes became row 2. And then it loaded properly!)

And why is it that the load runs to completion when faced with other "offending" UTF8 characters (other than dbl-quotes) ... but nevertheless the load is "false" because those UTF8 chars get converted to wrong (UFT16?) chars.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: load data infile ... with UTF-8 characters
1812
September 23, 2015 11:49AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.