Chinese sorting and filtering on an table with charset UTF-8
Posted by: Tobias Prinz
Date: March 08, 2011 03:31AM

Hello there!

I've posted the same question to the mailing list, but I might have gotten the wrong one or it might be frequented less often. So sorry in case you're reading this twice:

I'm trying to get some software ready to work with Chinese/Hanzi characters/graphems/sinograms. My tables, using mysql 5.x, use UTF-8 for all text data. It is not an option to change that for a local Chinese installation.

Storing data works quite nicely. That's what I like about UTF-8.

But sorting does not work. And, related to that, comparisons neither. I'm getting the (hopefully silly) notion that there is no collation for CHARSET=UTF-8 that could sort according to the rules of GB2312 (which, if it was true, would probably an interesting story to explain why).

Hack #1: This leads me to fear that I might have to use a Pinyin translation library (like pinyin4j, which only translates character by character and is not bijective), store that in a separate column and sort that. Which would only give approximate results since without a huge dictionary backing, the lack of bijective mapping between Hanzi and Pinyin will result in a lack of information.

Hack #2: Or I could create a separate column with GB2312 as encoding and sort by this using the collation I want (gb2312_chinese_ci most likely).

The latter solution would provide better sorting than the former, but require me to do the same thing for other countries, too (I assume there'd be people interested in having big5 and that is still the same language, so let's not talk about Hindi...).

Hack #3: Or I could do it one level above the database layer - which is possible, but I assume any code I could write would be less sophisticated (and thereby slower) than what the database can do.

Previous research: I've seen the mail threads "Chinese order by with utf8" and "Chinese and MySQL / UTF8 and versions" which are a bit older than I'd like and also don't seem conclusive. I have seen forum threads "Using English and Chinese in the same database" (,375091,375091), which does not mention any collation at all and "specific collation issue" (,399245,399245) which seems to describe my problem, except with some ancient Egyptian text.

So what are my options here? I'm not married to the idea of using GB2312 - I would be willing to use any other Chinese collation, as long as it is better than the default UTF-8 one (which sorts by numerical value). But I really dislike my proposed hacks #1, #2 and #3.

Thanks in advance,

Options: ReplyQuote

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.