Re: Chinese sorting and filtering on a table with charset UTF-8 (solved)
Posted by: Tobias Prinz
Date: March 11, 2011 11:22AM

This is the order that Chinese people usually expect:
阿,波,次,的,鹅,富,哥,河,洁,科,了,么,呢,哦,批,七,如,四,踢,屋,西,衣,子

They are also used to something like this:
阿,波,次,的,鹅,哦,富,哥,河,洁,科,了,么,呢,批,七,如,四,踢,屋,西,衣,子

The 哦 can be pronounced either "o" or "e". For sorting purposes you'd expect the first order (Java's RuleBasedCollator produces it). If you use a standard translator to Pinyin such as pinyin4j (technically this would be a transliterator, I assume), you'll get the second: Pinyin translators give several readings per character, and the results are usually sorted alphabetically the way us westerners are used to, so the "e" variant comes before the "o" one. As I've been told, this is not perfect but generally accepted (the iPhone does it like that and Steve could never ever be wrong, right?^^)
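To make the difference between the two orders concrete, here is a minimal sketch in Python. The READINGS table is written by hand for the test characters only; it is not pinyin4j output, and it leaves out the extra readings some of the other characters have. The two function names are made up for this example.

```python
# Hand-written toneless Pinyin readings for the test characters.
# Assumption: illustrative data only, NOT taken from pinyin4j.
# Only 哦 is given two readings here, with the preferred one first.
READINGS = {
    "阿": ["a"], "波": ["bo"], "次": ["ci"], "的": ["de"], "鹅": ["e"],
    "富": ["fu"], "哥": ["ge"], "河": ["he"], "洁": ["jie"], "科": ["ke"],
    "了": ["le"], "么": ["me"], "呢": ["ne"], "哦": ["o", "e"], "批": ["pi"],
    "七": ["qi"], "如": ["ru"], "四": ["si"], "踢": ["ti"], "屋": ["wu"],
    "西": ["xi"], "衣": ["yi"], "子": ["zi"],
}

# Input in the order of the first list above (dicts keep insertion order).
CHARS = list(READINGS)


def sort_by_preferred_reading(chars):
    """Sort by the first (preferred) reading of each character."""
    return sorted(chars, key=lambda ch: READINGS[ch][0])


def sort_by_alphabetical_reading(chars):
    """Sort by the alphabetically smallest reading of each character,
    which is what you end up with if a transliterator hands back its
    candidates sorted a-z and you take the first one."""
    return sorted(chars, key=lambda ch: min(READINGS[ch]))


if __name__ == "__main__":
    print(",".join(sort_by_preferred_reading(CHARS)))
    print(",".join(sort_by_alphabetical_reading(CHARS)))
```

With this table, the first call keeps 哦 between 呢 and 批, and the second moves it next to 鹅. Python's sort is stable, so 鹅 stays in front of 哦 when both end up with the key "e".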

If you use the Unicode collations for MySQL, you get something that's completely off.

Why that is: no clue. It is not specific to MySQL (so the fact that CHARSET=utf8 only stores up to three bytes per character is not relevant here), as the same problem exists in Java, too. So it is probably something Unicode-specific. Maybe the order within Unicode is by radicals and strokes. I haven't checked that yet, because I found a solution based on the regional charsets that makes Chinese people happy (gb2312 and gbk are used in the PRC; Taiwan and Hong Kong use big5), so my work is done ;-)
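If you want to poke at this with the test data, the following Python sketch compares two orderings: plain Unicode code point order, and the order of the bytes after encoding to GBK. It rests on two assumptions of mine. First, that an untailored Unicode collation ends up close to code point order for Han characters. Second, that the common (level 1) hanzi in GB2312, and therefore in GBK, are laid out by Pinyin. The function names are made up for this example.

```python
# The order Chinese users expect, taken from the first list in this post.
PINYIN_ORDER = "阿,波,次,的,鹅,富,哥,河,洁,科,了,么,呢,哦,批,七,如,四,踢,屋,西,衣,子".split(",")


def by_code_point(chars):
    """Sort by Unicode code point (Python compares str this way)."""
    return sorted(chars)


def by_gbk_bytes(chars):
    """Sort by the raw bytes of each character encoded as GBK."""
    return sorted(chars, key=lambda ch: ch.encode("gbk"))


if __name__ == "__main__":
    print("code point:", ",".join(by_code_point(PINYIN_ORDER)))
    print("gbk bytes: ", ",".join(by_gbk_bytes(PINYIN_ORDER)))
    # Show the code points so the "completely off" order can be inspected.
    for ch in by_code_point(PINYIN_ORDER):
        print(ch, "U+%04X" % ord(ch), ch.encode("gbk").hex())
```

For these 23 characters the GBK byte order comes out the same as the expected Pinyin order, while the code point order starts with 七 and looks unrelated to pronunciation. In MySQL the same idea would be something along the lines of ORDER BY CONVERT(name USING gbk), where name stands for whatever column holds the text; treat that as a sketch to try against your own data rather than a guaranteed recipe.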

Sorry if I could not clear it up completely, but if you want to dig into it, you have a set of test data now.
