ruby collation
Posted by: mysql
Date: June 23, 2005 01:05PM

Hello, I have a funky idea I'd like to throw at the MySQL gurus out there and see what they think in terms of feasibility.

For the past year I've been working in a Japanese environment, and something that astonished me is that there doesn't seem to be any good way to sort Japanese text. Because a kanji can have multiple readings, it can be sorted in different ways depending on how it's meant to be read. So various solutions are used, including:
1. use JIS code order which is sort-of-mostly-kinda-phonetic-but-not-exactly
2. use a dictionary to translate words to phonetic kana
3. store both a "display" and a "sort/search" field in the database
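
To make option 3 concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for a real database. The table name `people`, the column names `display` and `reading`, and the helper `names_in_reading_order` are all made up for illustration. It assumes the readings are stored as hiragana and that byte order on them is close enough to phonetic order.

```python
import sqlite3

def names_in_reading_order(pairs):
    """Return display strings ordered by their separately stored reading.

    `pairs` is a list of (display, reading) tuples; both names are
    hypothetical column names chosen for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (display TEXT, reading TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", pairs)
    # Sort on the phonetic column, show the kanji column.
    rows = conn.execute("SELECT display FROM people ORDER BY reading").fetchall()
    conn.close()
    return [row[0] for row in rows]

people = [("山田", "やまだ"), ("田中", "たなか"), ("伊藤", "いとう")]
print(names_in_reading_order(people))
```

The obvious cost is that every value has to be entered twice and the two columns can drift apart.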

Being a student of Japanese, I make extensive use of furigana (a.k.a. ruby annotation), which indicates the pronunciation above a given kanji. Now, since MySQL 4.1 supports Unicode, and since Unicode includes characters for ruby annotation... would it be possible to create a collation that makes use of ruby-annotated Unicode text?

The 3 ruby annotation characters are:
<FFF9> INTERLINEAR ANNOTATION ANCHOR
<FFFA> INTERLINEAR ANNOTATION SEPARATOR
<FFFB> INTERLINEAR ANNOTATION TERMINATOR
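
For illustration, here is a rough Python sketch (not anything MySQL-specific) of pulling a string that uses these three characters apart into its base text and its annotation. The function names `base_text` and `reading_text` are hypothetical, and the sketch assumes annotations are well-formed and not nested.

```python
import re

ANCHOR = "\ufff9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\ufffa"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\ufffb"  # INTERLINEAR ANNOTATION TERMINATOR

# anchor, base text, separator, annotation, terminator (no nesting assumed)
_ANNOTATED = re.compile(
    ANCHOR + r"([^\ufff9-\ufffb]*)" + SEPARATOR + r"([^\ufff9-\ufffb]*)" + TERMINATOR
)

def base_text(s):
    """Replace each annotated run with its base text (what is displayed)."""
    return _ANNOTATED.sub(lambda m: m.group(1), s)

def reading_text(s):
    """Replace each annotated run with its annotation (how it is read)."""
    return _ANNOTATED.sub(lambda m: m.group(2), s)

hito = ANCHOR + "人" + SEPARATOR + "ひと" + TERMINATOR
print(base_text(hito), reading_text(hito))
```

Text without any annotation characters passes through both functions unchanged.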

By using those characters in a string, we can end up with all kinds of interesting cases like:
"A" == "<FFF9>A<FFFA>capital a<FFFB>" == "capital a"
"人" == "<FFF9>人<FFFA>ひと<FFFB>" == "ひと"
"<FFF9>人<FFFA>ひと<FFFB>" =?= "<FFF9>人<FFFA>じん<FFFB>"

I think that the ability to arbitrarily define new equalities like that is incredibly cool. Of course, I have no idea how a sorting algorithm would react to the fact that A=B and B=C but A<>C. Probably the best approach would be to use only the second part (the annotation) for sorting. Actually, I'm not even sure this is a matter of collation. Maybe it's something that can/should be implemented in the comparison/sorting algorithm.
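
As a rough sketch of the "use only the second part" idea, here it is in Python rather than as an actual MySQL collation. The name `sort_key` is made up, annotations are assumed to be well-formed and not nested, and plain code-point order on kana is only an approximation of proper phonetic order (voiced kana and katakana would need extra normalization).

```python
import re

# anchor (U+FFF9), base text, separator (U+FFFA), annotation, terminator (U+FFFB)
_ANNOTATED = re.compile(
    "\ufff9" + r"([^\ufff9-\ufffb]*)" + "\ufffa" + r"([^\ufff9-\ufffb]*)" + "\ufffb"
)

def sort_key(s):
    """Key that keeps only the annotation of each annotated run.

    Unannotated text is left as it is, so it compares as itself.
    """
    return _ANNOTATED.sub(lambda m: m.group(2), s)

def annotate(base, reading):
    """Build an annotated string from a base text and its reading."""
    return "\ufff9" + base + "\ufffa" + reading + "\ufffb"

words = [annotate("山", "やま"), annotate("人", "ひと"), annotate("犬", "いぬ")]
for w in sorted(words, key=sort_key):
    print(sort_key(w))
```

Because every string is reduced to a single key, equality stays transitive: the ひと-annotated 人 compares equal to plain ひと, but not to plain 人 and not to the じん-annotated 人.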

So, does anyone think this is feasible? Does it even make sense? Am I crazy?
