Unicode collation algorithm
Posted by: Brooks Brown
Date: April 12, 2005 04:42PM

I made a post earlier today expressing my frustration at 'a' matching a-acute and similar comparison results. I wanted to gain a deeper understanding of this problem, so I took a look at the server code that implements the Unicode Collation Algorithm (UCA).

It looks like an important source file is ctype-uca.c. It contains 256 tables of 16-bit numbers, grouped into sets of 3, 4, or 5 depending on the table. Many of the sets contain only one non-zero entry.
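To make that layout concrete, here is a rough sketch of how tables like this could be indexed. It assumes (my guess, not something I have confirmed in ctype-uca.c) that each of the 256 tables covers one high byte of a 16-bit code point, and that each set holds the weights for one character, padded with zeros. The names `lookup_weights`, `pages`, and `set_lengths` are made up for illustration.

```python
# Hypothetical paged weight table: one page per high byte of the code point.
# Each page stores fixed-size sets of 16-bit weights, padded with zeros.

def lookup_weights(pages, set_lengths, code_point):
    """Return the non-zero weights stored for code_point, or [] if its page is absent."""
    page_no = code_point >> 8      # high byte selects one of 256 pages (assumption)
    slot = code_point & 0xFF       # low byte selects the set within the page
    page = pages[page_no]
    if page is None:
        return []
    n = set_lengths[page_no]       # 3, 4, or 5 per the description above
    start = slot * n
    # Zeros are treated as padding and dropped.
    return [w for w in page[start:start + n] if w != 0]


# Toy data: page 0x00 with sets of 3; only 'a' (0x61) and a-acute (0xE1) filled in,
# both carrying the primary weight 0E33 taken from allkeys.txt.
SET_LEN = 3
page0 = [0] * (256 * SET_LEN)
page0[0x61 * SET_LEN] = 0x0E33
page0[0xE1 * SET_LEN] = 0x0E33
pages = [page0] + [None] * 255
set_lengths = [SET_LEN] + [0] * 255
```

With this toy data, `lookup_weights(pages, set_lengths, 0x61)` and `lookup_weights(pages, set_lengths, 0xE1)` return the same single weight, which is the behaviour I am seeing in comparisons.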

Comparing this to allkeys.txt at http://www.unicode.org/Public/UCA/latest/allkeys.txt, it looks like a lot of the data has been zeroed out. Is this what the comment at the top of the source file, "Only Primary level key comparison", means?

For my situation, with diacritical marks, it looks like allkeys.txt gives an accented character two collation entries: one identical to that of the unadorned character, and the other identical to that of the combining mark. For example:

0061 ; [.0E33.0020.0002.0061] # LATIN SMALL LETTER A
00E1 ; [.0E33.0020.0002.0061][.0000.0032.0002.0301] # LATIN SMALL LETTER A WITH ACUTE; QQCM

There appears to be no place for the second entry in the tables in the source file. Is this what the comment "No combining marks processing is done" means?
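To check my reading of the two allkeys.txt lines above, here is a small sketch that parses them and builds a simplified sort key. It assumes the first field in each bracket is the primary weight and the second is the secondary weight, and it skips normalization, contractions, and variable weighting, so it is not a full UCA implementation. `parse_line` and `sort_key` are names I made up.

```python
import re

# The two sample lines quoted above, copied from allkeys.txt.
ALLKEYS_SAMPLE = """\
0061 ; [.0E33.0020.0002.0061] # LATIN SMALL LETTER A
00E1 ; [.0E33.0020.0002.0061][.0000.0032.0002.0301] # LATIN SMALL LETTER A WITH ACUTE; QQCM
"""

# One bracketed collation entry, e.g. [.0E33.0020.0002.0061]
ELEMENT_RE = re.compile(r"\[[.*]([0-9A-F.]+)\]")


def parse_line(line):
    """Return (character, list of weight tuples) for one allkeys.txt line."""
    chars, rest = line.split(";", 1)
    elements_part = rest.split("#", 1)[0]          # drop the trailing comment
    key = "".join(chr(int(cp, 16)) for cp in chars.split())
    elements = [
        tuple(int(w, 16) for w in match.split("."))
        for match in ELEMENT_RE.findall(elements_part)
    ]
    return key, elements


def sort_key(text, table, levels):
    """Simplified sort key: per level, the non-zero weights, then a 0 separator."""
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(levels):
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)                              # level separator
    return tuple(key)


table = dict(parse_line(l) for l in ALLKEYS_SAMPLE.splitlines() if l.strip())

# Primary level only: the second entry of a-acute has primary weight 0000,
# so it contributes nothing and the two keys are identical.
print(sort_key("a", table, 1) == sort_key("\u00e1", table, 1))
# With the secondary level included, the 0032 weight makes the keys differ.
print(sort_key("a", table, 2) == sort_key("\u00e1", table, 2))
```

If this is right, a primary-only comparison cannot tell 'a' from a-acute, because the only weight that separates them sits at the secondary level of the second entry.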

One option for my company would be to contribute to the MySQL source code and in this way provide a solution to our problem.

