utf8 collating in mysql 5.0.16
Posted by: Rand Childs
Date: December 20, 2005 03:57PM

Yesterday I installed mysql 5.0.16 on my Mac OS X 10.4.3 system, and I am trying to understand how its utf8 collating relates to the Unicode Collation Algorithm (UCA, Unicode Technical Standard #10) and how it compares to Oracle's GENERIC_M.

I took the list of strings from the "Variable Weighting" section of the Unicode collation report, http://www.unicode.org/reports/tr10/, and entered it into both Oracle (configured for utf8 and GENERIC_M) and mysql 5.0.16 (configured for utf8 and utf8_general_ci). Oracle collates the data as follows:

de luge
de Luge

whereas mysql collates the data as follows:

| de luge |
| de Luge |
| de-Luge |
| de-luge |
| death |
| deluge |
| deLuge |
| demark |
| de‐Luge |
| de‐luge |

Neither collation agrees with the table in the "Variable Weighting" section for any of the four variable weighting options. In fact, mysql just looks wrong. For example, the capital L collates before the lower case l if preceded by a hyphen-minus or hyphen. In Oracle, the hyphen-minus (U+002D) and the hyphen (U+2010) appear to collate together after the space, in a tertiary position. mysql, on the other hand, makes the space significant and doesn't collate the two hyphens together.

Does anyone know why mysql is collating this data as it is? It doesn't look like it is collating according to the Unicode Collation Algorithm or ISO 14651. Are others seeing these problems with the utf8 collations that are supposed to conform to that standard? Is there a way to fix it?
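
One possible reading of the mysql output, offered as an assumption rather than confirmed behavior: utf8_general_ci compares strings one character at a time, using a single case-insensitive weight per character and no ignorable (variable-weighted) characters. The Python sketch below imitates that idea with upper-cased code points. The function name general_ci_key is hypothetical, and the sketch ignores details such as accent folding, so it is only a rough model of the collation.

```python
def general_ci_key(text):
    """Approximate sort key: one case-folded weight per character.

    Assumption: this mimics a simple one-to-one, case-insensitive
    comparison. It is NOT the actual MySQL implementation.
    """
    weights = []
    for ch in text:
        upper = ch.upper()
        # Keep a one-to-one mapping: if upper-casing expands the character
        # (for example, "ß" becomes "SS"), fall back to the original code point.
        weights.append(ord(upper) if len(upper) == 1 else ord(ch))
    return tuple(weights)


# The strings from the post, in scrambled order.
# "\u2010" is HYPHEN; "-" is HYPHEN-MINUS (U+002D).
words = [
    "demark", "de\u2010luge", "deLuge", "de-luge", "death",
    "de Luge", "deluge", "de-Luge", "de luge", "de\u2010Luge",
]

for word in sorted(words, key=general_ci_key):
    print(word)
```

Under this model the third character decides everything: space (U+0020) sorts before hyphen-minus (U+002D), which sorts before the letters, and hyphen (U+2010) sorts after all of them. That matches the grouping in the mysql output above. The case variants compare as equal, so their relative order (capital L before lower case l, or the reverse) would not be determined by the collation at all. If this reading is right, utf8_unicode_ci, which the MySQL manual describes as based on the Unicode Collation Algorithm, may be the more appropriate collation to compare against the report.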


