Re: utf8_unicode_ci vs utf8_general_ci
Posted by: Alexander Barkov
Date: December 18, 2007 04:03PM


You can check and compare sort orders provided by these two collations here:

utf8_general_ci is a very simple collation. What it does - it just
- removes all accents
- then converts to upper case
and uses the code of this sort of "base letter" result letter to compare.

For example, these Latin letters: ÀÁÅåāă (and all other Latin letters "a"
with any accents and in any cases) are all compared as equal to "A".

utf8_unicode_ci uses the default Unicode collation element table (DUCET).

The main differences are:

1. utf8_unicode_ci supports so called expansions and ligatures, for example:
German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"
Letter Π(U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions/ligatures, it sorts
all these letters as single characters, and sometimes in a wrong order.

2. utf8_unicode_ci is *generally* more accurate for all scripts.
For example, on Cyrillic block:
utf8_unicode_ci is fine for all these languages:
Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian.
While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic.
Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian
are sorted not well.

The disadvantage of utf8_unicode_ci is that it is a little bit
slower than utf8_general_ci.

So when you need better sorting order - use utf8_unicode_ci,
and when you utterly interested in performance - use utf8_general_ci.

Options: ReplyQuote

Written By
December 07, 2007 12:47AM
Re: utf8_unicode_ci vs utf8_general_ci
December 18, 2007 04:03PM

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.