MySQL Forums
Forum List  »  Newbie

Re: Charset and Collation Question
Posted by: Rick James
Date: November 16, 2014 02:29PM

> So there is no way to be case insensitive but also not strip accents (or finesse it in someway)?

If I ever find a way, I will eagerly add it to the blog. I see it (treating case and accent separately) as a glaring deficiency.

I see that the blog fails to point out some of the things we have been talking about. Here is a new section:
http://mysql.rjweb.org/doc.php/charcoll#utf8_collations_examples

> I use autocompletes that make heavy use of LIKE so I would like it to be the case that both Louie and Loúie appear as results when searching for either of them as the value of the like clause.

I recommend utf8mb4_unicode_ci so that the autocomplete's LIKE will be fast. If you then need to switch to utf8_bin, for a specific test, it will be slow, but perhaps acceptable. (The index is abandoned when you use the "COLLATE xx" clause.)

> So if I understand you correctly, I should use utf8mb4_bin so that WHERE name = 'Louie' only selects Louie, not Loúie or louie (lowercase), and then use utf8mb4_general_ci when I want a LIKE clause to return both Louie and Loúie for 'Louie' or 'louie' (lower case).

Correct. The new section also covers a another case: 'combining' accents.

A programming technique that solves your category of problem is ...
Have 2 columns instead of one; let's call them `exact` and `canonical`. `exact` contains the correctly spelled (caps, accents, etc) value, and it has a suitable collation. `canonical` is a programmatically-altered version of `exact`. For you, this may include stripping accents and folding to lowercase. In other programming situations, it may involve striping redundant whitespace (blanks, newlines, tabs), removing markup (such as html <tags>), etc. `canonical` would also have an index, possibly FULLTEXT and/or utf8mb4_bin. `canonical` may have duplicates even though `exact` probably does not.

As you load the table, you store `exact` as is, then transform it to get `canonical`. Likewise, search words (or auto_complete prefixes) would have to be modified before searching `canonical`.

Your autocorrect involves searching `canonical` for 'louie'. It will get two hits, then display 'Louie' and 'Loúie' (from `exact`).

Options: ReplyQuote


Subject
Written By
Posted
November 15, 2014 08:55AM
Re: Charset and Collation Question
November 16, 2014 02:29PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.