MySQL Forums
Forum List  »  Newbie

Re: Charset and Collation Question
Posted by: Dean Richert
Date: November 16, 2014 05:58AM

Rick James Wrote:
-------------------------------------------------------
> > I want to support pretty much every major
> language in all of my tables. What should my
> database’s charset and collation be?
>
> utf8mb4, utf8mb4_general_ci
>
> Be sure to say utf8mb4 all over the place. In
> particular, after connecting (in your app), do
> SET NAMES utf8mb4
> This is assuming that the data you have is encoded
> as utf8. If not, please elaborate.
>

It's all utf8.

> > with some collations because something like
> value may be equivalent to valúe
>
> The *_ci collations are "Case Insensitive", but
> actually they also strip accents before comparing.


Didn't know that. So there is no way to be case insensitive but also not strip accents (or finesse it in someway)?

Your page suggests no: "Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation for case folding, but keeping accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way."

> Hence, these are all the same in a _ci
> collation:
> value ValuE valúe
>
> > u to be equivalent to ú
>
> All the Western European languages work pretty
> much the same for all the accents (as far as I
> know). The blog Peter pointed you at gives a list
> of equivalences for many of the utf8 collations.
> The utf8mb4 collations should be identical (in
> that area).
>
> I suggested utf8mb4, not utf8, because Chinese was
> in your list. (See the section in that blog.)
>

Thanks, that was part of what I needed to know.


> > And if I wanted to support the languages above
> but NOT allow u to be equivalent to ú, which
> charset and collation should I choose?
>
> Do you have an example of such a language? I need
> to learn about it.
>

I don't know of any language like that either, but the case I am considering is a statement similar to this:

SELECT name FROM users WHERE name = 'Louie'

I would like for this to select only the user with the name Louie, and not potentially someone named Loúie.

On the other hand, I use autocompletes that make heavy use of LIKE so I would like it to be the case that both Louie and Loúie appear as results when searching for either of them as the value of the like clause.


> utf8mb4_bin collation compares bits -- no case
> folding, no accent stripping, etc. Hence the 3
> "values" compare as different.
>
> Collation is for determining equality (SELECT ...
> WHERE x = '...') and for ordering (ORDER BY x).
>
> It is probably best to use utf8mb4_general_ci for
> all your strings. Then if you have exceptions (u
> not always equal to ú), you handle it either in
> your app, or by specifying the collation to use
> when performing the SQL -- caution: this tends to
> prevent use of INDEXes, hence be slower.
>

Okay - I will still be able to use indexes with utf8mb4_general_ci, but if I set the collation for a particular statement that is contrary to the what the table or column has as a default, that particular statement may not make use of indexes.


> MySQL allows a different CHARACTER SET and
> COLLATION for each column of each table.
>
> The only way (that I know of) to display mixed
> languages (eg Chinese and German) on the same web
> page is to encode the entire page in utf8.
>
> Note: MySQL's "utf8" CHARACTER SET is limited to
> 3-byte encodings. "utf8mb4" also includes 4-byte
> encodings, which Chinese is getting more and more
> of.
> Outside MySQL (eg, web pages), "utf8" refers to
> encodings of all lengths.

So if I understand you correctly, I should use utf8mb4_bin so that WHERE name = 'Louie' only selects Louie, not Loúie or louie (lowercase), and then use utf8mb4_general_ci when I want a LIKE clause to return both Louie and Loúie for 'Louie' or 'louie' (lower case).

Thank you for your help. I am going to have to devote the rest of the day to reading through your site.

Options: ReplyQuote


Subject
Written By
Posted
November 15, 2014 08:55AM
Re: Charset and Collation Question
November 16, 2014 05:58AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.