MySQL Forums
Forum List  »  Newbie

Re: Charset and Collation Question
Posted by: Rick James
Date: November 15, 2014 01:23PM

> I want to support pretty much every major language in all of my tables. What should my database’s charset and collation be?

utf8mb4, utf8mb4_general_ci

Be sure to say utf8mb4 all over the place. In particular, after connecting (in your app), do
SET NAMES utf8mb4
This is assuming that the data you have is encoded as utf8. If not, please elaborate.

> with some collations because something like value may be equivalent to valúe

The *_ci collations are "Case Insensitive", but actually they also strip accents before comparing. Hence, these are all the same in a _ci collation:
value ValuE valúe

> u to be equivalent to ú

All the Western European languages work pretty much the same for all the accents (as far as I know). The blog Peter pointed you at gives a list of equivalences for many of the utf8 collations. The utf8mb4 collations should be identical (in that area).

I suggested utf8mb4, not utf8, because Chinese was in your list. (See the section in that blog.)

> And if I wanted to support the languages above but NOT allow u to be equivalent to ú, which charset and collation should I choose?

Do you have an example of such a language? I need to learn about it.

utf8mb4_bin collation compares bits -- no case folding, no accent stripping, etc. Hence the 3 "values" compare as different.

Collation is for determining equality (SELECT ... WHERE x = '...') and for ordering (ORDER BY x).

It is probably best to use utf8mb4_general_ci for all your strings. Then if you have exceptions (u not always equal to ú), you handle it either in your app, or by specifying the collation to use when performing the SQL -- caution: this tends to prevent use of INDEXes, hence be slower.

MySQL allows a different CHARACTER SET and COLLATION for each column of each table.

The only way (that I know of) to display mixed languages (eg Chinese and German) on the same web page is to encode the entire page in utf8.

Note: MySQL's "utf8" CHARACTER SET is limited to 3-byte encodings. "utf8mb4" also includes 4-byte encodings, which Chinese is getting more and more of.
Outside MySQL (eg, web pages), "utf8" refers to encodings of all lengths.

Options: ReplyQuote


Subject
Written By
Posted
November 15, 2014 08:55AM
Re: Charset and Collation Question
November 15, 2014 01:23PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.