Re: Charset and Collation Question
Posted by: Rick James
Date: November 15, 2014 01:23PM
> I want to support pretty much every major language in all of my tables. What should my database’s charset and collation be?
CHARACTER SET utf8mb4, COLLATION utf8mb4_general_ci
Be sure to say utf8mb4 all over the place. In particular, after connecting (in your app), do
SET NAMES utf8mb4
This is assuming that the data you have is encoded as utf8. If not, please elaborate.
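If you are not sure how your existing data is encoded, a quick check outside MySQL can help. Here is a minimal Python sketch; the helper name is_valid_utf8 is made up for illustration, and the sample bytes are just the "valúe" example in two encodings:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same text, encoded two different ways.
utf8_bytes = "valúe".encode("utf-8")      # ú becomes two bytes: C3 BA
latin1_bytes = "valúe".encode("latin-1")  # ú becomes one byte: FA

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

Caveat: passing this check does not prove the data was meant as utf8 (pure ASCII passes trivially), but failing it is a strong sign that the data is in some other encoding, such as latin1.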
> with some collations because something like value may be equivalent to valúe
The *_ci collations are "Case Insensitive", but actually they also strip accents before comparing. Hence, these are all the same in a _ci collation:
value ValuE valúe
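To see roughly why those three compare equal, here is a Python sketch that folds case and strips accents before comparing. It is only an approximation for illustration: MySQL uses its own per-character weight tables, not this code, and the function name fold_like_ci is hypothetical.

```python
import unicodedata


def fold_like_ci(text: str) -> str:
    """Approximate a case- and accent-insensitive comparison key."""
    # Split accented letters into base letter + combining mark (ú -> u + U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Fold case so that 'V' and 'v' compare equal.
    return without_marks.casefold()


samples = ["value", "ValuE", "valúe"]
print({fold_like_ci(s) for s in samples})  # {'value'}
```

All three samples collapse to the same key, which mirrors how a _ci collation treats them as equal. The exact equivalences differ from one collation to another, so check the real collation for anything beyond simple cases.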
> u to be equivalent to ú
All the Western European languages work pretty much the same for all the accents (as far as I know). The blog Peter pointed you at gives a list of equivalences for many of the utf8 collations. The utf8mb4 collations should be identical (in that area).
I suggested utf8mb4, not utf8, because Chinese was in your list. (See the section in that blog.)
> And if I wanted to support the languages above but NOT allow u to be equivalent to ú, which charset and collation should I choose?
Do you have an example of such a language? I need to learn about it.
The utf8mb4_bin collation compares the binary value of each character -- no case folding, no accent stripping, etc. Hence the 3 "values" above all compare as different.
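A binary comparison can be pictured as comparing the encoded bytes directly. This Python sketch shows the idea using the three strings "value", "ValuE" and "valúe"; it is an illustration only and ignores MySQL details such as trailing-space handling.

```python
samples = ["value", "ValuE", "valúe"]

# Encode each string to its UTF-8 bytes; a binary comparison looks only at these.
encoded = [s.encode("utf-8") for s in samples]

# No two byte strings are equal, so all three values are distinct.
distinct = set(encoded)
print(len(distinct))  # 3

# Ordering follows byte values: uppercase 'V' (0x56) sorts before lowercase 'v' (0x76),
# and 'u' (0x75) sorts before the first byte of 'ú' (0xC3).
print(sorted(encoded))  # [b'ValuE', b'value', b'val\xc3\xbae']
```

Note that the ordering is by byte value, which is rarely the ordering a human reader expects.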
Collation is for determining equality (SELECT ... WHERE x = '...') and for ordering (ORDER BY x).
It is probably best to use utf8mb4_general_ci for all your strings. Then, if you have exceptions (u not always equal to ú), handle them either in your app or by specifying the collation to use in the SQL statement. Caution: overriding the collation in a query tends to prevent the use of INDEXes, and is therefore slower.
MySQL allows a different CHARACTER SET and COLLATION for each column of each table.
The only way (that I know of) to display mixed languages (eg Chinese and German) on the same web page is to encode the entire page in utf8.
Note: MySQL's "utf8" CHARACTER SET is limited to encodings of at most 3 bytes per character. "utf8mb4" also includes 4-byte encodings, which a growing number of Chinese characters need.
Outside MySQL (eg, web pages), "utf8" refers to encodings of all lengths.
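To check whether a given string needs utf8mb4, you can look at how many bytes each character takes in UTF-8. A small Python sketch follows; the helper names are hypothetical, and U+2070E is just one example of a Chinese character outside the 3-byte range.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# ASCII letter, accented letter, common Chinese character, rarer Chinese character.
for ch in ["u", "ú", "中", "\U0002070E"]:
    print(repr(ch), utf8_length(ch))  # byte counts: 1, 2, 3, 4

print(needs_utf8mb4("中文 Größe"))        # False: fits in MySQL's 3-byte "utf8"
print(needs_utf8mb4("\U0002070E"))        # True: requires utf8mb4
```

Any string for which needs_utf8mb4 returns True cannot be stored intact in a MySQL "utf8" column.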