Surrogate pairs? Non-BMP characters?
Hi,
I have two in-house apps using two databases under MySQL 6.0.10 (I know, I know...) which use non-BMP characters. Everything is in utf8 (the REAL, 1-to-4-byte utf8 used by 6.0) and everything works nicely.
Then 6.0 disappeared, and now I am evaluating whether to move to more solid ground; this would mean stepping BACK to either 5.1 or 5.4.
Neither of them fully supports Unicode, both being limited to the BMP. This raises a few questions:
1) How are surrogates treated in 5.x partial utf8?
If they are treated as 'normal' BMP chars (i.e. each surrogate pair is stored as two 3-byte utf8 sequences instead of the standard single 4-byte sequence), that would be something: it would require an additional layer of conversion, but all in all it would be manageable. But if they are dropped or converted to garbage, this would be unacceptable, of course.
2) How are they treated in 5.x ucs2? Again, if they are considered like 'normal' BMP chars (i.e. stored and retrieved as such), OK. If they are dropped or ?-ed, nope.
3) I expect they will not be collated correctly in any supported collation (I currently use utf8_general_ci): am I right?
4) As a general indication, would you suggest ucs2 or utf8? Storage size is not an issue (texts are in Chinese, and UTF-16 is on average more compact than UTF-8 for Hanzi). I expect ucs2 to raise fewer problems with non-BMP chars, but I have never used it.
5) The databases are small (below 100 MB), use the MyISAM engine and are mostly made of textual excerpts: any reason to prefer 5.4 to 5.1? I admit this question is very general, but I jumped from 4.x to 6 and I never used any 5.x version.
6) Lastly, is there any plan to restore FULL Unicode support, sooner or later?
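To make clear what I mean by the "additional layer of conversion" in question 1, here is a minimal Python sketch of the application-side round trip I have in mind. The function names are hypothetical, and whether 5.x would actually store and return such byte sequences unchanged is exactly what I am asking; the sketch only shows the encoding itself (each non-BMP character written as a surrogate pair of two 3-byte sequences).

```python
def to_surrogate_utf8(text: str) -> bytes:
    """Encode text so that every non-BMP character becomes two 3-byte
    sequences (one per UTF-16 surrogate) instead of one 4-byte sequence.
    Hypothetical helper, not part of any MySQL API."""
    # UTF-16 represents each non-BMP code point as a surrogate pair.
    utf16 = text.encode("utf-16-be")
    # Split into 16-bit code units and treat each one as its own code point.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets the utf-8 codec emit the surrogates as 3-byte sequences.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


def from_surrogate_utf8(data: bytes) -> str:
    """Reverse of to_surrogate_utf8: rebuild real non-BMP characters."""
    # Decode while tolerating surrogate code points in the byte stream.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each surrogate pair back into one character.
    return with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


hanzi = "\U00020000"  # U+20000, a CJK Extension B character outside the BMP
stored = to_surrogate_utf8(hanzi)
restored = from_surrogate_utf8(stored)
```

With this, standard utf8 needs 4 bytes for U+20000, while the surrogate form takes 6 (two 3-byte sequences); BMP text is left byte-for-byte identical.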
I searched the forums for "surrogate(s)" and "BMP", but found nothing (beyond surrogate keys and BitMaPs, of course). I know I could just install a 5.x version and try, but maybe the experience already built up by the community can save me some time and keep me from falling into some pitfall.
Many thanks,
Maurizio M. Gavioli