Surrogate pairs? Non-BMP characters?
Posted by: Maurizio Gavioli
Date: July 31, 2009 04:29AM


I have 2 in-house apps using 2 DBs under MySQL 6.0.10 (I know, I know...) which use non-BMP characters. Everything is in utf8 (the REAL, 1-to-4-byte utf8 used by 6.0) and everything works nicely.

Then 6.0 disappeared and now I am evaluating whether to move to some more solid ground; this would mean stepping BACK to either 5.1 or 5.4.

Neither of them fully supports Unicode, both being limited to the BMP. This raises a few questions:

1) How are surrogates treated in 5.x partial utf8?

If they are treated as 'normal' BMP chars (i.e. each non-BMP character stored as a utf8 equivalent of two 3-byte sequences, one per surrogate, instead of the standard single 4-byte sequence), that would be something: it would require an additional layer of conversion, but all in all it would be manageable. But if they are dropped or converted to garbage, this would be unacceptable, of course.

2) How are they treated in 5.x ucs2? Again, if they are considered like 'normal' BMP chars (i.e. stored and retrieved as such), OK. If they are dropped or ?-ed, nope.

3) I expect they will not be collated correctly in any supported collation (I currently use utf8_general_ci): am I right?

4) As a general indication, would you suggest ucs2 or utf8? Storage size is not an issue (texts are in Chinese and UTF-16 is on average more compact than UTF-8 for Hanzi). I expect ucs2 to raise fewer problems with non-BMP chars, but I have never used it.

5) The databases are small (below 100 MB), use the MyISAM engine and are mostly made of textual excerpts: any reason to prefer 5.4 over 5.1? I admit this question is very general, but I jumped from 4.x to 6 and I never used any 5.x version.

6) Lastly, are there any plans to restore FULL Unicode support, sooner or later?
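
To make question 1 concrete, the "additional layer of conversion" I have in mind could look roughly like the Python sketch below. It is only an illustration: the function names are made up, and whether 5.x actually accepts and returns 3-byte surrogate sequences unchanged is an untested assumption (it is exactly what I am asking).

```python
def to_surrogate_utf8(text: str) -> bytes:
    """Encode text so that every non-BMP character becomes two 3-byte
    sequences (one per UTF-16 surrogate) instead of one 4-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the code point into a high/low surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def from_surrogate_utf8(data: bytes) -> str:
    """Reverse of to_surrogate_utf8: re-pair surrogates into real characters."""
    # Decode while tolerating surrogates, then let UTF-16 join each pair.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    return with_surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

With this, a non-BMP Hanzi such as U+20BB7 would be sent to the server as 6 bytes (two 3-byte sequences) rather than the standard 4, and converted back after retrieval; BMP text passes through unchanged.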

I searched the forums for "surrogate(s)" and "BMP", but found nothing (beyond surrogate keys and BitMaPs, of course). I know I could just install a 5.x version and try, but maybe the experience already built up by the community can save me some time and keep me from falling into some pitfall.

Many thanks,

Maurizio M. Gavioli
