National Character vs UTF-8 for cross vendor database schema?
I am working on an existing product whose database schema we are in the process of internationalizing. The schema is also supported across multiple database vendors (e.g. MySQL, SQL Server 2005). The problem I am running into is whether, under MySQL, I should put the entire database in UTF-8 or only the individual columns that need to support Unicode.
Before I go any further, let me give some background:
We use ERStudio as the db design tool, and have one logical model from which we generate the individual physical models for the various database platforms. There are approximately 100+ tables in the logical model. For any column that needs to support Unicode data, we set the datatype to NVARCHAR in the logical model. When the logical model is exported to the MySQL physical model, any NVARCHAR datatype generates a column type of NATIONAL VARCHAR. In the SQL Server physical model, any NVARCHAR column remains an NVARCHAR.
By default, MySQL installs with 'latin1' as the character set. A lot of our customers have installed MySQL this way, while others have installed it with utf8 as the character set. So we have a mix of installation types for our database schema across our customers.
Anyhow, I need to support new installations of our database schema as well as convert any existing databases to the latest schema with utf8 support. I also want to reduce problems if a customer decides to go from SQL Server to MySQL or vice versa, so that the data won't get corrupted.
One quick way to do the conversion would be to convert each individual table to utf8 and then change the default character set for the database schema to utf8. For new installations, I would set the database schema to utf8 by default, which in turn would make all new tables utf8. Both of these solutions make pretty much every field in the database utf8, which would differ from my installations on the other database vendors.
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8;
ALTER DATABASE DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
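Whichever option I go with, I understand the JDBC connection also has to speak UTF-8, or the converted columns won't help. As a sketch (host and database names here are placeholders, not from our actual setup), with Connector/J that means adding the `useUnicode` and `characterEncoding` properties to the URL:

```java
// Sketch: build a Connector/J URL that forces UTF-8 on the wire,
// so the connection character set matches the utf8 columns.
public class JdbcUrlExample {
    // host and db are placeholder values for illustration
    static String utf8Url(String host, String db) {
        return "jdbc:mysql://" + host + "/" + db
             + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(utf8Url("localhost", "mydb"));
    }
}
```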
Another solution would be for the default installation of the schema to make only the NATIONAL VARCHAR columns utf8 (which happens automatically from the data type, column by column) and leave the other columns and tables at the database default. Then for the conversion, any column that needs to support utf8 would be changed with a per-column statement like this (using each column's existing type and length, e.g. VARCHAR(100)):
ALTER TABLE t1 MODIFY c1 VARCHAR(100) CHARACTER SET utf8;
And leave all the other columns alone.
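Since there are 100+ tables, I'd probably script the per-column statements rather than write them by hand. A minimal sketch of what I have in mind (the helper and the example table/column names are hypothetical, not from our schema):

```java
// Sketch: generate the per-column ALTER statement for option 2,
// given a column flagged as NVARCHAR in the logical model.
// MODIFY keeps the column in place while changing its character set.
public class AlterGenerator {
    // table, column, and length come from the exported model;
    // the values used below are illustrative only
    static String alterStmt(String table, String column, int length) {
        return "ALTER TABLE " + table
             + " MODIFY " + column
             + " VARCHAR(" + length + ") CHARACTER SET utf8;";
    }

    public static void main(String[] args) {
        System.out.println(alterStmt("customer", "name", 100));
    }
}
```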
My problem is that I don't know the drawbacks of making the entire database utf8 versus only the selected columns. If I knew we were only ever going to support MySQL, it would be a no-brainer (full utf8). But since we have to keep the option of converting from one database vendor to another, it makes the decision more difficult.
System Details:
Supported Databases: MySQL 5.0.18, SQLServer 2005
JDBC: J/Connector 5.0.7
JDK: JDK 1.5.0.12
J2EE: JBoss 4.1.2
OS: Windows XP, Server 2003
Sorry the email is so long. Any help, insights, or expert advice would be greatly appreciated.
-Aaron