MySQL Forums
Forum List  »  Knowledge Base

Re: ow to remove  in the SQL database
Posted by: T Delano
Date: January 29, 2009 11:01AM

Apparently php doesn't support utf8?!? adding the following before a DB insert fixes the issue:

mysql_query("SET NAMES 'utf8'");

UTF-8 uses one or more 8-bit bytes to store a single character, unlike ASCII and friends which use only one byte per character. It is more space-efficient than its cousins (UTF-16, UTF-32) when the majority of the characters can be encoded as a single byte, as is the case with most English text, but with the added benefit that you can still store any character under the sun should you need to. It uses the most significant bits of each byte as continuation bits (to signify that the following byte(s) form part of the same character). It is for this reason that improperly-displayed UTF-8 results in weird characters.


As an example, let's take a pound sign (a real pound sign for you non-British types who call a hash a pound). In ISO 8859-1, the £ character has an ordinal value of 163 (0xA3 in hex) and by coincidence (or not), its Unicode code point is U+00A3. However, UTF-8 cannot store values above 127 in a single byte -- the encoding demands we use two. Omitting the grizzly details of the actual encoding process, you end up with the 2-byte sequence 0xC2A3, which just happens to correspond to the string "£" when expressed in ISO 8859-1.


MySQL needs to know the character set of the data you are sending to it in your queries. The default connection character set is ISO 8859-1 (latin1) -- treating all your supplied data as if it was ISO 8859-1, which is then automatically converted to the underlying character set of the column (UTF-8 in our case). Taking our earlier example, this means that the two-byte pound sign is perceived as two ISO 8859-1 characters: Â and £. MySQL will then encode these characters separately as UTF-8, requiring 2 bytes each -- a sort of double-encoding. We end up using 4 bytes to store a single pound sign! When selecting data from the table, the reverse occurs -- MySQL converts the UTF-8 back into ISO 8859-1, and the user's browser (correctly) interprets the two bytes as a pound sign. The problem here is that while everything looks correct, it's needlessly using extra storage space and CPU cycles in the conversion.

Options: ReplyQuote

Written By
Re: ow to remove  in the SQL database
January 29, 2009 11:01AM

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.