Can't get REGEXP to work properly with utf8
Posted by: Oskar Joelson
Date: November 15, 2005 06:54AM

Hi!

I recently changed the collation/character set in the whole database for a project I'm working on to utf8. After a while I realized that REGEXP didn't work as it used to. I looked in the manual and saw that REGEXP was not multibyte safe. For example, if I have column called 'info' in a table 'news' and a row with the string "början" in colum 'info' and I would use these queries:

SELECT * FROM news WHERE REGEXP "början";
SELECT * FROM news WHERE REGEXP "BÖRJAN";
(I have simplified these queries so it would be more obvious, ordinarily the regexp's are much more complex, so I cannot for example use LIKE instead)

This used to give the same result, I would get the row containing the string "början" whatever case I used. However, when I changed the database to utf8 (from latin1) the second query wouldn't work. I can still do REGEXP searches with different case as long as the characters are non-"foreign" (åäö and so on).

I found this in http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html :
"REGEXP and RLIKE use the current character set (cp1252 Latin1 by default) when deciding the type of a character. Warning: These operators are not multi-byte safe."

Does this mean that REGEXP will only work with "foreign" characters if the column has a latin1 collation? Or does it mean that I can change the "current character set" to utf8 and REGEXP will work with "foreign" characters? Because I have tried that and it will still not work.
Everything is set to utf8: character_set_client, character_set_connection, character_set_database, character_set_results, character_set_server, character_set_system, and all collations are utf8. I have changed the MySQL-configuration (my.ini) so that default-character-set is utf8_general_ci (I have tried utf8_swedish_ci too) for both client and server. It wont work. But if I change collation only for the column I'm doing REGEXP's on to latin1 it works!

Am I forced to use latin1 if I want to do proper REGEXP selects? Has anyone else had this problem? I would be ever grateful for any help.

Thanks
/Oskar

Options: ReplyQuote


Subject
Views
Written By
Posted
Can't get REGEXP to work properly with utf8
5825
November 15, 2005 06:54AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.