character equality under collation should not be used to test uniqueness
Posted by: Greg Kresko
Date: April 19, 2006 04:03PM

I was going to post this as a bug, but then I thought that it deserves discussion first.

Perhaps this should be posted as a reopening of bugs 6786 and 10684 (and others?) where
the issue was closed as "not a bug".

(I'm not an expert on collations; consider this an earnest suggestion.)

The comparison of character strings for uniqueness should not use equality under
the rules of the collation sequence as a deciding factor in the contexts of column values,
keys or constraints.

Most often, this shows up when accented characters are used in a column which is subject
to the UNIQUE keyword or used in a KEY or CONSTRAINT clause. In these contexts, the
associated "binary" collation should be used in the test for equality.

(I don't believe that this should be dismissed as a side effect of MySQL only using
the primary level weighting of the Unicode Collation Algorithm. latin1 has the same problem.)

Using MySQL 5.0.20 on SuSE 9.3.

How to repeat:
---
mysql> create database vtest default character set utf8;
Query OK, 1 row affected (0.00 sec)

mysql> connect vtest;
Connection id: 44
Current database: vtest

mysql> set names 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> set character set 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> create table v (vdata varchar(80) not null) engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into v values ('naive');
Query OK, 1 row affected (0.00 sec)

mysql> insert into v values ('naïve');
Query OK, 1 row affected (0.00 sec)

mysql> select * from v;
+--------+
| vdata |
+--------+
| naive |
| naïve |
+--------+
2 rows in set (0.00 sec)

mysql> select distinct vdata from v;
+-------+
| vdata |
+-------+
| naive |
+-------+
1 row in set (0.00 sec)

mysql> select distinct binary(vdata) from v;
+---------------+
| binary(vdata) |
+---------------+
| naive |
| naïve |
+---------------+
2 rows in set (0.00 sec)

mysql> select * from v where vdata = 'naive';
+--------+
| vdata |
+--------+
| naive |
| naïve |
+--------+
2 rows in set (0.00 sec)

mysql> create table v2 (v2data varchar(80) unique not null) engine=InnoDB;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into v2 values ('naive');
Query OK, 1 row affected (0.00 sec)

mysql> insert into v2 values ('naïve');
ERROR 1062 (23000): Duplicate entry 'naïve' for key 1
mysql> create table v3 (v3data varchar(80) not null, constraint v3.uniq unique (v3data)) engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into v3 values ('naive');
Query OK, 1 row affected (0.04 sec)

mysql> insert into v3 values ('naïve');
ERROR 1062 (23000): Duplicate entry 'naïve' for key 1

---

Suggested fix:
---
The uniqueness tests in v2 and v3 should use utf8_bin instead of uft8_general_ci.
Note that the utf8_general_ci collation is used for comparisons under the SELECT
DISTINCT keyword and the WHERE clause.

---

- Greg Kresko (Greg dot Kresko at nrc-cnrc dot gc dot ca)

Options: ReplyQuote


Subject
Views
Written By
Posted
character equality under collation should not be used to test uniqueness
5805
April 19, 2006 04:03PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.