character equality under collation should not be used to test uniqueness
I was going to post this as a bug, but then I thought that it deserves discussion first.
Perhaps this should be posted as a reopening of bugs 6786 and 10684 (and others?) where
the issue was closed as "not a bug".
(I'm not an expert on collations; consider this an earnest suggestion.)
The comparison of character strings for uniqueness should not use equality under
the rules of the collation sequence as a deciding factor in the contexts of column values,
keys or constraints.
Most often, this shows up when accented characters are used in a column which is subject
to the UNIQUE keyword or used in a KEY or CONSTRAINT clause. In these contexts, the
associated "binary" collation should be used in the test for equality.
(I don't believe that this should be dismissed as a side effect of MySQL only using
the primary level weighting of the Unicode Collation Algorithm. latin1 has the same problem.)
Using MySQL 5.0.20 on SuSE 9.3.
How to repeat:
---
mysql> create database vtest default character set utf8;
Query OK, 1 row affected (0.00 sec)
mysql> connect vtest;
Connection id: 44
Current database: vtest
mysql> set names 'utf8';
Query OK, 0 rows affected (0.00 sec)
mysql> set character set 'utf8';
Query OK, 0 rows affected (0.00 sec)
mysql> create table v (vdata varchar(80) not null) engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into v values ('naive');
Query OK, 1 row affected (0.00 sec)
mysql> insert into v values ('naïve');
Query OK, 1 row affected (0.00 sec)
mysql> select * from v;
+--------+
| vdata |
+--------+
| naive |
| naïve |
+--------+
2 rows in set (0.00 sec)
mysql> select distinct vdata from v;
+-------+
| vdata |
+-------+
| naive |
+-------+
1 row in set (0.00 sec)
mysql> select distinct binary(vdata) from v;
+---------------+
| binary(vdata) |
+---------------+
| naive |
| naïve |
+---------------+
2 rows in set (0.00 sec)
mysql> select * from v where vdata = 'naive';
+--------+
| vdata |
+--------+
| naive |
| naïve |
+--------+
2 rows in set (0.00 sec)
mysql> create table v2 (v2data varchar(80) unique not null) engine=InnoDB;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into v2 values ('naive');
Query OK, 1 row affected (0.00 sec)
mysql> insert into v2 values ('naïve');
ERROR 1062 (23000): Duplicate entry 'naïve' for key 1
mysql> create table v3 (v3data varchar(80) not null, constraint v3.uniq unique (v3data)) engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into v3 values ('naive');
Query OK, 1 row affected (0.04 sec)
mysql> insert into v3 values ('naïve');
ERROR 1062 (23000): Duplicate entry 'naïve' for key 1
---
Suggested fix:
---
The uniqueness tests in v2 and v3 should use utf8_bin instead of uft8_general_ci.
Note that the utf8_general_ci collation is used for comparisons under the SELECT
DISTINCT keyword and the WHERE clause.
---
- Greg Kresko (Greg dot Kresko at nrc-cnrc dot gc dot ca)