UTF8, latin1 and performance impact
Hi MySQL people,
Here's the story. I have 2 tables, in latin1, and a copy of these in UTF8 (since data itself is 100% English language strings for now).
Let's name the tables table_a_latin1, table_b_latin1, table_a_utf8, table_b_utf8 (utf8_unicode_ci).
These tables use MyISAM, they have about 6M row each, rows are rather long (as in 30 attributes, all strings for table_a, and 20 for table_b). Both have their PRIMARY INDEX on a int 'id', and a UNIQUE index on the first column after that: 'name' VARCHAR(10).
I have also checked table sizes:
table_a: 857M
table_a indexes: 148M for latin1 and 115M for UTF8 (how is this possible?)
table_b: 1.4G
table_b indexes: 356M for latin1 and 304M for UTF8 (...?)
I am comparing time spent on INNER JOIN.
Everything looking equal to me, I want to compare time spent on a query like:
SELECT * FROM 'table_a' tablea INNER JOIN 'table_b' tableb ON tablea.'name' = tableb.'name';
When joining the 2 latin1 tables, query is like 3 times faster than the join on their utf8 equivalent.
For instance, a SELECT COUNT(*) on the latin1 tables takes 35s, and 1min50s on the UTF8 tables.
When using EXPLAIN, the index is correctly identified, difference being that key_len is 12 for latin1 and 32 for UTF8.
I'd like to stress that the data is from English language only (at least for now). If I got it right, UTF8 doesn't use more than the one byte to encode a latin1 character. So a comparison between 2 utf8 strings should not take much longer than the one on latin1, or should it?
Considering the impact on performance (it's a somewhat frequent join), it's really important to me to understand what is happening, and if I'm doing something wrong.
Let me know if some info is missing.