Re: UTF8, latin1 and performance impact
Posted by: Rick James
Date: March 11, 2014 07:53AM

Hmmm...

> So a comparison between 2 utf8 strings should not take much longer than the one on latin1, or should it?

I agree with you.

> If I got it right, UTF8 doesn't use more than the one byte to encode a latin1 character.

Correct. (As are other facts you state.)

> key_len is 12 for latin1 and 32 for UTF8.

2 bytes for length, plus 10 or 3*10 bytes for the characters. However, if they are VARCHAR, trailing spaces are left off (most of the time), and English letters take 1 byte, not 3. Hence (most of the time), the actual strings are the same size.

> table_a indexes: 148M for latin1 and 115M for UTF8 (how is this possible?)

How were the indexes built? If table_a's index grew over time but you built the table_b all at once, then the utf8 index might simply be better compacted. Consider doing ALTER TABLE table_a... ENGINE=MyISAM to rebuild its indexes. Then check the sizes. A 30% difference is consistent with the difference between a messed up BTree versus a compacted one.

utf8_unicode_ci is quite complex to do comparisons. Still, that probably does not explain the 3x difference in speed.

> SELECT * FROM 'table_a' tablea INNER JOIN 'table_b' tableb ON tablea.'name' = tableb.'name';
> When joining the 2 latin1 tables, query is like 3 times faster than the join on their utf8 equivalent.

Run each query twice, and avoid the Query cache. This will avoid I/O overhead the first time you run it on a cold system or after the key_buffer has been chruned.

SHOW VARIABLES LIKE 'key_buffer_size'; -- It should be something like 20% of available RAM, and it would be good if it were bigger than the indexes.

> For instance, a SELECT COUNT(*) on the latin1 tables takes 35s, and 1min50s on the UTF8 tables.
> These tables use MyISAM

That's another inconsistency -- MyISAM dead-reckons COUNT(*) (without a WHERE clause), so it takes essentially no time. Please provide the actual SELECT that took that long; there is something else going on.

> Both have their PRIMARY INDEX on a int 'id', and a UNIQUE index on the first column after that: 'name' VARCHAR(10).

Why have `id` if your actions are on `name`?

Does EXPLAIN show "Using temporary"? If it needs a temp table, it will first try to use MEMORY. In MEMORY, VARCHAR turns into CHAR, so `name` would be 10 bytes for latin1 and 30 bytes for utf8. If you hit the limit for MEMORY, the table is converted to MyISAM, which would be costly. And you would hit the limit 3x sooner for utf8.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: UTF8, latin1 and performance impact
4429
March 11, 2014 07:53AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.