Re: 4.1 ten times slower when using utf8 charset
Posted by: David Shanahan
Date: March 05, 2005 10:33PM

Harrison,

Thank you for your reply.

I understand that there would be a difference with utf8 but I wasn't expecting such a big difference. I
do appreciate that being correct might require more work but I wonder if a 10X difference might mean
there is some room for tweaking.

In the system in question, the queries and schema are basically simple.

I have a table with about 100,000 records where each record has five 'text' fields and some other smaller varchar
fields. Each text field has between 0-4K of text in it. The queries search for keywords in any or all of the
fields and use this basic construct.

WHERE .... ((column IS NOT NULL) AND (column LIKE '%keyword%')) ...

These are combined using 'OR' for each column and then 'AND' for each keyword. So there might be
30 or so of these constructs.

So as you can see, the query time will be sensitive to the time it takes to do pattern matching on strings
since it is just a brute force search doing lots of string comparisons.

I know that something like full-text indexing would make this a lot - lot - lot faster but we can't use it because
Japanese data doesn't have spaces between words. You just have to know where words start and end.
A utility like Chasen http://chasen.naist.jp/hiki/ChaSen/ (Japanese) can do word separation on Japanese text.

I wonder if using INSTR() or LOCATE() would be faster. I will experiement with those to see if
it makes a difference.

Incidently, we found that PostgreSQL 7.4 was about three times slower than Mysql for the same queries / same data.
That is why we are using Mysql :)

As for sorting, Japanese doesn't really have an intuitive order for characters so it is not that important.
(Just think, could you remember the order of 3000+ characters?) So we are getting away with using latin1
even though it is not strictly correct.

But I would like to use utf8 because it would allow us to convert data on the fly to Shift JIS for
connecting Windows clients directly to the database.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: 4.1 ten times slower when using utf8 charset
2830
March 05, 2005 10:33PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.