MySQL Forums
Forum List  »  Performance

Performance of ON DUPLICATE KEY and other options?
Posted by: Mark Modrall
Date: September 23, 2005 09:20AM

Hi...

I have an app tracking user submissions (basically a user id, time, count, and text input). Since there's going to be some overlap between users typing in the same input, I thought I'd save time and space by having the input in a table with

qid INT AUTO_INCREMENT,
input TEXT,
UNIQUE (input)

and a normalized table of
qid INT,
userid INT,
date DATETIME,
count INT.

I insert about 200-300K rows at a time, first doing an
INSERT INTO inputTable (input) VALUE (?in) ON DUPLICATE KEY UPDATE qid=LAST_INSERT_ID(qid),

then I insert the user stats into the other table using the qid.

Turns out that there is some considerable overlap in user input; after loading a month of data, there are 5.4M rows in inputTable and 10.5M rows in userTable.

The thing that puzzles me more is that while doing my table loading, inserting those 5.4M rows in inputTable takes 85% of the execution time. Inserting 10.5M rows into userTable takes about 10% of the time and about 5% goes into other processing.

I can understand how managing an arbitrary text index may be more time consuming than inserting a bunch of ints into a table without any indexes, but I was surprised to see there was such a discrepancy (inserting half as many rows takes 8x as long as inserting double the number of rows in the latter). So my questions:

1) Would there be an appreciable performance benefit to replacing ON DUPLICATE KEY UPDATE with a SELECT then either INSERT or UPDATE depending on the response in code? There are a couple of other boolean flags in the insertTable that were also updated with the ON DUPLICATE KEY statement; when I dropped a couple of those as an experiment last night, I noticed an appreciable improvement in performance. That got me wondering if the ON DUPLICATE KEY part was inefficient.

2) Would there be a big difference between declaring input as TEXT vs VARCHAR[250]? The table is also in utf-8, so there's a limit to how big I can make the VARCHAR, but if I can fit in those limits, would one representation be a big boost over another?

3) I had been thinking that for certain usages a HASH index might be more effective than a BTREE and it struck me that sifting through arbitrary text might be one of those cases. But then I saw that HASH was only available on MEMORY tables. I can understand the difficulties in figuring out how to represent a hash table on disk and how to grow it effectively, but are there going to be any more index types coming in future revs? Like the bitmap indexes from oracle or other types that might be faster in certain situations?

4) anyone have any general tips on improving performance of this type of app? I'm already LOCKing the tables before doing my inserts and that helped some. I've Prepare()d the statements and that helped a little. But I'm still running at about 25 minutes per 300K inserts, and that makes it hard to tweak and try again.

Thanks
_mark

Options: ReplyQuote


Subject
Views
Written By
Posted
Performance of ON DUPLICATE KEY and other options?
2236
September 23, 2005 09:20AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.