Performance of ON DUPLICATE KEY and other options?
Hi...
I have an app tracking user submissions (basically a user id, time, count, and text input). Since there's going to be some overlap between users typing in the same input, I thought I'd save time and space by having the input in a table with
qid INT AUTO_INCREMENT,
input TEXT,
UNIQUE (input)
and a normalized table of
qid INT,
userid INT,
date DATETIME,
count INT.
I insert about 200-300K rows at a time, first doing an
INSERT INTO inputTable (input) VALUES (?in) ON DUPLICATE KEY UPDATE qid = LAST_INSERT_ID(qid),
then I insert the user stats into the other table using the qid.
Turns out that there is some considerable overlap in user input; after loading a month of data, there are 5.4M rows in inputTable and 10.5M rows in userTable.
The thing that puzzles me more is that while doing my table loading, inserting those 5.4M rows in inputTable takes 85% of the execution time. Inserting 10.5M rows into userTable takes about 10% of the time and about 5% goes into other processing.
I can understand how maintaining an index on arbitrary text is more expensive than inserting a bunch of ints into a table with no indexes, but I was surprised by the size of the discrepancy: inserting half as many rows takes roughly 8x as long. So my questions:
1) Would there be an appreciable performance benefit to replacing ON DUPLICATE KEY UPDATE with a SELECT followed by either an INSERT or an UPDATE, depending on the result, in application code? There are a couple of other boolean flags in inputTable that were also updated by the ON DUPLICATE KEY clause; when I dropped a couple of those as an experiment last night, performance improved appreciably. That got me wondering whether the ON DUPLICATE KEY part itself is inefficient.
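One way to prototype the SELECT-first alternative: since the whole point is that inputs repeat, an application-side cache of already-seen inputs can skip the database entirely for duplicates. A hypothetical sketch (SQLite standing in for MySQL; `qid_cache` is an invented name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputTable (qid INTEGER PRIMARY KEY AUTOINCREMENT, input TEXT UNIQUE)")

qid_cache = {}  # input text -> qid, filled as rows are seen

def get_qid_cached(cur, text):
    if text in qid_cache:
        return qid_cache[text]  # duplicate: no SQL round-trip at all
    cur.execute("SELECT qid FROM inputTable WHERE input = ?", (text,))
    row = cur.fetchone()
    if row is None:
        cur.execute("INSERT INTO inputTable (input) VALUES (?)", (text,))
        qid = cur.lastrowid
    else:
        qid = row[0]
    qid_cache[text] = qid
    return qid

cur = conn.cursor()
a = get_qid_cached(cur, "foo")
b = get_qid_cached(cur, "foo")  # served from the cache, no index probe
assert a == b
```

With 5.4M distinct inputs the cache may not fit in memory for a whole month's load, but it could still absorb the duplicates within each 300K batch.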
2) Would there be a big difference between declaring input as TEXT vs. VARCHAR(250)? The table is in utf-8, so there's a limit to how long I can make the VARCHAR, but if my data fits within that limit, would one representation be a big boost over the other?
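Before committing to VARCHAR(250), it's worth checking how much of the data actually fits. The declared VARCHAR length counts characters, while index-size limits count bytes (and a utf-8 character can take multiple bytes), so both measures matter. A quick check, with an invented sample list standing in for the real data:

```python
inputs = ["short text", "déjà vu", "x" * 300]  # sample rows; real data would come from inputTable

fits     = [s for s in inputs if len(s) <= 250]              # character count vs. VARCHAR(250)
too_long = [s for s in inputs if len(s) > 250]
max_bytes = max(len(s.encode("utf-8")) for s in inputs)      # byte length, relevant for index limits

print(len(fits), len(too_long), max_bytes)
```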
3) I had been thinking that for certain usages a HASH index might be more effective than a BTREE, and it struck me that looking up arbitrary text might be one of those cases. But then I saw that HASH is only available on MEMORY tables. I can understand the difficulty of representing a hash table on disk and growing it effectively, but are any more index types coming in future revs? Like the bitmap indexes from Oracle, or other types that might be faster in certain situations?
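Lacking a native on-disk HASH index, a common workaround is to add a column holding a hash of the text, index that short fixed-width column instead of the text itself, and compare the full text only to rule out collisions. A hypothetical sketch of the idea (column name `h` and the MD5 choice are assumptions, not from the post):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inputTable (
    qid   INTEGER PRIMARY KEY AUTOINCREMENT,
    h     TEXT,   -- MD5 of input: a short fixed-width key to index instead of the text
    input TEXT)""")
conn.execute("CREATE INDEX idx_h ON inputTable (h)")

def md5(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def get_qid_by_hash(cur, text):
    h = md5(text)
    # The hash narrows the search; comparing the full text rules out collisions.
    cur.execute("SELECT qid FROM inputTable WHERE h = ? AND input = ?", (h, text))
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute("INSERT INTO inputTable (h, input) VALUES (?, ?)", (h, text))
    return cur.lastrowid

cur = conn.cursor()
q1 = get_qid_by_hash(cur, "some long user input")
q2 = get_qid_by_hash(cur, "some long user input")
assert q1 == q2
```

The index then stores 32-character hex strings rather than arbitrary-length text, which can shrink it considerably when the inputs are long.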
4) Does anyone have any general tips on improving performance of this type of app? I'm already LOCKing the tables before doing my inserts, and that helped some. I've Prepare()d the statements, and that helped a little. But I'm still running at about 25 minutes per 300K inserts, and that makes it hard to tweak and try again.
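One general tip that usually helps at this scale is batching: many rows per statement (or at least per transaction) instead of one round-trip and one commit per row. A minimal illustration of the transaction-batching half, again with SQLite standing in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userTable (qid INT, userid INT, date TEXT, count INT)")

rows = [(i % 100, i, "2005-09-23 09:20:00", 1) for i in range(10_000)]

# One transaction for the whole batch: avoids a commit (and disk flush) per row.
with conn:
    conn.executemany("INSERT INTO userTable VALUES (?, ?, ?, ?)", rows)

n = conn.execute("SELECT COUNT(*) FROM userTable").fetchone()[0]
print(n)  # → 10000
```

In MySQL the statement-batching half would be multi-row INSERT ... VALUES (...), (...), ... syntax, which cuts per-statement overhead the same way.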
Thanks
_mark
Subject: Performance of ON DUPLICATE KEY and other options?
Views: 2236
Posted: September 23, 2005 09:20AM