Re: Demanding Query
Hi Jay,
Your thoughts are much appreciated, thank you. Some comments...
I added the body text hash column this afternoon, and have finally managed to get something that is performant. Woohoo! I agree that 200 might be enough for deduping purposes, I'll put that to my client. Unfortunately we don't have sophisticated enough parsing to get at author, article and date, but I agree this would be good deduping criteria. You hit the nail on the head though - we're basically trying to eliminate syndicated content with our body text match.
Interestingly, when I was analysing the query, I found that thing were slow even once I got it using indexes (using union queries and hash fields). The *main* time eater was on the following clause...
"or ( body_hash = md5(left(?bodytext,900)) )"
If I pre-calculated the "md5(left(?bodytext,900))" in a separate statement, rather than inside the main query, the performance was much much better. Query time went from 6.5 seconds to 0.3 seconds following this tweak. I wandered if MySql was evaluating it for each row?
Your thoughts on funneling were interesting, and I'm pleased you said that because it's actually something we do already. Always good to hear someone confirm your own opinion! We have an initial dedupe check, and then a final one before the insert. We don't do it in a cron job though - new articles go into an in-memory queue and are inserted by background processes that chomp away at the queue continually.
Having used MSSQL 2000, I do miss the depth of information you can get explaining queries. MySql has a very simple explain plan. Also, MSSQL can use multiple indexes in a single query, which is quite handy. It will be interesting to see what MySql 5.0 brings!
Thanks for thoughts on InnoDB vs MyISAM. I think it will be worth trying - we don't do many updates or deletes after the original insert. We *dp* lots of SELECTS and INSERTS simultaneously, but from what you say this might be handled well my MyISAM.
Again, thank you very much for you help on this.
Tobin