MySQL Forums
Forum List  »  Performance

Re: Demanding Query
Posted by: Jay Pipes
Date: July 24, 2005 09:39AM

Tobin Harris wrote:
> This query is a duplication check. I'm checking on
> the first 900 characters of the body because the
> business rules dictate that an article is a
> duplicate if it's body text matches that of an
> existing article - the first 900 characters is
> considered a good enough match.
>
> The query is determines if an article should be
> inserted, and is execute by the application (0-3
> times per second). I would have done implemented
> this logic using a unique key, but the unique
> rules vary depending from web site to web site
> (for example, some websites use the same url to
> publish new articles each week, so url isn't
> always present of this dupe check query).

I see. Then, in that case, yes, you would do well to hash the first X number of characters, though I dare say that 100 or 200 characters would be more than enough. The chances that two articles have more than the first 200 characters the same is pretty darn slim. From a de-duplication perspective, however, this may be a poorer indicator of repeated records than a date/author/title combination, if you have that information available. Of course, you would then need to cross-check that syndicated articles are properly handled.

If you are attempting de-duplication from a CRON job script (or something other scheduled job, which it sounds like you are) perhaps the most efficient approach is to issue a single request for existence for each "level" of possible duplication, starting from the highest level of uniqueness, and going downward.

For instance, let's say that you determine the *best* possible duplication check is on the title and url hash, in your script, you'd first check for existence against these two criteria. If existence is determined, the script would continue to the next article entry. If a match was not found, the script would move on to the next logical test for duplicates, perhaps the title and the first 200 characters hash. If existence was found, you'd move on to the next record, if not, you'd try another check.

In this way, you take a "funnel" approach to de-duplication; that is, you eliminate the most common tests first, and only test the less efficient or least likely conditions if all else fails. This structure of programming lends itself to clearer code, more efficient SQL (because only a single statement or test is executed at once), and more maintainable code (because you can comment each test appropriately, and add new tests should one be required). Finally, because the script is being executed in a job (as opposed to multiple web page requests), you aren't requiring repeated database connection calls, which are expensive.

> I forgot to mention that I'm using the InnoDB
> engine for this table (so no .myd file). I chose
> that because the table is subject to many lookups
> and inserts simaltaneously, and I understand that
> InnoDBs granular locking is more suited to such an
> environment. I may be wrong though!

There are numerous reasons to use InnoDB; however, there is a common misunderstanding regarding InnoDB and simulataneous INSERT operations. Many folks are under the impression that because of InnoDB's MVCC pattern, which implements row-level locking to maintain read consistency during UPDATE operations, that InnoDB should be used for high-transaction databases. This isn't *necessarily* the case. Both the MyISAM and InnoDB engines have the ability to do concurrent INSERT operations while simulatanously reading information through SELECT statements. MyISAM actually will do this quite a bit faster, especially for auto-incrementing primary keys. This is not because InnoDB is inherently slower; just that the MyISAM storage engine isn't concerned with transaction-safe operations, and thus has "less to do" than InnoDB does.

So, unless you require transaction-safe operations, or need enforced foreign key constraints, I think you'll find MyISAM to be very performant. However, if you are *also* experiencing high *UPDATE/DELETE* operations, InnoDB might be a better bet because of the row-level locking you mentioned. I suspect, however, that this is not the case, as you probably are not changing the content of articles; you are merely aggregating the content.

Jay Pipes
Community Relations Manager, North America, MySQL Inc.

Got Cluster? http://www.mysql.com/cluster
Personal: http://jpipes.com

Options: ReplyQuote


Subject
Views
Written By
Posted
2532
July 23, 2005 11:31AM
1792
July 23, 2005 01:52PM
1696
July 24, 2005 05:35AM
Re: Demanding Query
1661
July 24, 2005 09:39AM
1726
July 24, 2005 11:44AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.