MySQL Forums
Forum List  »  Performance

Re: Another High CPU Usage Problem (long post!)
Posted by: Jay Pipes
Date: July 31, 2005 06:25PM

Tobin,

I have a couple thoughts on why you're facing the situation you are; you may not like some of them, and as with anything InnoDB-related, try to get a second opinion from an Innobase Oy developer.

1) As far as your server variables go, everything looks pretty good except for one variable which you might try changing. In the SHOW INNODB STATUS output, it is evident that the asynchronous I/O system in Windows NT (which is unbuffered) is causing much of the contention in the system.

The sections which indicate I/O contention are the SEMAPHORES and the FILE I/O sections. In the SEMAPHORES section, you see that, even though the server has been up and running only for a short amount of time, that all the counters are extremely large. A semaphore is a type of binary mutually-exclusive lock (thus the alien-sounding term mutex). The high counter numbers indicate that both InnoDB and Windows are competing for a shared resource, and that much of the time is being spent by InnoDB (and the CPU) in negotiating this waiting-for-resource game.

This is likely due to the fact that you have a continous spidering application that isn't allowing the I/O subsystem to "catch up". Try reducing the value of innodb_thread_concurrency to 2 instead of 4 and see if that alleviates some of the pressure.

If it doesn't, there are a couple other things to consider. First, you may wish to have your spidering process focus on the *work of spidering*, that is grabbing as many records as it can in as short amount of time as possible. This means separating the spider's run-implementation into the spider, and then an analysis program (which would be responsible for de-dupping and other cleanup). I have used this strategy (of focusing various components or programs on a single task rather than multiple processes) and it seems to be a cleaner and more efficient approach than a large program which tries to do everything at once.

If you can separate the spider's implementation to simply inserting records into the database, then you might consider using a MyISAM table for the "dump records". You could have a separate program (or multi-threaded program) which does analysis by taking a chunk of the dump records and running a scan against them doing the de-duplication and other clean up. Have the analysis program process records from the dump table in batches, instead of a steady stream-based analysis. This will break the database work into reasonable chunks and allow the I/O subsystem to breathe a bit. Architecting the spider/analysis programs in this way also will allow you to more easily maintain or swap out components in the future without affecting the work of the other...

2) As far as the queries go, yes, it's true that the MD5() function utilizes a good amount of CPU power, but I highly doubt that is the main issue (though MD%ing 9800 characters probably doesn't help performance...you've already heard my opinion on that ;) ); by far the bigger problem is your use of the RLIKE construct. This part of your UNION query:

select id
from articles
where ?headline rlike '([^a-zA-Z0-9]|^)(update|updated|revised|version)[^a-zA-Z0-9]'
and headline like ?headline

will never use an index if the LIKE expression is of type "%value%", and instead will perform a table scan across every article record, performing the regular expression match for each record. Especially for a spidering application, this means disaster from a scalability perspective.

Unfortunately, there are currently a number of performance issues related to using regular expression matching in SQL statements. The number one issue is that using RLIKE will *completely* eliminate the possibility of an index being used for the SQL statement. Period. The second issue is that, unlike other programming languages, such as C/Perl/PHP, regular expressions run in MySQL are *not* compiled and stored for repeated use by the calling program. When you do a preg_replace() for instance in PHP, the PCRE compiler will store an execution plan for the regular expression for use the next time a program uses the same search expression. This saves a lot of processing time for repeated calls. This, unfortunately, is not the case for MySQL regular expressions. I would avoid using them entirely.

Based on the regular expression, it seems that you are storing some sort of status identifier in the headline field. It would be more efficient to have separate fields storing this status information about the article. Consider removing these status tokens from the headline and placing them in separate ENUM or SET fields.

Jay Pipes
Community Relations Manager, North America, MySQL Inc.

Got Cluster? http://www.mysql.com/cluster
Personal: http://jpipes.com

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: Another High CPU Usage Problem (long post!)
2652
July 31, 2005 06:25PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.