MySQL Forums
Forum List  »  Performance

Re: Sporadic Query "Freeze" with InnoDB, highly concurrent workload
Posted by: Steve Ramage
Date: February 25, 2014 11:28PM

Hi Rick,

> My feeling on "MySQL as a queue" -- "Don't queue it, just do it!"

>The effort and cost of enqueuing and dequeuing tasks is often higher than the actual work being done. Hence MySQL slows down the process and consumes resources.

Thank you for responding, let me address your concerns head on. Yes I'm aware of the problems of using MySQL as a queue, and the common pitfalls and I believe I have carefully avoided them. In our case MySQL is the only option, as the application in question is used for collecting data for research experiments. In our case the tasks are actually generally very expensive although it fluctuates and ranges anywhere from 1 second to 3 or 4 hours for a single task, although it depends on the specific data being collected.

Next let me explain the set up a bit better, this a problem that happens every so often on two different servers, but I haven't been able to reproduce it because on the former server, we need to get data as soon as possible and the workers are going to die soon (they are jobs scheduled on a highly contended scientific cluster, and so if we lose them we have to wait hours or days to get them back). The new server, where this problem occurred most recently is brand new, and has zero contention and so I can investigate it more. It also has *NOTHING* else executing.

>How many "workers" are there? Do they hang onto a transaction for the duration of their "work"? That is, are they locking the queue _and_ the other tables?

The workers can range anywhere from one to a thousand and pull runs in batches, they will generally poll every so often and take some number of runs before continuing, if they finish quickly they will hold off some time before getting new runs. As runs progress they will update the affected row (they only process one row at a time), in a query accessed by primary key. There are no multi statement transactions, and there are essentially no other tables of concern here (there are only 5 in the table, three of which are diagnostic only, and one of which is only ever queried sporadically). There also are zero locks on the queue, all the contention magic that happens is in that miracle UPDATE clause.

> How big does the queue get to? If the table gets especially big, then the following points could be important:

The queue can get upwards of millions of elements which is fine, eventually the table will no longer be used (the tables are built for a particular experiment and data collection event and then stay there, the tables also serve as a cache so I can't move results to another table). The table at this point has the followings:

mysql> SHOW TABLE STATUS FROM dsmac like 'runConfigs' \G;
*************************** 1. row ***************************
Name: runConfigs
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 1722585
Avg_row_length: 780
Data_length: 1345323008
Max_data_length: 0
Index_length: 517603328
Data_free: 15728640
Auto_increment: 13452772
Create_time: 2014-02-21 05:38:03
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.15 sec)

> How many "enqueuers" are there? It might be advantageous to throttle how many threads can attempt to enqueue items.

There is generally no dedicated enquerer, sometimes there is a batch process that executes stuff in sequence. In other cases there may be, but in this specific case it is a set of scripts that gets invoked to insert things into the database, and then you run the program later when it is done (typically 4 or 5 days later).

*NOTE: THE UUID IS NOT A UUID (it is named that way for historical reasons unfortunately :( ), it is actually a SHA1 hash of the other columns computed by the application. This is how we detect a cache hit.

The SHA1 hash could be compacted but it's nice to have a human readable string at times to correlate log messages. Most other columns are pretty big so the savings of DOUBLE to FLOAT don't matter.

The server has 24 GB of RAM, and right now this is the only database on it and it's likely that these kind of databases will be the only thing on it.

> What is the value of innodb_buffer_pool_size?


Some variables are:
innodb_log_file_size: 1073741824
innodb_buffer_pool_size is: 17179869184

Additionally the binlog is entirely off, since point in time data or recovery is generally useless and most users (this is part of a library I maintain) would lack the sophistication to write applications that would benefit from a bin log, and would simply restart the experiment if something went wrong.

> > PRIMARY KEY (`runConfigID`),
> UNIQUE KEY `runConfigUUID` (`runConfigUUID`),

> A PRIMARY KEY is a UNIQUE KEY, so the second index is totally redundant, and costs some effort on both INSERT and DELETE.

The Unique key is necessary to ensure cache hits (because inserts are done in batches we need a way of knowing what we would submit. Previous versions of the application (and indeed why it is called a UUID) used the UUID as a primary key but this was terrible for performance for a few reasons. One all indexes in INNODB would have the 40 bytes hanging off the end of it, and the second is that I noticed when using it as a key runs would be taken neither in FIFO order or key order but through some weirdo distribution. Tasks that are submitted to the DB are submitted in batches (the million or so tasks in the db right now are from about 250 different batches of tasks, that all need to be completed for it to be useful). When keys are sampled in FIFO order, if someone inserts something knew, you essentially have to wait for all runs to be finished instead of them progressing in some predictable order.

Additionally runs are never deleted from the table, (the tables lifetime is only about a month at which point it is never written to again).

> Four secondary keys start with `status`. Is there an UPDATE that changes STATUS? If so, that would be like changing 5 tables (the table, plus the 4 secondary indexes). Rethink whether you need all of them. Run EXPLAIN SELECT... to see if all of them are actually used.

> KEY `status2` (`status`,`priority`),

Yes all the keys are actually used, and they exist as covering indexes for all the queries so that they don't have to go to the table. The first is the standard query to get access to general progress, as well as for updating workers. I believe the second is a covering index on the UPDATE which I'm not sure makes sense, it's also possible that the query it existed for doesn't matter. The third is used to check whether workers can make any progress in the time they have left, and the last to find jobs that have disappeared. There are UPDATES that change status, including the one causing the problem.

The status2 key is the one used in the actual query that's the problem if you look at the EXPLAIN.

> > WHERE status="NEW" AND priority="HIGH" ORDER BY runConfigID LIMIT 5

would benefit from
INDEX(status, priority, runConfigID)
instead of what you have. (There is some dispute as to whether that is implicitly what the key is.)


My assumption is that that is what the key actually is. Is it not?

>I think your big UNION can be replaced by
WHERE status="NEW" ORDER BY priority DESC, runConfigID
or perhaps
WHERE status="NEW" ORDER BY 0+priority DESC, runConfigID


I don't think I can do that. My understanding of MySQL is that only one column can be used in an ORDER BY, by an INDEX. Since 99% of the time there are only jobs of one priority, it's essentially the same as not using the index for sorting when it is perfect for sorting. I also thought that you can't do DESC in one and then ASC in another without there being major pain, but that might be fixable.

Finally let me elaborate a bit on the issue as I see it. The queue stuff is irrelevant. When this problem occurs there are ZERO other connections to MySQL, Some rows with status="NEW", some rows with status="COMPLETED". If I then in a bash script execute the above query (this is the experiment I ran) via the mysql command line mysql -u.... < deque.sql (and the contents of the file are just that single query):

UPDATE runConfigs A JOIN (
SELECT runConfigID, priority FROM (
(SELECT runConfigID,0 AS priority FROM runConfigs WHERE status="NEW" AND priority="LOW" ORDER BY runConfigID LIMIT 5) UNION ALL
(SELECT runConfigID,1 AS priority FROM runConfigs WHERE status="NEW" AND priority="NORMAL" ORDER BY runConfigID LIMIT 5) UNION ALL
(SELECT runConfigID,2 AS priority FROM runConfigs WHERE status="NEW" AND priority="HIGH" ORDER BY runConfigID LIMIT 5) UNION ALL
(SELECT runConfigID,3 AS priority FROM runConfigs WHERE status="NEW" AND priority="UBER" ORDER BY runConfigID LIMIT 5) ) innerTable
ORDER BY priority
DESC LIMIT 5 ) B
ON B.runConfigID=A.runConfigID SET status="ASSIGNED", workerUUID="b4572d-8448-4729-bf7d-9d1da100dfa1", retryAttempts = retryAttempts+1;

Then 99.999% of the time that query takes 0.01 seconds, and then occasssionally and temporarily it takes like 2 or 3 seconds. If I get rid of the backlog of queued items, this situation will persist until something internal to the table or mysql changes. Typically I have fixed it with an OPTIMIZE TABLE and then restart everything and it hums along again. In the case the other day, I was able to simply take the runs that were labelled ASSIGNED and move them back to NEW but after a few cycles of doing this, they dropped back down to 0.01 seconds. This happens on two different SQL servers on two very different hardware environments on two different continents (at two different, unaffiliated universities).

There is seemingly no rhyme or reason why all of a sudden the query takes a long time or why the performance problem disappears soon after. So I am looking for advice as to what to look at while this is happening.

Thanks,

Steve



Edited 1 time(s). Last edit at 02/25/2014 11:29PM by Steve Ramage.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: Sporadic Query "Freeze" with InnoDB, highly concurrent workload
1039
February 25, 2014 11:28PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.