Liberating Dependent Subqueries
Ever wonder why MySQL takes a really long time to execute a query containing a dependent subquery, even when the subquery is obviously not really "dependent" on the outer query? I wrote a stored procedure "constify" to help you out with this problem. But first, let me explain.
I have an n-to-n mapping of books to tags.
Here are the books:
mysql> describe books;
+---------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+----------------+
| book_id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(50) | NO | | | |
| author | varchar(50) | NO | | | |
+---------+-------------+------+-----+---------+----------------+
Here are the tags:
mysql> describe tags;
+----------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+----------------+
| tag_id | int(11) | NO | PRI | NULL | auto_increment |
| tag_text | varchar(40) | NO | UNI | | |
+----------+-------------+------+-----+---------+----------------+
And here are the mappings:
mysql> describe book_tags;
+---------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| book_id | int(11) | NO | | | |
| tag_id | int(11) | NO | | | |
+---------+---------+------+-----+---------+-------+
Some of my users have been abusing the system and adding tags that are forbidden by my user agreement (obscene language, that sort of thing). In order not to run afoul of local laws, I need to remove all the book-to-tag mappings for prohibited tags - otherwise these tags will be shown on my website.
Let's say one of the forbidden words is "bunk". Some of my users have created tags that contain this word ("product is bunk", "total bunk"... you get the idea). I would like to run a query as follows:
DELETE FROM book_tags WHERE tag_id IN ( select tag_id from tags where LOCATE('bunk',tag_text)>0 );
(Okay, I wouldn’t really do that but let’s pretend....)
This will work, but if I am Amazon.com and I have many millions of books in my table, and many millions of mappings in the book_tags table, it will take prohibitively long: the dependent subquery is evaluated for every row in the outer query, that is, for every row in book_tags the inner query will be evaluated again:
mysql> explain select * from users_tags where tag_id in ( select tag_id from tags where LOCATE('bunk', tag_text)>0 );
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+----------+--------------------------+
| 1 | PRIMARY | book_tags | ALL | NULL | NULL | NULL | NULL | 49032352 | Using where |
| 2 | DEPENDENT SUBQUERY | tags | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using index; Using where |
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+----------+--------------------------+
Recomposing the query as a join gets me better performance, but not nearly good enough:
DELETE FROM book_tags bt USING book_tags bt, tags t WHERE bt.tag_id = t.tag_id AND LOCATE('bunk', t.tag_text)>0;
This is equivalent to the following SELECT, which has the performance as shown:
mysql> explain select * from book_tags bt, tags t WHERE bt.tag_id = t.tag_id AND LOCATE('bunk', t.tag_text)>0;
+----+-------------+-------+--------+---------------+---------+---------+----------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------+----------+-------------+
| 1 | SIMPLE | bt | ALL | tag_id | NULL | NULL | NULL | 49032352 | |
| 1 | SIMPLE | t | eq_ref | PRIMARY | PRIMARY | 4 | temp.bt.tag_id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------+----------+-------------+
This form would execute more quickly.
But we can do much better. The real issue is that the subquery here is not really dependent on any of the columns in the outer query. The subquery will return the same result every time it is evaluated. You know that. I know that. But MySQL does not know that.
What I want to be able to do is as follows:
DELETE FROM book_tags WHERE tag_id IN ( constify("SELECT tag_id FROM tags WHERE LOCATE('bunk', tag_text)>0") );
If I had this function "constify", it would evaluate the subquery and turn it into a constant expression (the list of tag_ids that containing “bunk”), a result such as "1,2,56,1000,30405,1010200,10203439". The resulting query would be even faster still. It would look like this:
DELETE FROM book_tags WHERE tag_id IN ( 1,2,56,1000,30405,1010200,10203439 );
And its performance:
mysql> EXPLAIN SELECT * FROM book_tags WHERE tag_id IN ( 1,2,56,1000,30405,1010200,10203439 );
+----+-------------+-----------+-------+---------------+--------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+--------+---------+------+-------+-------------+
| 1 | SIMPLE | book_tags | range | tag_id | tag_id | 4 | NULL | 11433 | Using where |
+----+-------------+-----------+-------+---------------+--------+---------+------+-------+-------------+
This is the performance I’m looking for. This "constify" function would help me get it.
Well, I don’t have a "constify" function (more on this below), but I do have a "constify" stored procedure, which lets me do as follows:
mysql> call constify("SELECT tag_id FROM tags WHERE LOCATE('bunk', tag_text)>0", @result);
mysql> set @query=CONCAT("DELETE FROM book_tags WHERE tag_id IN (", @result, ")");
mysql> PREPARE st FROM @query;
mysql> EXECUTE st;
mysql> DEALLOCATE PREPARE st;
[Note: above commands edited. Please see a follow-up post for improved syntax.]
Its syntax is a little messier than with a function, but it's the best we can do given MySQL's limitations (again, see below). Here’s the constify stored procedure:
CREATE PROCEDURE constify (IN query_ TEXT, OUT constStr TEXT)
MODIFIES SQL DATA
SQL SECURITY INVOKER
COMMENT 'Turns the results of the given query into a comma-separated str.'
BEGIN
-- Important notes:
-- The provided query must specify a single-column result
-- Each row in the result must be no longer than 50 characters
-- (to change this, change the declaration of oneItem below)
declare flagLastItem INT DEFAULT 0;
declare oneItem VARCHAR(50);
declare itemCursor cursor for select * from constifyResults;
declare continue handler for SQLSTATE '02000' SET flagLastItem = 1;
set @constifyResult = "";
set @sep = "";
set @query = query_;
set @createStmt = CONCAT("create temporary table constifyResults ", @query);
drop table if exists constifyResults;
prepare stmt from @createStmt;
execute stmt;
deallocate prepare stmt;
open itemCursor;
theLoop: repeat
fetch itemCursor into oneItem;
if not flagLastItem then
set @thisItem = oneItem;
set @constifyResult = CONCAT(@constifyResult, @sep, @thisItem);
set @sep = ",";
end if;
until flagLastItem = 1 end repeat;
close itemCursor;
set constStr = @constifyResult;
END
Use constify with slight caution: don’t constify queries that return too many results! The return value (the string containing the comma-separated values) is limited by the max_packet_length variable, because the result must be sent back to the client, via the socket, in a packet.
Also, be careful when you use constify to select text that might contain MySQL delimiters (single quotes, double quotes, etc.): in this case be sure to SELECT QUOTE(text_field) in your query so the text is properly treated.
The reason why I didn’t make constify a function is because dynamic SQL (prepared statements) is not allowed in functions. But, it should be possible to refactor out the dynamic SQL into a separate stored procedure and then constify could be written as a function. <EDIT> Correction: Nope, MySQL does not support calling stored procedures that contain dynamic SQL from functions (or from triggers), so the constify stored procedure provides the best calling syntax that can be managed. Also, see my follow-up post in this thread for an even better calling syntax for constify.</EDIT>
Using the constify stored procedure above I have removed the dependent subquery and turned it into a constant expression, resulting in the fastest possible execution time for this type of query.
Edited 2 time(s). Last edit at 12/05/2006 07:02AM by Shlomo Swidler.