Hi everybody,
I know there are a lot of complaints about optimization for dependent subqueries. This is one more. I try to give a complete test case.
The idea is to JOIN rows in a table to their successors on an index.
-- ----------------------------------------------------------------
--
-- CREATE THE TABLE AND FILL IT WITH 10'000 ROWS OF DATA
--
CREATE TABLE IF NOT EXISTS test (i INT PRIMARY KEY);
DROP PROCEDURE IF EXISTS fill_test;
delimiter //
CREATE PROCEDURE fill_test(e INT)
BEGIN
DECLARE c INT DEFAULT 1;
WHILE c <= e DO
INSERT INTO test VALUES (c);
SET c := c + 1;
END WHILE;
END //
delimiter ;
CALL fill_test(10000);
--
-- DO SOME EXPLAINS FOR DIFFERENT QUERIES
--
EXPLAIN
SELECT MIN(i)
FROM test
WHERE i > 50;
EXPLAIN
SELECT o1.i AS o1,
o2.i AS o2
FROM test AS o1
INNER JOIN test AS o2
ON ( SELECT MIN(i)
FROM test
WHERE i > o1.i ) = o2.i
WHERE o1.i = 50;
EXPLAIN
SELECT o1.i AS o1, o2.i AS o2
FROM test AS o1
INNER JOIN test AS o2
ON ( SELECT MIN(i)
FROM test
WHERE i > o1.i ) = o2.i
WHERE o1.i BETWEEN 50 AND 51;
--
-- CLEAN UP
--
DROP PROCEDURE fill_test;
DROP TABLE test;
-- ----------------------------------------------------------------
Now this is the EXPLAIN output for the three different SELECTs:
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+
| 1 | PRIMARY | o1 | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 1 | PRIMARY | o2 | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+------------------------------+
+----+--------------------+-------+--------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------+---------+---------+------+-------+--------------------------+
| 1 | PRIMARY | o1 | range | PRIMARY | PRIMARY | 4 | NULL | 2 | Using where; Using index |
| 1 | PRIMARY | o2 | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | test | index | PRIMARY | PRIMARY | 4 | NULL | 10000 | Using where; Using index |
+----+--------------------+-------+--------+---------------+---------+---------+------+-------+--------------------------+
The first two queries work as expected (they're here only for clarity): As a standalone "subquery" and as a dependent subquery with a constant value in the outer query, the select tables are even optimized away.
But look at the row for the dependent subquery in the third version: Even if the primary key seems to be used from what the optimizer tells us, the number of rows to examine is the full 10'000. Maybe I'm stupid, but I just don't see why this has to be the case (actually, the server really seems to examine all 10'000 rows in the dependent subquery for every row from o1 in the outer query, as you the real execution time suggests if you run the query without EXPLAIN).
I'd expect the server to take one row from o1, execute the dependent subquery with the current value of o1.i and then find the matching row from o2. Of course the subquery has to be executed for every single row from o1 (two rows in the above example due to the WHERE clause). But why does it need to check all 10'000 rows in the subquery? The value in the outer query is a constant for every single execution of the subquery after all, isn't it? Why can't the query be treated similarly to when it's run standalone?
I just wanted to check before I post a bug report. Maybe I miss something. But this kind of behaviour with subqueries really hurts me almost every other day.
Thanks for your thoughts/help,
Beat
Beat Vontobel
http://www.futhark.ch/mysql
Edited 2 time(s). Last edit at 03/08/2006 03:36PM by Beat Vontobel.