WHERE IN() vs. BETWEEN
I've been fighting with queries on a relatively large table for the past few days and have come across behavior that I'm hoping somebody can explain to me.
The table in question is:
CREATE TABLE `t` (
`d` date NOT NULL default '0000-00-00',
`src` char(3) NOT NULL default '',
`dst` char(3) NOT NULL default '',
`tot` decimal(10,1) default NULL,
PRIMARY KEY (`d`,`src`,`dst`,`tot`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1
The table has approx. 21 million rows. Dates are always stored in YYYY-MM-01 format (dates are only applicable to months).
Here is the query:
SELECT * FROM t
WHERE
d BETWEEN "2006-01-01" AND "2006-12-01"
# d IN (
# "2006-01-01","2006-02-01","2006-03-01","2006-04-01","2006-05-01","2006-06-01",
# "2006-07-01","2006-08-01","2006-09-01","2006-10-01","2006-11-01","2006-12-01" )
AND src IN ( "aaa","bbb", ... ,"zzz" )
AND dst IN ( "aaa","bbb", ... ,"zzz" )
ORDER BY tot
LIMIT 10
A pretty simple query all in all - just returning the top 10 'tot' rows for a given date range, 'src' array and 'dst' array.
Now, for the confusion. There are large differences whether you are using BETWEEN or IN() for the date range (see the two methods in the query above - one is commented out).
When using BETWEEN the profiler says the majority of time is spent in "Sorting Result"
When using WHERE IN() the profiler says the majority of time is spent in "statistics" and "preparing"
So, if you have small arrays for 'src' and 'dst' the WHERE IN() method is much faster then using BETWEEN(). In fact BETWEEN() seems to have a constant time associated with it about about 14 seconds on my system all spent in "Sorting Data" no matter the size of the arrays.
However, once your arrays start getting large WHERE IN() becomes extremely slow and BETWEEN() becomes the much faster option.
My question is why? How is the optimizer treating these two queries so differently when the actually equate to the exact same thing?