Optimizer discssion - sequence of tables VS Straight_Join?
Posted by:
J Jorgenson
Date: September 26, 2006 03:24PM
I've been inspecting a lot of explain plans lately and trying to come up with a "Generalized Rule of Efficient SQL queries" for some of our users who are writing 'custom' SQLs.
These users are beyond the basic "make sure you hit your indexes", and are starting to use 'STRAIGHT_JOIN' to force better performance from out of 60+mil record tables. (We do a lot of data-mining! No two queries are the same.)
**
** REASON for this POST:
**
** I'd like to confirm some
** basic Guide-Lines for 'STRAIGHT_JOIN'
**
A) Do a "Full Scan" on WIDE-LONG table Joined to the Smallest Table with "INDEX Only Look-up". The subsequence tables get larger.
B) Sequence your tables as
... 1) FATest LONGest (60mil) table that EXPAIN ='Using Where'
... 2) SHORTest table (2mil) EXPLAIN = 'Using Index' -- << KEY part of Guideline
... 3) 2nd SHORTest table (12mil) EXPLAIN = 'Using Index'
... 4) 3rd SHORTest table (20mil) EXPLAIN = 'Using Where'
... 5) Xth SHORTest table (30mil) EXPLAIN = 'Using Where'
... 6) 2nd LONGest table (50mil) EXPLAIN = ''
Reasoning: Jumping around in memory is faster than on disk... Scanning the Fastest/Longest table on disk (which uses disk cache), immediately followed by a 'filter' of the shortest table (using key-buffer in memory) reduces the # of table/disk lookups into the following tables,
I've seen MANY (90%) of times where the SQL optimizer will Full Scan an Index in memory and Index Look-up (jump-around) on the WIDE-LONG table, taking hours to return, where the inverse produces results in minutes.
Given tables like :
Create Table A (u_id INT, grp_id INT, dt DATE, PRIMARY KEY(u_id, grp_id)); -- 51mil
Create Table B (u_id INT, 50+ fields of data, PRIMARY KEY(u_id)); -- 20mil Records
Create Table C (u_id INT, 50+ data fields, PRIMARY KEY(u_id)); -- 12mil Records
Create Table D (u_id INT, 128+ data fields, PRIMARY KEY(u_id)); -- 7mil records
Create Table E (u_id INT, 10+ INT data fields, PRIMARY KEY(u_id)); -- 2mil Records
NOTE: In Table 'A' the 'u_id' field is not unique without a grp_id.
SELECT E.u_id
FROM A, B, C, D, E
WHERE A.u_id = E.u_id AND B.u_id = E.u_id
AND D.u_id = E.u_id AND E.u_id = E.u_id
AND A.data_field > 105.99
AND C.data_field IN (13,14)
AND D.data_field = 'ABC'
AND E.data_field > 230.99
The optimizer chooses the sequence of Largest to Smallest tables:
A type=ALL, possible_keys=none
B type=eq_ref, keys=PRIMARY, rows=1, Extra='Using Where'
C type=eq_ref, keys=PRIMARY, rows=1, Extra='Using Where'
D type=eq_ref, keys=PRIMARY, rows=1, Extra='Using Where'
E type=eq_ref, keys=PRIMARY, rows=1, Extra='Using Where'
This takes longer time to process, due to Disk Seeks.
Using the above Guide Line for STRAIGHT_JOIN, the table seqence of:
A, E, D, C, B
Produces results faster, with much less disk contention.
**
** Q) Is there something I'm miss-interpreting?
** Q) Am I over-simplfying the situation?
** Q) Why does the Optimizer (usually) sequence the tables from Largest to Smallest?
** Q) When an 'Using Index' occurs, why does that table appear last in Sequence?
**
-- J --