Re: Attempt to split big table into smaller one made query slower
Posted by: Rick James
Date: March 20, 2010 10:28AM

The order of the WHERE clause components is ignored.
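
For example, the optimizer treats these two as the same query (the date constant is just a placeholder):
    SELECT COUNT(*) FROM reportGoogleKeyword WHERE clientID = 110 AND dateAdded >= '2010-01-01';
    SELECT COUNT(*) FROM reportGoogleKeyword WHERE dateAdded >= '2010-01-01' AND clientID = 110;
    -- Same plan, same result either way; what matters is which indexes exist, not the order in the WHERE.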

Roughly speaking... The query optimizer looks at the query, finds the tables being used, finds all the indexes for those tables, and then starts doing some analysis. Also, it reads the "statistics" that were previously collected for the indexes. Let's walk through things with your example:
select ... (mostly doesn't matter; needed for "Using index")
    from  reportGoogleKeyword g,
          pageKeywords2 p   (ok, 2 tables; need to decide which first)
    where  p.dateAdded >= '2010...' and  p.dateAdded <= '2010...'  (range)
      and  g.clientID = 110  (equal: prime candidate)
      and  g.dateAdded >= '2010...' and  g.dateAdded <= '2010...'
      and  p.keyword = g.keyword   (for joining)
    group by  g.position, g.keyword  (consider after WHERE)
    order by  megaSum desc, g.keyword, g.position  (consider after GROUP BY)

Caveat: What follows is a cross between how the optimizer works and how I analyze the query.

Plan A (Start with g) clientID=110 -- let's hope there is an index _starting_ with clientID. If the stats say that "most" of the table is 110, then this could turn out to be a poor choice.

Plan B (Start with p) -- well, no "=" (we can't use the JOIN condition until we are through with the first table). There is a range on dateAdded -- maybe there is an index on that.

Plan A is (so far) the better one. (I'm unclear on how far the optimizer will chase Plan B before giving up.) What next for Plan A? Let's assume there is at least one INDEX beginning with clientID. Well, we have not finished with the WHERE. If the second part of such an index is dateAdded, we can make use of it for the rest of the filtering on g. Good. But we have to stop here. When we get to a 'range', further fields in the index are useless (exception to follow).
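
If no such index exists yet, it could be added along these lines (the index name is just a label I made up):
    ALTER TABLE reportGoogleKeyword
        ADD INDEX client_date (clientID, dateAdded);    -- hypothetical name; columns per Plan A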

OK, let's say we have INDEX(clientID, dateAdded). What happens next? Next, the optimizer has to figure out how to do the JOIN. Now we are down to
    where  p.dateAdded >= '2010...' and  p.dateAdded <= '2010...'  (range)
      and  p.keyword = (value from g)
Since MySQL almost exclusively uses the "Nested Loop Join" and rarely a "Merge Join", I will assume NLJ. This means that, for each row it has so far found in g, it will probe p to see if it still needs the record, and grab any needed values from p. That is, it will now repeatedly reach into p with keyword=constant and dateAdded in range. Again, using the previous logic, it would hope for INDEX(keyword, dateAdded).
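
Again, if that index is missing, something like this would provide it (the name is just a label):
    ALTER TABLE pageKeywords2
        ADD INDEX keyword_date (keyword, dateAdded);    -- hypothetical name; equality column first, range column second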

But note something else. While it is rummaging through g, it is fetching any values it will need to finish the query. These values may be anywhere in the SELECT. And if they are bulky (TEXT, BLOB), too bad. Those will be hauled around and possibly tossed later due to filtering on p or a subsequent LIMIT.

If possible, it will use a MEMORY table for the intermediate "tmp" table. TEXT, BLOB, and size restrictions could prevent using MEMORY; MyISAM is the fallback. "Using temporary" and "filesort" are imprecise -- they threaten you with MyISAM spilling to disk, but it might be a very efficient MEMORY tmp table.
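
For reference, the size cutoff for keeping that tmp table in MEMORY is the smaller of two settings, which you can check with:
    SHOW VARIABLES LIKE 'tmp_table_size';
    SHOW VARIABLES LIKE 'max_heap_table_size';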

Back to the saga... We have now collected a bunch of g columns and p columns into our tmp table, but we still need to deal with GROUP BY and ORDER BY (no HAVING or LIMIT, this time).

GROUP BY might be done in either of two ways. (I don't have a good feel for when each is chosen.)

GROUP BY 1: Sort the rows, walk through them performing the aggregates (COUNT, SUM, etc) as you go. One row output for each completed 'group'.

GROUP BY 2: (No sort) Keep a hash table, keyed on what you are grouping by. For each row, find the hash element, do the aggregation. In the end, dump the (very-unsorted) hash table.
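
A related detail: today's MySQL implicitly sorts a GROUP BY result by the grouped columns when there is no ORDER BY; adding ORDER BY NULL says you don't care about the order, which frees the optimizer to skip that sort. A sketch (the aggregate is made up, not from your query):
    SELECT  keyword, COUNT(*) AS cnt
        FROM  pageKeywords2
        GROUP BY  keyword
        ORDER BY  NULL;      -- "any order is fine" -- lets MySQL avoid the sort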

ORDER BY comes next (in your situation). Again there are choices.

ORDER BY 1: If the ORDER BY is the same as the GROUP BY, then let's lean toward "GROUP BY 1"; this means we don't have to do anything extra for the ORDER BY. This does not apply for you.

ORDER BY 2: Sort. This may involve another tmp table.
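
That sort is the "filesort" mentioned above; its in-memory work area is controlled by sort_buffer_size, which you can look at with:
    SHOW VARIABLES LIKE 'sort_buffer_size';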

"Using index" -- If ALL the fields of one table that are used ANYWHERE in the SELECT exist in a SINGLE INDEX, then that index MAY be preferred. This is because scanning an index -- without reaching over to the data -- is usually faster. This did not seem practical in your case.

Another note: InnoDB's PRIMARY KEY is embedded in the data, so its characteristics are not like those of other keys.

There are lots more details of what could happen.
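
The quickest way to see which of these choices the optimizer actually made is to put EXPLAIN in front of the query. A sketch (your select list and date constants were abbreviated above, so the SUM column and the dates below are placeholders):
    EXPLAIN
    SELECT  g.position, g.keyword, SUM(p.hits) AS megaSum    -- p.hits is a placeholder
        FROM  reportGoogleKeyword g
        JOIN  pageKeywords2 p  ON p.keyword = g.keyword
        WHERE  g.clientID = 110
          AND  g.dateAdded >= '2010-01-01' AND g.dateAdded <= '2010-01-31'   -- placeholder dates
          AND  p.dateAdded >= '2010-01-01' AND p.dateAdded <= '2010-01-31'
        GROUP BY  g.position, g.keyword
        ORDER BY  megaSum DESC, g.keyword, g.position;
The "key" column shows which index each table ended up using; the "Extra" column is where "Using temporary", "Using filesort", and "Using index" show up.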

The biggest impediment to further optimizing _your_ query is the need to filter on both of the two tables, thereby hauling around the tmp table.
