Re: query not using index
Posted by: Rick James
Date: November 09, 2013 03:42PM

> I can understand that, however, shouldn't it ultimately point out that it *is* utilizing the index ?

It would -- if it actually decided to use the index. If more than 20% of the index needs to be scanned, the optimizer is likely to decide not to bother bouncing back and forth between the index and the data; instead it is likely to be faster to simply scan the data. (The "20%" varies with the phase of the moon.)
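You can check which way it jumped by comparing EXPLAIN with and without forcing the index. (I'm guessing at table/index names here; substitute your own.)

    -- What did the optimizer pick on its own?
    EXPLAIN SELECT *  FROM gps
        WHERE lat BETWEEN 40.0 AND 41.0;
    -- What would the plan cost if it were made to use the index?
    EXPLAIN SELECT *  FROM gps FORCE INDEX(lat_idx)
        WHERE lat BETWEEN 40.0 AND 41.0;

If the first says ALL (table scan) and the forced one estimates a big fraction of the table in 'rows', the optimizer decided the index was not worth it.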

> taking out the known trouble maker SQL_CALC_FOUND_ROWS

Yeah, that can be a "trouble maker". It effectively requires finishing the query as if there were no LIMIT, but without delivering any data. It may not be totally bad -- if it can use an index to finish the query (which is not your case).
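The usual escape is to split it into two queries; the separate COUNT(*) can often run entirely in an index (if one covers the WHERE). A sketch -- names and bounds are placeholders:

    -- One pass, but must finish the whole result set:
    SELECT SQL_CALC_FOUND_ROWS g.*  FROM gps AS g
        WHERE lat BETWEEN ? AND ?  AND lng BETWEEN ? AND ?
        ORDER BY `time` DESC  LIMIT 100;
    SELECT FOUND_ROWS();

    -- Two passes; the COUNT(*) can stay inside a covering index:
    SELECT g.*  FROM gps AS g
        WHERE lat BETWEEN ? AND ?  AND lng BETWEEN ? AND ?
        ORDER BY `time` DESC  LIMIT 100;
    SELECT COUNT(*)  FROM gps
        WHERE lat BETWEEN ? AND ?  AND lng BETWEEN ? AND ?;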

> Well I could... if I have had the uid as a firsthand filtering condition, which I don't.

That's not the issue.
(1) hauling around g.* in a temp table is costly.
(2) if an index has _all_ the fields of the subquery (lat, lng, time, uid), it could be used, thereby minimizing the bulk to haul around -- even if it has to do a full index scan.
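Concretely, something like this (assuming the table behind `g` is named `gps`; adjust to your schema):

    ALTER TABLE gps
        ADD INDEX lat_lng_time_uid (lat, lng, `time`, uid);

With that in place, EXPLAIN should show "Using index" for the subquery -- the data rows are never touched until the final lookups.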

Is there a "LIMIT 100" that you have failed to mention??

The "99963.00%" is bogus -- brought on by the LIMIT. (I wish the developers would 'fix' it.)

> along some space partitioning curve such as the hilbert curve

My previous 'best' lat/lng solution involved "Z-ordering". I now feel that the link I gave you is the best available. Akiban (now a part of FoundationDB) actually implemented a Z-order type of index.

The lat/lng problem (at large scale) is all about "counting the disk hits". Let me explain why all the solutions you have tried are costly. (And many others have tried and failed.)
INDEX(lat, lng, ...) with
WHERE lat BETWEEN ... AND ... AND lng BETWEEN ... AND ...
will necessarily scan _all_ the records for "lat BETWEEN ...". That is likely to be a sizable percentage of your dataset, and it will have to throw out a sizable percentage of what it scans in order to filter on lng. No flavor of simple BTree indexing can avoid scanning all the lngs for those lats.
That discusses one possible INDEX; other variants have similar arguments and failings.
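To see the effect, force that index and look at the 'rows' estimate (made-up numbers and names):

    -- The BTree can seek to the start of the lat range, but from there it
    -- must walk every entry up to the end of the range, testing lng on each:
    EXPLAIN SELECT uid  FROM gps FORCE INDEX(lat_lng)
        WHERE lat BETWEEN 40.0 AND 41.0       -- say, 1M rows in this stripe
          AND lng BETWEEN -74.5 AND -73.5;    -- even if only 10K survive
    -- 'rows' reflects the whole lat stripe, not the small lat*lng box.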

Arranging it so that all the searching is in one table (or one index, as I suggested with INDEX(lat, lng, time, uid)) only partially helps. It helps because the index is narrower (less bulky, hence fewer disk hits). It is only 'partial' because it still has to scan a lot of records. Similarly, ICP (Index Condition Pushdown) is only a "partial" solution.

The ultimate goal is to fetch only the necessary rows. Your solutions involve fetching, then filtering out, lots of rows. My solution comes closer.

And another point (again, relating to "counting the disk hits")...
When consecutive rows (of data, via an InnoDB PRIMARY KEY, or in an index) are fetched, you find many (Rule of Thumb: 100) rows per block. A "block" is a unit of I/O. Minimizing disk hits means minimizing the number of blocks fetched. A 'random' record is likely to get only one row per disk hit. Where am I headed with this point?
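Rough arithmetic (assuming InnoDB's default 16KB block and rows of ~160 bytes -- your row size will differ):

    16384 bytes/block / ~160 bytes/row  =  ~100 rows/block
    10,000 consecutive rows  =  ~100 blocks    =  ~100 disk hits
    10,000 scattered rows    =  ~10,000 blocks =  ~10,000 disk hits

That is two orders of magnitude between the two access patterns.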

Let's analyze INDEX(lat, ...) with your WHERE...
(1) It is getting ~100 rows per disk read (good)
(2) It is tossing many of the rows because lng does not match (not so good)
(3) It is reaching into the data, probably randomly, to get g.* -- lots more than the 100 ultimately desired (even worse)
(4) Copying to a temp table -- (more bad)
(5) Sorting on time
(6) Finally delivering only 100 rows.
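In query form, that plan corresponds roughly to this (a reconstruction -- I'm guessing at the names):

    SELECT SQL_CALC_FOUND_ROWS g.*   -- (3) g.* forces a data-row lookup per match,
    FROM gps AS g                    --     probably a random disk hit each time
    WHERE lat BETWEEN ? AND ?        -- (1) index range scan, ~100 rows/block
      AND lng BETWEEN ? AND ?        -- (2) most rows tossed here
    ORDER BY `time` DESC             -- (4)(5) temp table + filesort
    LIMIT 100;                       -- (6) all that work for 100 rows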

My partial solution (INDEX(lat, lng, time, uid))...
(1) It is getting ~100 rows per disk read (good)
(2) It is tossing many of the rows because lng does not match (not so good)
(4) Copying only (time, uid) to a temp table -- (not too bad)
(5) Sorting on time
(6) Finally delivering only 100 rows.
The removal of step (3) may be significant. But steps 2, 4, 5, and 6 are still bulked up.
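In query form, the partial solution pushes the heavy lifting into a covering subquery (again, names assumed):

    SELECT g.*
    FROM ( SELECT uid                    -- only (time, uid)-sized rows hauled around
           FROM gps FORCE INDEX(lat_lng_time_uid)
           WHERE lat BETWEEN ? AND ?
             AND lng BETWEEN ? AND ?
           ORDER BY `time` DESC
           LIMIT 100
         ) AS ids
    JOIN gps AS g  USING(uid)            -- only 100 data-row lookups
    ORDER BY g.`time` DESC;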

My solution (in the link) involves...
(1) Reads a small number of blocks to get all the uids necessary. The algorithm is carefully aimed at minimizing blocks read. (The details are in the complexity of the algorithm and in the structure of the two tables.)
(2) Randomly fetching (via uid) the 100 rows desired. No more than the 100.
Adding SQL_CALC_FOUND_ROWS back in will hurt, but probably not as badly as it does for you now.
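Step (2), by itself, is cheap. Once step (1) has the uids (say, in a temp table -- `tmp_uids` here is made up), the fetch is just:

    SELECT g.*
    FROM tmp_uids
    JOIN gps AS g  USING(uid)   -- at most 100 point lookups, ~1 block each
    ORDER BY g.`time` DESC;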

Call me arrogant if you like, but I know what I am talking about.
