MySQL Forums
Forum List  »  General

Re: Optimised way to search over 2 milllion poi data in mysql
Posted by: Rick James
Date: September 07, 2014 10:40AM

> So my final ...

Let's see the whole thing...
SHOW CREATE TABLE (after adding the indexes)
SHOW TABLE STATUS (with 2M rows)
SHOW VARIABLES LIKE '%buffer%';

> What do you mean by 40K in each lat and long?

If you have 2M rows and you have "WHERE lat BETWEEN ... AND ...", then there might be about 40K rows satisfying the `lat` constraint or the equivelent `long` constraint. (40K was pulled out of a hat. Do some SELECT COUNT(*)'s with your data to see what a realistic number is.) With INDEX(lat, long), it will have to look at _all_ 40K rows in the INDEX that satisfy the lat constraint, checking each for the `long` constraint. That is, it will have to scan 40K's worth of blocks in the INDEX; this is likely to be on the order of 100 blocks. In a huge table (not a mere 2M), this would be 100 I/Os, and take a second or so. The 40K rows would be consecutive ("clustered"), so there is some efficiency -- better that it be 100 I/Os than 40K I/Os. (INDEX(lat, long) might be able to hold about 400 rows in each 16KB block. 40K/400=100.)

"Clustered" in this context means that the rows you need will be next to each other in some sense. In the case of InnoDB, the PRIMARY KEY is 'clustered' with the data, and 'consecutive' PKs are consecutive in the table. Since InnoDB puts rows into 16KB blocks, and there are (typically) about 100 rows per block in the Data BTree, this means that one blocked fetched from disk will contain 100 rows with PKs near each other.

In the case of an AUTO_INCREMENT for the PK, this means that rows inserted about the same time will all be in the same block; rows inserted at a different time are likely to be in a different block.

In the case of PRIMARY KEY(`long', ...), it means that records near each other in an East-West sense will be near each other in the table.

In the case of INDEX(lat, long), it means that rows in the Index BTree (not in the Data BTree -- keep in mind that the data is in one BTree with the PRIMARY KEY; each secondary index is in a separate BTree) with similar North-South values of latitude will be near each other, and have some chance of being in the same block. (The fields after the first in the INDEX(...) are mostly irrelevant.)

I'm being a bit loose with applying "clustered" to PARTITIONing. But let's analyze it. Suppose you had one partition per degree of Latitude. That would be about 180 partitions (which is too many), then all the rows with latitude between 40N and 41N would be in the same partition; if this partition is 1% of the total then 1% of the rows are all 'together' in the same set of blocks that make up that partition. In a 100GB table that would be 1GB in that one partition. If that (somehow) made it so you did not have to to a table scan of 100GB, but instead do a 1GB scan of that partition, that would be a big performance improvement. (In this paragraph, PARTITIONing on lat is no better than INDEX(lat). PARTITIONing only becomes useful when something else is going on.)

My technique first takes advantage of "partition pruning" to limit the work to one (or a few) partitions based on latitude. Then it takes advantage of the PRIMARY KEY clustering in InnoDB to not have to scan far within the partition.

Back to your suggestions...
PRIMARY KEY(id) -- clusters on when things were inserted; this is likely to be irrelevant unless you inserted rows in, say, postal code order.
INDEX(lat, long) + WHERE lat BETWEEN ... AND ... (etc) -- This is somewhat equivalent to narrowing down the 100GB table to 1GB to scan. However, this is scanning in the INDEX, then it has to jump (randomly?) into the Data to further look at the rows. Inefficiency #1: It handles only one dimension, not two. Inefficiency #2: It requires many random lookups after it has narrowed things down to the latitude range.

Options: ReplyQuote


Subject
Written By
Posted
Re: Optimised way to search over 2 milllion poi data in mysql
September 07, 2014 10:40AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.