MySQL Forums
Forum List  »  MyISAM

Re: myisamchk --sort-index
Posted by: Rick James
Date: January 15, 2014 11:41AM

> FROM X JOIN Y USING(ID) WHERE X.theYear=2012

Is that really what you have? Or are further details omitted?

For that case you state, the optimizer will almost definitely start with X, and will need INDEX(theYear, ID). Furthermore, that order would be optimal for sorting the data.

Then the JOIN will need to reach into Y. That necessitates an INDEX(ID) (you probably have PRIMARY KEY(ID), which suffices). Sorting Y data on ID is unnecessary unless the IDs tend to track the years.

To further explain, "count the disk hits". If the indexes and/or data are larger than can be cached in RAM (at 100M rows, they almost certainly are), then the disk needs to be hit to perform the query. If the index or data records are 'adjacent' to each other, then fetching one disk block will get, say, 100 records. I/O is _much_ slower than CPU speed, so this 'clustering' leads to 100x speedup. (Actually, it is more like 10x, which is still worth noting.)

Actually, I do not have enough information to pick between these (for X):
INDEX(theYear, ID)
INDEX(theYear)
Some cases:

Case: SELECT Y.* FROM ... (Note: no X)
Then INDEX(theYear, ID) will be very good. The query will be performed using only the INDEX. The sort order of the data is irrelevant.

Case: SELECT X.*, Y.* FROM ... (Note: need other fields of X)
INDEX(theYear) is slightly better. The operation will have to reach into the data anyway, so having ID in the INDEX buys nothing, rather it makes the index bigger. In this case, sorting the data on theYear (or theYear+ID) is good because will lead to the data clustering I discussed above.

SubCase: As above, but lots of rows have 2012 (more than 30%).
Now the INDEX (neither flavor) will not be used. The optimizer will decide that a table scan is more efficient than bouncing between the index and the data. The sort order of the data will not help.

SubCase: 2012 is 'rare' (less than 10%).
Now either index is useful, and will be used. Sorting the data on theYear (or theYear+ID) is beneficial.

10-30% -- the boundary between the SubCases varies with the phase of the moon.

Caveat: The analysis will be different for InnoDB. With InnoDB, I would ask if (theYear, ID) could be the PRIMARY KEY. If so, then InnoDB is likely to be faster, and all Cases collapse into a single one. Note that there won't be any bouncing between the PK and the data. And the data is _always_ in PK order; no need for myisamchk.
Hint: Consider switching to InnoDB.

Off to another topic... With 100M rows, PARTITIONing _may_ be reasonable. (This applies to either MyISAM or InnoDB, but with different details.)

It is especially desirable if you plan on purging 'old' data. In this case, you would PARTITION BY RANGE(theYear), and use DROP PARTITION instead of DELETE. Much, much, faster.

If you don't have that use case, I need to see more of what is going on before advising on PARTITIONing. Please provide SHOW CREATE TABLE, some clues about how the data is loaded and updated, plus a sampling of the different SELECTs.

Options: ReplyQuote


Subject
Views
Written By
Posted
3222
January 14, 2014 09:00AM
Re: myisamchk --sort-index
2118
January 15, 2014 11:41AM
1962
January 16, 2014 02:58AM
1828
January 16, 2014 11:15PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.