MySQL :: Re: Strategy needed for picking random rows

New Topic

Re: Strategy needed for picking random rows

Posted by: Rick James
Date: September 09, 2011 08:51AM

Scenario 1, without the subquery:
1. The WHERE + ORDER BY + LIMIT without a 'perfect' index will lead to creating a temporary table of all the rows matching the WHERE clause. This may be a lot more than 200 rows. It will have all the columns that need to be returned, including the 'bulky' columns even for rows that will later be tossed (because of the LIMIT). So, it may physically be quite large.
2. That large temp table is sorted (to achieve the ORDER BY).
3. 200 rows are delivered.

Scenario 2, with the subquery:
1. In the subquery, the WHERE + ORDER BY + LIMIT without a 'perfect' index will lead to creating a temporary table of all the rows matching the WHERE clause. This may be a lot more than 200 rows. However, in this Scenario, it will have only the id and what is needed for the ORDER BY, so it won't be very bulky.
2. The subquery will create another temporary table. It will contain only (up to) 200 rows, with only the ids -- so very small. And it will use the MEMORY engine (see below).
3. The outer query JOINs those 200 ids to the other table (actually the same table) 200 times to reach for the 'bulky' stuff. These rows are delivered as they are built (no further tmp tables).

Scenario 2 is more complex, and has more steps, however it _may_ be faster...

The key difference is how many "bulky" cells are shoveled around, and how often. In the Scenario 1, it could be thousands (millions?) of rows, multiple times (the sort). In Scenario 2, only 200 rows, and only once.

Now to discuss caching... If you have enough RAM, and the caching parameters are tuned 'right', and the table is already cached, and the temp tables can also fit in ram, then the Scenario 1 won't be much slower than Scenario 2. But if any of those fail, the difference could be big.

See
http://mysql.rjweb.org/doc.php/memory
for discussion of tuning the caches.

The "perfect" index... First, note that the "data" is kept in one place, and each "index" is kept in another. If the all the fields needed for the WHERE (etc) in the inner select exist in a single, compound, index, then only the index, not the data, need by scanned. When you add haircolor to the WHERE clause (but not the INDEX), it now has to read (but not shovel around) the data. So, there will be a slowdown (for adding haircolor), but still Scenario 2 _may_ be faster.

There are too many factors for me to be more definitive. You are best off timing both approaches. Even that could be inconclusive because of changes in what happens to be cached at the time you run a test.

Another 'factor'... Temporary tables try to use MEMORY (faster), but may have to fall back on MyISAM (slower). Reasons for using MyISAM: temp table too big; use of TEXT or BLOB; etc. MEMORY also inflates the storage requirements by turning VARCHAR into CHAR, thereby hastening the switch to MyISAM. MEMORY does not touch the disk. MyISAM might hit the disk, but it is difficult to predict how much.

Hitting the disk is a big factor in performance.

Navigate: Previous Message• Next Message

Options: Reply• Quote

Subject

Views

Written By

Posted

Strategy needed for picking random rows

3422

Joachim Berger

September 07, 2011 11:27AM

Re: Strategy needed for picking random rows

1426

Øystein Grøvlen

September 08, 2011 01:08AM

Re: Strategy needed for picking random rows

1200

Joachim Berger

September 08, 2011 01:17AM

Re: Strategy needed for picking random rows

1170

Rick James

September 08, 2011 07:51AM

Re: Strategy needed for picking random rows

1250

Joachim Berger

September 08, 2011 08:05AM

Re: Strategy needed for picking random rows

1209

Rick James

September 09, 2011 08:51AM

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.