MySQL Forums
Forum List  »  Optimizer & Parser

Random-row queries
Posted by: James Yopp
Date: April 29, 2007 06:35PM

I want to propose an addition to the optimizer...

I have an app that pulls random rows from the database, and have found it is incredibly slow and computationally wasteful to get rows using the suggested query style:

SELECT * FROM Table WHERE <<conditions>> LIMIT 5 ORDER BY RAND()

I have made this operation literally hundreds of times faster (at least for my table of tens of thousands of rows) by using the following php code in the app that accesses the database:

$max = $db->getScalar("SELECT Max(ID) FROM Table");
$result = array();
while (count($result) < 5) {
___ $id = mt_rand(1, $max);
___ $t = $db->getOne("SELECT * FROM Table WHERE <<conditions>> AND ID=$id");
___ if ($t && !array_key_exists($t->ID, $result)) {
______ $result[$t->ID] = $t;
___ }
}

At the end of this code, the $result variable will contain five random entries matching these conditions. But, it will have executed an absolute minimum of six queries against the database (many more if the key space is sparsely populated). And, if any joins are involved, it gets especially wasteful (and may require the creation of a temporary table with the results being sought).

As I mentioned, this is already hundreds of times faster than ORDER BY RAND(). However, it is difficult to figure out and difficult to use, and still potentially very wasteful. The algorithm I am proposing should be used when the pattern "LIMIT X ORDER BY RAND()" is detected, and X is small relative to the expected resultset size:

1. Select a random, previously unselected primary key from the first table in the execution plan. If there are no such unselected primary keys, return the resultset as-is.
2. Evaluate any conditions related to the primary table. If the row does not meet these conditions, return to step 1.
3. Evaluate any joins, and check conditions on those tables. If a non-left join fails, or a condition fails, return to step 1. If any join results in multiple rows, immediately choose one row arbitrarily before performing the next join.
4. Add the row to the resultset. If the resultset has reached the desired size, return it to the client.

Step (1) could be optimized by using a static array; Create an array of row pointers of the appropriate size. Access the row pointers in primary b-tree order, but add them to the array in random positions. If a position is occupied, put the row pointer in the first available position that either precedes or follows that position, wrapping around. Then, simply iterate over the positions until the row count is satisfied, or there are no more rows. If duplicates are to be allowed, items should be added in order to the array and chosen randomly.

This will have some amount of variability in execution time, but it should be significantly less resource-intense than even my home-grown version.

Given the usefulness of random-row results for websites, and MySQL's widespread use for such purposes, I think this would be a good addition to the optimizer and execution engine. It would be especially helpful for choosing random advertisements, articles, gallery pictures, blog posts, etc.

Options: ReplyQuote


Subject
Views
Written By
Posted
Random-row queries
5756
April 29, 2007 06:35PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.