MySQL Forums
Forum List  »  Optimizer & Parser

Re: Optimize sub query with large group by
Posted by: Øystein Grøvlen
Date: March 18, 2011 10:58AM

Alex K Wrote:
-------------------------------------------------------
> Oystein your query works however it takes longer
> and still includes the group by object_id which
> seemed to be what was taking so long.

I feared that. If you look at the plan for my query, it is as follows (MySQL 5.5):

mysql> EXPLAIN SELECT person_id, COUNT(id) FROM events WHERE (object_id, time) IN (SELECT object_id, MAX(time) FROM events GROUP BY object_id) GROUP BY person_id;
+----+--------------------+--------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------+-------+---------------+-----------+---------+------+------+-------------+
| 1 | PRIMARY | events | index | NULL | person_id | 4 | NULL | 8 | Using where |
| 2 | DEPENDENT SUBQUERY | events | index | NULL | object_id | 4 | NULL | 4 | |
+----+--------------------+--------+-------+---------------+-----------+---------+------+------+-------------+
2 rows in set (0.00 sec)

This means that it will scan the events table using the person_id
index and for each row (pers, obj, t) it will execute something like the
following query:

SELECT object_id, MAX(time) FROM events
GROUP BY object_id
HAVING object_id = obj AND max(time) = t;

For this query it will use the object_id index. Hence, it will scan
all rows with the same object_id, looking for the maximum time. In
other words, there will be a lot of partial scans of the object_id
index. I tried adding a (object_id, time) index since in theory it
should be possible to find max(time) faster that way, but this index
did not seem to be used.

We are working on improving the execution of such queries by first
materializing the subquery. Then, for each row of events a direct
lookup on the materialized table using a hash index would give the
result. (See http://forge.mysql.com/worklog/task.php?id=1110.)

>
> I have been testing this on a laptop and hadn't
> increased the buffer_pool_size much. I checked the
> IO and most of it was fetching from disk.
> Thanks Rick.
>
> I tested the same query on the actual server and
> it only took 4 sec compared to 120 sec. This is
> much faster but I'm not sure what I should be
> expecting the speed to be (could it be faster?).
>
> Still wonder if there is a better way to get the
> same result with a different query.

If you look at the query plan for your query:

mysql> EXPLAIN SELECT person_id, COUNT(id) FROM (SELECT id, person_id FROM (SELECT id, person_id,object_id FROM events ORDER BY time DESC ) AS a GROUP BY object_id ) AS subquery GROUP BY person_id;
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | Using temporary; Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 8 | Using temporary; Using filesort |
| 3 | DERIVED | events | ALL | NULL | NULL | NULL | NULL | 8 | Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
3 rows in set (0.00 sec)

You see there is a lot of filesort going on. First, the events table
is sorted on time and stored in a temporary table. This table is then
sorted on object_id to be able to do aggregation, and the result is
stored in a second temporary table. Finally, this temporary table is
sorted to group on person_id.

What impacts the cost of sorting most is the sort_buffer_size. If
there is not enough room in the sort buffer for all rows to be sorted
at once, the data needs to be sorted in several passes and temporary
results needs to be written to disk. Hence, a too small
sort_buffer_size will mean a lot of extra I/O.

When tuning the sort_buffer size, the following command is useful:

SHOW STATUS LIKE "sort%";

The reported number of "sort_merge_passes" tells you how many passes
are needed to sort the data. Usually, the lower the better.
(Remember to do "FLUSH STATUS;" between queries to reset the
statistics.)

Øystein Grøvlen,
Senior Principal Software Engineer,
MySQL Group, Oracle,
Trondheim, Norway

Options: ReplyQuote


Subject
Views
Written By
Posted
3854
March 10, 2011 07:44PM
Re: Optimize sub query with large group by
3579
March 18, 2011 10:58AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.