
Re: GROUP causing Using Temporary Using Filesort
Posted by: RT Lee
Date: August 02, 2014 08:23AM

Hi Rick,

Thank you so much. I was quickly able to implement a summary table as you suggested, and the results are fantastic! The query has gone from 6-8 seconds to just 0.08 seconds. The great thing about this solution is that even as I add days, weeks, and years to it, it will stay just about as fast, because even after 10 years the summary holds only about 3,650 rows per account and test, no matter how many rows make up an account's detail each day. Nice!

I do have a question about keeping what I am calling the "cube" table up to date. Before I get to that, I will first explain the approach I took so you can let me know if it's good or could be better. Then I will ask about maintaining the cube.

The first thing I needed, of course, was a cube of the detail table covering all past history. This was quite easy to construct:

CREATE TABLE detailcube AS
SELECT s.accountid, CAST(s.lastupdate AS DATE) AS RunDate,
testid AS TestID,
SUM(CASE WHEN resultcode = 1 THEN 1 ELSE 0 END) AS Category1,
SUM(CASE WHEN resultcode = 2 THEN 1 ELSE 0 END) AS Category2,
SUM(CASE WHEN resultcode = 0 THEN 1 ELSE 0 END) AS Category3
FROM detail d
JOIN summary s ON s.runid = d.runid
WHERE s.lastupdate < '2014-08-02 00:00:00'
GROUP BY s.accountid, RunDate, testid;

Then to help optimize performance I created this index on it:
CREATE INDEX detail_stat on detailcube (accountid, rundate);

Now within 100ms I can get the same count results as before just by specifying an account ID and date range:

SELECT testid AS TestID, SUM(Category1) AS Category1, SUM(Category2) AS Category2, SUM(Category3) AS Category3
FROM detailcube
WHERE rundate >= '2014-04-01' AND rundate < '2014-08-01'
AND accountid = 'xyz'
GROUP BY testid;


This brings up the question of how best to keep the cube up to date, and I'd like your opinion on that, please. Creating that table from the historical data was just a one-time step to get the back data. I'm thinking that from here I will have a nightly batch job that kicks off at midnight and does this:

INSERT INTO detailcube
SELECT s.accountid, CAST(s.lastupdate AS DATE) AS RunDate,
...
same query as before
SUM(CASE WHEN resultcode = 0 THEN 1 ELSE 0 END) AS Category3
FROM detail d
JOIN summary s ON s.runid = d.runid
WHERE s.lastupdate >= YESTERDAY AND s.lastupdate < TODAY
GROUP BY s.accountid, RunDate, testid;

The purpose of this query is to add each day's summary of the detail to the cube table as soon as the day completes. This will run nightly. So, for example, on Thursday at 12:00 AM it will add all of Wednesday's data to the cube. I could do this each hour instead of each day, but for our current data size that may be more often than we need.
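For concreteness, here is roughly how I picture the placeholders being filled in (just a sketch; the cron wrapper isn't shown, and it assumes the job starts right after midnight so CURDATE() already refers to the new day):

-- YESTERDAY / TODAY written as explicit date bounds;
-- assumes the job runs just after midnight, so CURDATE() - INTERVAL 1 DAY is the day being loaded
INSERT INTO detailcube
SELECT s.accountid, CAST(s.lastupdate AS DATE) AS RunDate,
testid AS TestID,
SUM(CASE WHEN resultcode = 1 THEN 1 ELSE 0 END) AS Category1,
SUM(CASE WHEN resultcode = 2 THEN 1 ELSE 0 END) AS Category2,
SUM(CASE WHEN resultcode = 0 THEN 1 ELSE 0 END) AS Category3
FROM detail d
JOIN summary s ON s.runid = d.runid
WHERE s.lastupdate >= CURDATE() - INTERVAL 1 DAY
AND s.lastupdate < CURDATE()
GROUP BY s.accountid, RunDate, testid;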

This brings up two main concerns. First, what happens if the machine is down for maintenance briefly one night, or something else prevents the scheduled cron job from updating the cube for the previous day? That day's results will be left out of the cube and never counted.

The second concern: let's say that nightly maintenance takes 5 minutes to run each night. That means any reports run during those 5 minutes after midnight will be missing the previous day's data. I guess I could add logic that checks the max(rundate) in the cube and uses it to expand the non-cubed part of the query (which I'll explain next), but that could (?) result in rows being counted twice if the cron job's INSERT finishes before an ongoing SELECT completes.
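To make that max(rundate) idea concrete, here is a rough catch-up version of the nightly load I have in mind (only a sketch; the @last_loaded variable is just for illustration, and it assumes whole days are always loaded together, with nothing after max(rundate) partially inserted):

-- remember the last day already present in the cube
SET @last_loaded = (SELECT MAX(rundate) FROM detailcube);

-- re-derive everything from the day after that, up to but not including today,
-- so a night the job missed is picked up automatically on the next run
INSERT INTO detailcube
SELECT s.accountid, CAST(s.lastupdate AS DATE) AS RunDate,
testid AS TestID,
SUM(CASE WHEN resultcode = 1 THEN 1 ELSE 0 END) AS Category1,
SUM(CASE WHEN resultcode = 2 THEN 1 ELSE 0 END) AS Category2,
SUM(CASE WHEN resultcode = 0 THEN 1 ELSE 0 END) AS Category3
FROM detail d
JOIN summary s ON s.runid = d.runid
WHERE s.lastupdate >= @last_loaded + INTERVAL 1 DAY
AND s.lastupdate < CURDATE()
GROUP BY s.accountid, RunDate, testid;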

The other part of all this is that when a user pulls up a report on their account, in addition to showing these counts for the past, it needs to include the results from "today". And "today" is not in the cube until tomorrow.

So this means my query to pull their counts needs to combine the counts from the cube with the counts from today's data, which is not yet in the cube. To do this latter part I can use the query I use now. This will be a little slower since it may scan 5000-10000 rows, but still fast enough. And if it grows bigger I will need to create an hourly cube that is used until that data is swept into the daily cube. But that's a level of complexity I don't need yet.

So, putting this all together, here is the SQL I came up with: it gathers the counts from the cube, plus the counts from today that are not yet in the cube, combined via a UNION:

SELECT TestID, SUM(Category1) AS Category1, SUM(Category2) AS Category2, SUM(Category3) AS Category3
FROM
(
SELECT testid AS TestID, SUM(Category1) AS Category1, SUM(Category2) AS Category2, SUM(Category3) AS Category3
FROM detailcube
WHERE rundate >= START_DATE AND rundate < TODAY
AND accountid = 'xyz'
GROUP BY testid

UNION

SELECT testid AS TestID,
SUM(resultcode = 1) AS Category1,
SUM(resultcode = 2) AS Category2,
SUM(resultcode = 0) AS Category3
FROM detail d
JOIN summary s ON s.runid = d.runid
WHERE s.accountid = 'xyz'
AND s.lastupdate >= TODAY
AND s.lastupdate < TOMORROW
GROUP BY testid
) AS foo
GROUP BY TestID;

Does this UNION approach look solid? In my testing it runs very quickly, even with the non-cubed part included (since that part typically covers just a few thousand rows instead of the 500,000+ rows the query scanned before the cube).

Also, I'd love to hear your thoughts on this approach overall, how to handle the concerns I mentioned above, and whether there are any other optimizations or changes you'd suggest to what I have so far.

Thank you again for the excellent help! It is really exciting to have taken the performance to just 100ms on any date range!
