
Re: GROUP causing Using Temporary Using Filesort
Posted by: Rick James
Date: August 02, 2014 12:44PM

I should start by complimenting you on your thorough grasp of summary tables and the subtle gotchas involved.

Here are my blogs on the topic:
http://mysql.rjweb.org/doc.php/datawarehouse
http://mysql.rjweb.org/doc.php/summarytables
You have reached into areas where I need to augment those documents. Let me see if I can address some of your concerns...

One augmentation I want to make is to break down the task into 3-4 levels:
* At the low end (under 10 inserts/sec into Fact), one can simply do this whenever a row is inserted into the Fact (Detail) table:
INSERT INTO summary ... ON DUPLICATE KEY UPDATE sum_foo = sum_foo + foo, ...;
* At the high end (over 1K inserts/sec), one must land the new rows in a staging table (`tmp` here), pause every 1K rows or 1 minute (or whatever), and do something like this (a fuller sketch follows this list):
INSERT INTO summary SELECT key1, key2, SUM(...), ... FROM tmp GROUP BY key1, key2;
* Plus techniques in between.
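Fleshing out the "high end" flavor, the flow looks roughly like this. (The staging table `tmp` and the measure columns cnt/sum_foo are illustrative, not from your schema.)

-- New rows land in a small staging table; every 1K rows (or 1 minute) flush it:
INSERT INTO summary (key1, key2, cnt, sum_foo)
    SELECT key1, key2, COUNT(*), SUM(foo)
    FROM tmp
    GROUP BY key1, key2;
    -- (add ON DUPLICATE KEY UPDATE ... if summary may already hold these keys; see Plan A below)
TRUNCATE TABLE tmp;
-- (To avoid losing rows that arrive between the INSERT and the TRUNCATE, a common
--  refinement is to first swap staging tables atomically:
--      RENAME TABLE tmp TO tmp_flush, tmp_spare TO tmp;
--  and then summarize from tmp_flush.)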

What I don't know is how to say "a DW application of size ___ falls into level ___".

So far, you have discussed the "high end" solution, but with an unacceptable periodicity (run the updates once a day).

Let me digress into a related issue...

Plan A: The Summary table has a PRIMARY KEY -- (accountid, Rundate, testid) in your case.
Plan B: Duplicate rows are allowed for that key. (Hence it cannot be the PK)

Here's the difference:
Plan A requires one of the following:
* A guarantee that all necessary rows are available when you do the summarization (eg, at midnight). This is folly; something _will_ go wrong, and you will have to repair things eventually.
* An IODKU that folds late-arriving data into the existing rows:
INSERT INTO summary (key1, key2, sum_foo, ...)
    SELECT key1, key2, SUM(foo), ... FROM tmp GROUP BY key1, key2
    ON DUPLICATE KEY UPDATE sum_foo = sum_foo + VALUES(sum_foo), ...;
That (or maybe something similar) will 'usually' insert a new row, but (when old data comes in late) do the IODKU to UPDATE the existing row, based on the PRIMARY KEY.
* If your Fact/Detail table has an AUTO_INCREMENT that is guaranteed monotonically increasing (watch out for multi-master solutions), you can "remember where you left off" and summarize all rows after that; a sketch follows this list. (Note: A _huge_ Fact table cannot afford secondary indexes, so just an auto_inc is best.)
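Here is a sketch of that "left off" bookkeeping. (The `summary_progress` table and the cnt/sum_foo measures are made up for illustration; adjust names to your schema.)

-- Fetch the highest Fact.id already summarized, and the current high-water mark:
SELECT last_id INTO @left_off FROM summary_progress WHERE task = 'daily_cube';
SELECT MAX(id) INTO @hi       FROM Fact;

INSERT INTO summary (accountid, rundate, testid, cnt, sum_foo)
    SELECT accountid, rundate, testid, COUNT(*), SUM(foo)
    FROM Fact
    WHERE id > @left_off AND id <= @hi
    GROUP BY accountid, rundate, testid
    ON DUPLICATE KEY UPDATE
        cnt     = cnt     + VALUES(cnt),
        sum_foo = sum_foo + VALUES(sum_foo);

UPDATE summary_progress SET last_id = @hi WHERE task = 'daily_cube';

(With Plan B, below, you would drop the ON DUPLICATE KEY UPDATE clause and simply insert the straggler rows.)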

Plan B will lead to a few more than 3650 rows in your summary table because it will allow you to insert stragglers the next time you get around to summarizing. You might have
PRIMARY KEY (accountid, Rundate, testid, id) -- `id` makes it unique
INDEX(id) -- where `id` is AUTO_INCREMENT
Note that, in InnoDB, the data is sorted in PK order. This means that SELECTs with, say,
WHERE accountid=... AND rundate >= ... AND rundate < ...
work very efficiently.
Further note that the last technique in Plan A (the "left off" approach) works very well for Plan B, too.
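Pulling Plan B together, the summary table might look something like this (column types and the cnt/sum_foo measures are guesses; keep whatever measures you actually need):

CREATE TABLE summary (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    accountid INT UNSIGNED    NOT NULL,
    rundate   DATE            NOT NULL,
    testid    INT UNSIGNED    NOT NULL,
    cnt       INT UNSIGNED    NOT NULL,
    sum_foo   DOUBLE          NOT NULL,
    PRIMARY KEY (accountid, rundate, testid, id),  -- `id` makes it unique
    INDEX(id)                                      -- satisfies AUTO_INCREMENT cheaply
) ENGINE=InnoDB;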

Here's what I did in at least one project. (It used a resolution of 1 hour for the cube.)
* Hourly cron job to summarize from where I left off.
* The UI would run the same "left off" process "on demand".
Notes:
* The UI always had up-to-the-second tallies.
* The UI was slowed down slightly, but not too much. (This is similar to your UNION suggestion, except that I captured the summarization.)
* I did not have the cron job at first, but this is what happened: After a long weekend, the UI took a long time to summarize many hours worth of stuff. The cron job was a low cost way to prevent the issue.

I don't have a huge amount of experience in these problems -- they crop up only once a year or so. And each one seems to beg for a different solution. So...
* I don't have a strong opinion on whether to use the PK approach or the multiple-row approach.
* IODKU vs periodic summarization vs cron catchup -- This range of choices is driven in complex ways by (1) the size and ingestion speed and (2) the user requirements.

Here's another example that did not fit any of the molds: A multi-GB list of records arrived hourly. This allowed (or required) me to process them in batches. Obviously I had to finish the summarization in less than an hour. (I augmented 5 summary tables in about 7 minutes.) The Fact table grew to 350GB in 90 days (the required retention period). PARTITIONing was necessary for the purging. Replication (for BCP, etc) got tricky. I used ROW replication, MEMORY temp tables, and a few other subtle tricks. The summary tables did not exist in the prototype; 1-hour UI queries became 2-second queries, which was quite acceptable.

Back to your comments...

> the results are fantastic!

That's what I like to hear. (And I expect to hear.)

> The query has gone from 6-8 seconds to just 0.08 seconds.

I typically quote 10x; I'm glad to hear it was 100x.

> that even as I add many days, weeks, and years to the summary table

You seem to understand that the summary table(s) needs to hold only daily rows, and does not also need weekly/monthly rows. Some developers try to have separate daily and weekly summary tables. This is usually a mistake because a weekly summary can be computed from daily data. (I did encounter a case where it was worth summarizing the daily summaries into a weekly table -- but that application had 1TB of data.)
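For example, weekly numbers roll up from the daily summary on the fly (assuming measure columns cnt and sum_foo):

SELECT  accountid, testid,
        YEARWEEK(rundate, 3) AS yr_wk,
        SUM(cnt)     AS total_cnt,
        SUM(sum_foo) AS total_foo
    FROM summary
    WHERE rundate >= '2014-01-01' AND rundate < '2015-01-01'
    GROUP BY accountid, testid, yr_wk;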

Keep in mind that an AVERAGE() over summarized data is SUM(sums)/SUM(counts), not AVG(averages). Standard deviation is also possible; it requires 3 columns: count, sum, and sum-of-squares.
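Concretely, if the summary carries cnt, sum_foo, and sumsq_foo (the sum of foo*foo), the combined aggregates over any date range come out as:

SELECT  SUM(sum_foo) / SUM(cnt)                     AS avg_foo,
        SQRT( SUM(sumsq_foo) / SUM(cnt)
              - POW(SUM(sum_foo) / SUM(cnt), 2) )   AS stddev_foo  -- population std dev
    FROM summary
    WHERE accountid = 123
      AND rundate >= '2014-04-01' AND rundate < '2014-08-01';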

> even after 10 years

I often suggest that the Fact table needs to be purged after N days, but summary tables can be kept "forever" since they are relatively small.

> I do have a question about keeping what I am calling the "cube" table up to date.

I have discussed further alternatives (above, and in the links); ponder them to see what fits best for you. Then we can discuss further.

> The first thing I needed of course was the cube of the detail table that covered all past history.

True. You need a one-time script. Or your cron job can be repeatedly run; see discussion below on why this might work.

> Then to help optimize performance I created this index on it:
> CREATE INDEX detail_stat on detailcube (accountid, rundate);

That leads to the proper choice of the PRIMARY KEY, etc, as discussed above. I generally recommend InnoDB, which works best when you give it a PK as you do the CREATE TABLE. In fact, InnoDB has to stand on its head when you create a table without a PK and then ADD PRIMARY KEY (...) later. And adding a 'secondary' index on an InnoDB table without an explicit PK leads to an inefficient mess.

(Yeah, maybe you are using MyISAM -- but that is almost never advised these days.)
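In CREATE TABLE terms (Plan A flavor), your detail_stat index would simply become the leading part of the PRIMARY KEY; the remaining columns here are placeholders:

CREATE TABLE detailcube (
    accountid INT UNSIGNED NOT NULL,
    rundate   DATE         NOT NULL,
    testid    INT UNSIGNED NOT NULL,
    cnt       INT UNSIGNED NOT NULL,
    sum_foo   DOUBLE       NOT NULL,
    PRIMARY KEY (accountid, rundate, testid)  -- replaces the separate detail_stat index
) ENGINE=InnoDB;
-- Declaring the PK up front avoids the costly rebuild that
-- ALTER TABLE detailcube ADD PRIMARY KEY (...) would trigger later.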

> WHERE rundate >= '2014-04-01' and rundate < '2014-08-01'

Well stated -- you avoided the "midnight bug" wherein you capture (or miss) a midnight (or first of month).
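For contrast, here is the version that bites people -- BETWEEN is inclusive at both ends, so '2014-08-01' (the next period's first day) gets counted in two adjacent ranges:

-- Risky:
WHERE rundate BETWEEN '2014-04-01' AND '2014-08-01'
-- Safe (as you wrote it):
WHERE rundate >= '2014-04-01' AND rundate < '2014-08-01'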

> or something else prevents the scheduled cron job from updating the cube for the previous day.

* Design the cron job so that it always checks to see what it needs to do. It does not blindly do exactly "yesterday".
* The "left off" technique often applies here.
* Design the job so that you can safely run it manually (for whatever reason). This probably implies an interlock to keep two copies from running. (OS: file lock; MySQL: GET_LOCK(); or some other approach. A GET_LOCK() sketch appears below.)
* Run the cron job more often than is needed, but have it do only one unit of work.

This last item needs some explanation.
* If a cron execution is missed (as you suggest), then it will run again soon and do the task.
* Since the cron job checks to see what is needed (and does the 'first' unit of work needed), in the common case it finds nothing to do. You might run the daily cron job once an hour; 23 out of 24 runs will do nothing.
* Why only one unit of work per run? Doing more just clutters up the code.
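And here is a minimal GET_LOCK() interlock for the "safe to run manually" point above (the lock name is arbitrary):

SET @got = GET_LOCK('summarize_detailcube', 0);  -- 0-second timeout: don't queue behind a running copy
-- If @got = 1: we own the lock; do the summarization work, then release it:
--     SELECT RELEASE_LOCK('summarize_detailcube');
-- If @got = 0: another copy is already running; exit quietly.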

I once had a daily task (fired off hourly) that broke badly. After about 3 days some side effects were beginning to show up. After another day, we figured out what went wrong and fixed it. Then, without further effort, the cron job caught up in 4 hours.

> The day's results will be left out of the cube and forever not counted.

I hope I have convinced you that that is solvable.

> Does this UNION approach look solid?

Yes. I think I used it in some project long ago. But now, with the other techniques mentioned above, I may never use it again; the UNION code is messy to write and maintain.
