Best way to store/access large amounts of data?
Posted by: Mike Cook
Date: June 10, 2009 07:52AM

I'm trying to create a database to store a large amount of data. All of the data is time-based: some values are daily (ex: the closing price of a stock), some are hourly (ex: the temperature) and some are 5-minute data (ex: the speed of a vehicle queried every 5 minutes).

I thought the way that made the most sense was to store all the data in the same table and use two columns to distinguish each datapoint: frequency and type (i.e. the source). I have a lot of 5-minute data, so as you can imagine the table grows fairly rapidly (ex: 288 values per day * 15 points * 30 days = 129,600 rows/month for the 5-minute data alone).

Here is what I came up with to store this data:
CREATE TABLE `data` (
  `ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `freq` enum('5min','hourly','daily') NOT NULL,
  `type` enum('5a','h1','h2','h3','h4') NOT NULL,
  `year` smallint(6) unsigned NOT NULL,
  `month` tinyint(4) unsigned NOT NULL,
  `day` tinyint(4) unsigned NOT NULL,
  `hour` tinyint(4) unsigned NOT NULL DEFAULT '0',
  `5min` tinyint(4) unsigned NOT NULL DEFAULT '0',
  `value` float DEFAULT NULL,
  `inserted` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`ID`),
  KEY `freq` (`freq`),
  KEY `type` (`type`),
  KEY `year` (`year`),
  KEY `month` (`month`),
  KEY `day` (`day`)
) ENGINE=MyISAM ;

'type' has many more values in my version, but I'm keeping it simple for this post.

In case you're wondering, I separated year, month, day, hour and 5min out (as opposed to just using a datetime column) because I often need to query against those values, and it seemed like it would be faster than having to calculate them on the fly (i.e. DAY(datecol), MONTH(datecol), etc.) - but maybe I'm wrong.
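
For comparison, here is roughly what the datetime version I decided against would look like (untested sketch - `data2` and `ts` are just placeholder names). The point in its favor is that with an index on (`type`,`ts`), pulling a day or an hour is a plain range scan, so no DAY()/MONTH() calls ever touch the column:

CREATE TABLE `data2` (
  `ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `freq` enum('5min','hourly','daily') NOT NULL,
  `type` enum('5a','h1','h2','h3','h4') NOT NULL,
  `ts` datetime NOT NULL,
  `value` float DEFAULT NULL,
  `inserted` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`ID`),
  KEY `type_ts` (`type`,`ts`)
) ENGINE=MyISAM ;

-- one day of 'h1' as a range scan; no functions on the indexed column
SELECT `value` FROM `data2`
WHERE `type`='h1' AND `ts` >= '2009-06-10 00:00:00' AND `ts` < '2009-06-11 00:00:00';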

This structure seems to work fine for grabbing chunks of data for a day or hour or whatever, but I run into problems when things start getting more complicated.

Let's say I want to get 'h1','h2', and 'h3' by hour for a given day. I have to do something like this: [EX: A]
SELECT d1.`value` AS h1, d2.`value` AS h2, d3.`value` AS h3
FROM `data` d1, `data` d2, `data` d3
WHERE d1.`year`=2009 AND d1.`month`=6 AND d1.`day`=10 AND
  d1.`year`=d2.`year` AND d1.`year`=d3.`year` AND
  d1.`month`=d2.`month` AND d1.`month`=d3.`month` AND
  d1.`day`=d2.`day` AND d1.`day`=d3.`day` AND
  d1.`hour`=d2.`hour` AND d1.`hour`=d3.`hour` AND
  d1.`type`='h1' AND d2.`type`='h2' AND d3.`type`='h3'
ORDER BY d1.`year`, d1.`month`, d1.`day`, d1.`hour`

Ok, that's no problem, although it gets a bit long as I add columns to the result set (since I have to keep adding self-joins).
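
One alternative I've been toying with to avoid stacking self-joins is conditional aggregation: pull all three types in one pass and pivot them with CASE inside MAX(). Untested sketch of the same result as EX: A (MAX() is safe here because there's at most one row per type per hour):

SELECT `hour`,
    MAX(CASE WHEN `type`='h1' THEN `value` END) AS h1,
    MAX(CASE WHEN `type`='h2' THEN `value` END) AS h2,
    MAX(CASE WHEN `type`='h3' THEN `value` END) AS h3
FROM `data`
WHERE `year`=2009 AND `month`=6 AND `day`=10 AND `type` IN ('h1','h2','h3')
GROUP BY `hour`
ORDER BY `hour`

One behavioral difference: EX: A's joins drop any hour where one of the three types is missing, while this version keeps the hour and leaves a NULL in that column.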

Let's get a little more complicated. Now I want to grab that same hourly data (plus a fourth series, 'h4') AND the average of the 5-minute data for the same time period. This seems to work OK: [EX: B]
SELECT d1.`hour`, d1.`value` AS h1, d2.`value` AS h2, d3.`value` AS h3, d4.`value` AS h4, ROUND(AVG(d5.`value`),2) AS 5aAVG
FROM `data` d1, `data` d2, `data` d3, `data` d4, `data` d5
WHERE d1.`year`=# AND d1.`month`=# AND d1.`day`=# AND
    d1.`year`=d2.`year` AND d1.`year`=d3.`year` AND d1.`year`=d4.`year` AND d1.`year`=d5.`year` AND
    d1.`month`=d2.`month` AND d1.`month`=d3.`month` AND d1.`month`=d4.`month` AND d1.`month`=d5.`month` AND
    d1.`day`=d2.`day` AND d1.`day`=d3.`day` AND d1.`day`=d4.`day` AND d1.`day`=d5.`day` AND
    d1.`hour`=d2.`hour` AND d1.`hour`=d3.`hour` AND d1.`hour`=d4.`hour` AND d1.`hour`=d5.`hour` AND
    d1.`type`='h1' AND d2.`type`='h2' AND d3.`type`='h3' AND d4.`type`='h4' AND d5.`type`='5a'
GROUP BY d5.`year`, d5.`month`, d5.`day`, d5.`hour`
ORDER BY NULL

But here's where things get *REALLY* nasty. Some of these values (particularly the 5-minute ones) may get updated later on. I *ONLY* want the newest. For example, let's say we have this:

+--------+------+------+------+-------+-----+------+------+-------+---------------------+
| ID     | freq | type | year | month | day | hour | 5min | value | inserted            |
+--------+------+------+------+-------+-----+------+------+-------+---------------------+
|   6673 | 5min | 5a   | 2009 |     6 |   2 |    5 |    1 |   0.1 | 2009-06-08 16:55:37 |
|   6674 | 5min | 5a   | 2009 |     6 |   2 |    5 |    2 |   0.3 | 2009-06-08 16:55:37 |
|   6675 | 5min | 5a   | 2009 |     6 |   2 |    5 |    3 |   0.3 | 2009-06-08 16:55:37 |
|   6676 | 5min | 5a   | 2009 |     6 |   2 |    5 |    4 |   0.4 | 2009-06-08 16:55:37 |
| 107279 | 5min | 5a   | 2009 |     6 |   2 |    5 |    5 |  2000 | 2009-06-10 00:21:11 |
|   6677 | 5min | 5a   | 2009 |     6 |   2 |    5 |    5 |   0.5 | 2009-06-08 16:55:37 |
|   6678 | 5min | 5a   | 2009 |     6 |   2 |    5 |    6 |     1 | 2009-06-08 16:55:37 |
|   6679 | 5min | 5a   | 2009 |     6 |   2 |    5 |    7 |     1 | 2009-06-08 16:55:37 |
|   6680 | 5min | 5a   | 2009 |     6 |   2 |    5 |    8 |     1 | 2009-06-08 16:55:37 |
|   6681 | 5min | 5a   | 2009 |     6 |   2 |    5 |    9 |  1.88 | 2009-06-08 16:55:37 |
|   6682 | 5min | 5a   | 2009 |     6 |   2 |    5 |   10 |  1.88 | 2009-06-08 16:55:37 |
|   6683 | 5min | 5a   | 2009 |     6 |   2 |    5 |   11 |   2.8 | 2009-06-08 16:55:37 |
|   6684 | 5min | 5a   | 2009 |     6 |   2 |    5 |   12 |  1.88 | 2009-06-08 16:55:37 |
+--------+------+------+------+-------+-----+------+------+-------+---------------------+

The query above [EX: B] will show this as the average: 154.85
However, the correct average should be: 167.71
because only the newest value for 5min #5 (the 2000 in row 107279, not the older 0.5) should be used.

So now I can't figure out an efficient way to query the average of the "newest unique values" AS WELL as the rest.
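
Before I get into what I tried: as far as I know, on this MySQL version there are only two general patterns for "newest row per group" - joining against a MAX(`inserted`) subquery (what I try below), or a self-anti-join like this untested sketch (it assumes two rows for the same slot never share the exact same `inserted` value, which holds in my data):

SELECT d1.*
FROM `data` d1
LEFT JOIN `data` d2
    ON  d2.`type`=d1.`type` AND d2.`year`=d1.`year` AND d2.`month`=d1.`month` AND
        d2.`day`=d1.`day` AND d2.`hour`=d1.`hour` AND d2.`5min`=d1.`5min` AND
        d2.`inserted` > d1.`inserted`
WHERE d1.`type`='5a' AND d2.`ID` IS NULL

A row survives only if no newer row exists for the same (type, year, month, day, hour, 5min) slot.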

Suppose I query JUST the average. I tried this: [EX: C]
SELECT d1.`day`, d1.`hour`, ROUND(AVG(d1.`value`),2) AS 5aAVG
FROM `data` AS d1 JOIN (
    SELECT `year`, `month`, `day`, `hour`, `5min`, MAX(`inserted`) AS `inserted`
    FROM `data`
    WHERE `type`='5a' AND `year`=2009 AND `month`=6
    GROUP BY `year`, `month`, `day`, `hour`, `5min`
) AS q1
USING (`year`,`month`,`day`,`hour`,`5min`,`inserted`)
WHERE d1.`type`='5a' AND d1.`year`=2009 AND d1.`month`=6
GROUP BY d1.`year`, d1.`month`, d1.`day`, d1.`hour`

and it takes almost 2 seconds to run (I had 6 days' worth of data for month=6 when I tried this).
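
One thing I notice in the EXPLAIN output at the bottom is that every access path uses only the single-column `type` index (key_len 1), so each derived table still has to filesort for its GROUP BY. I haven't tried it yet, but a composite index matching the grouping order seems like it should help both the subqueries and the outer lookups:

ALTER TABLE `data`
    ADD KEY `type_date` (`type`,`year`,`month`,`day`,`hour`,`5min`);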

Now, what if I wanted to do this but include several other columns (i.e. do EX: B, but using "newest unique" values)? I've had a lot of difficulty coming up with solutions, and the ones I've tried so far take much longer to run. One attempt took just under 11 seconds with the same data as above (6 days' worth), and the more columns I add, the worse it becomes: EX: B converted to use "newest unique" values takes 21.3 seconds to run with those same 6 days of data: [EX: D] - see the bottom of this post for a profiling breakdown and EXPLAIN output for it:
SELECT d1.`day`, d1.`hour`, d2.`value` AS h1, d3.`value` AS h2, d4.`value` AS h3, d5.`value` AS h4, ROUND(AVG(d1.`value`),2) AS 5aAVG
FROM (
    `data` AS d1 JOIN
    (
        SELECT `year`, `month`, `day`, `hour`, `5min`, MAX(`inserted`) AS `inserted`
        FROM `data`
        WHERE `type`='5a' AND `year`=2009 AND `month`=6
        GROUP BY `year`, `month`, `day`, `hour`, `5min`
    ) AS q1
    USING (`year`,`month`,`day`,`hour`,`5min`,`inserted`)
),(
    `data` AS d2 JOIN
    (
        SELECT `year`, `month`, `day`, `hour`, MAX(`inserted`) AS `inserted`
        FROM `data`
        WHERE `type`='h1' AND `year`=2009 AND `month`=6
        GROUP BY `year`, `month`, `day`, `hour`
    ) AS q2
    USING (`year`,`month`,`day`,`hour`,`inserted`)
),(
    `data` AS d3 JOIN
    (
        SELECT `year`, `month`, `day`, `hour`, MAX(`inserted`) AS `inserted`
        FROM `data`
        WHERE `type`='h2' AND `year`=2009 AND `month`=6
        GROUP BY `year`, `month`, `day`, `hour`
    ) AS q3
    USING (`year`,`month`,`day`,`hour`,`inserted`)
),(
    `data` AS d4 JOIN 
    (
        SELECT `year`, `month`, `day`, `hour`, MAX(`inserted`) AS `inserted`
        FROM `data`
        WHERE `type`='h3' AND `year`=2009 AND `month`=6
        GROUP BY `year`, `month`, `day`, `hour`
    ) AS q4
    USING (`year`,`month`,`day`,`hour`,`inserted`)
),(
   `data` AS d5 JOIN
    (
        SELECT `year`, `month`, `day`, `hour`, MAX(`inserted`) AS `inserted`
        FROM `data`
        WHERE `type`='h4' AND `year`=2009 AND `month`=6
        GROUP BY `year`, `month`, `day`, `hour`
    ) AS q5
    USING (`year`,`month`,`day`,`hour`,`inserted`)
)
WHERE d1.`year`=2009 AND d1.`month`=6 AND
    d1.`year`=d2.`year` AND d1.`year`=d3.`year` AND d1.`year`=d4.`year` AND d1.`year`=d5.`year` AND
    d1.`month`=d2.`month` AND d1.`month`=d3.`month` AND d1.`month`=d4.`month` AND d1.`month`=d5.`month` AND
    d1.`day`=d2.`day` AND d1.`day`=d3.`day` AND d1.`day`=d4.`day` AND d1.`day`=d5.`day` AND
    d1.`hour`=d2.`hour` AND d1.`hour`=d3.`hour` AND d1.`hour`=d4.`hour` AND d1.`hour`=d5.`hour` AND
    d1.`type`='5a' AND d2.`type`='h1' AND d3.`type`='h2' AND d4.`type`='h3' AND d5.`type`='h4'
GROUP BY d1.`year`, d1.`month`, d1.`day`, d1.`hour`
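
One rewrite I'm considering (untested sketch, so treat it as a direction rather than a fix): do the "newest row per slot" dedup for all five types in a single derived table, then pivot with CASE, so the base table is read twice instead of ten times. It relies on the hourly rows all storing `5min`=0 (the column default), so grouping by `5min` doesn't split them:

SELECT t.`day`, t.`hour`,
    MAX(CASE WHEN t.`type`='h1' THEN t.`value` END) AS h1,
    MAX(CASE WHEN t.`type`='h2' THEN t.`value` END) AS h2,
    MAX(CASE WHEN t.`type`='h3' THEN t.`value` END) AS h3,
    MAX(CASE WHEN t.`type`='h4' THEN t.`value` END) AS h4,
    ROUND(AVG(CASE WHEN t.`type`='5a' THEN t.`value` END),2) AS 5aAVG
FROM `data` t
JOIN (
    SELECT `type`, `year`, `month`, `day`, `hour`, `5min`, MAX(`inserted`) AS `inserted`
    FROM `data`
    WHERE `year`=2009 AND `month`=6 AND `type` IN ('5a','h1','h2','h3','h4')
    GROUP BY `type`, `year`, `month`, `day`, `hour`, `5min`
) AS q USING (`type`,`year`,`month`,`day`,`hour`,`5min`,`inserted`)
GROUP BY t.`year`, t.`month`, t.`day`, t.`hour`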

One solution I can think of would be to add the hourly averages of the 5-min data into the database in the first place. I wanted to try to avoid that because it seems redundant, but that doesn't solve the "updated-ness" problem.
Another solution is to remove old values when I insert new ones (and/or simply update the old ones in place - sketched below). This may be the best/easiest approach, but I kind of like the idea of being able to access older values if possible. It's not 100% necessary though.
Or perhaps I could do a combination of both, which would eliminate almost all my woes.
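
If I do go the "update the old ones" route, I think the clean way is a unique key over the natural slot plus INSERT ... ON DUPLICATE KEY UPDATE. Untested sketch (the unique key can only be added once the existing duplicates are cleaned out, and `inserted` bumps itself via its ON UPDATE CURRENT_TIMESTAMP clause):

ALTER TABLE `data`
    ADD UNIQUE KEY `slot` (`type`,`year`,`month`,`day`,`hour`,`5min`);

INSERT INTO `data` (`freq`,`type`,`year`,`month`,`day`,`hour`,`5min`,`value`)
VALUES ('5min','5a',2009,6,2,5,5,2000)
ON DUPLICATE KEY UPDATE `value`=VALUES(`value`);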

EDIT:
A last solution might be to leave things the way they are and create another table for "final calculated results", updated infrequently, rather than always computing from the dynamic data table (rough sketch below). An obvious disadvantage is that the calculated results go stale whenever the data table changes.
END EDIT
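
Here's a rough sketch of what that results table and its refresh could look like (untested; `hourly_summary` and its columns are made-up names, and the refresh SELECT would still need the "newest row only" logic from EX: C to average correctly - I left that out to keep the sketch short):

CREATE TABLE `hourly_summary` (
  `type` enum('5a','h1','h2','h3','h4') NOT NULL,
  `year` smallint(6) unsigned NOT NULL,
  `month` tinyint(4) unsigned NOT NULL,
  `day` tinyint(4) unsigned NOT NULL,
  `hour` tinyint(4) unsigned NOT NULL,
  `avg_value` float DEFAULT NULL,
  PRIMARY KEY (`type`,`year`,`month`,`day`,`hour`)
) ENGINE=MyISAM ;

REPLACE INTO `hourly_summary`
SELECT `type`, `year`, `month`, `day`, `hour`, AVG(`value`)
FROM `data`
WHERE `type`='5a' AND `year`=2009 AND `month`=6
GROUP BY `type`, `year`, `month`, `day`, `hour`;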

Anyway, I'm hoping some people here might have other ideas - I'm open to anything, from revising the DB structure (or changing it completely) to different queries.

Thanks!

Profiling output for EX: D
As you can see, the hang-up occurs in the last "Copying to tmp table" step:
starting              0.000234
Opening tables        0.001177
System lock           0.000004
Table lock            0.000215
optimizing             0.00001
statistics            0.000125
preparing             0.000017
Creating tmp table    0.000029
executing             0.000002
Copying to tmp table  0.021957
Sorting result        0.000236
Sending data          0.000669
removing tmp table    0.000004
Sending data          0.000002
optimizing            0.000007
statistics            0.000048
preparing             0.000015
Creating tmp table    0.000033
executing             0.000001
Copying to tmp table  0.001806
Sorting result        0.000027
Sending data          0.000062
removing tmp table    0.000003
Sending data          0.000002
optimizing            0.000006
statistics            0.000041
preparing             0.000012
Creating tmp table    0.000028
executing             0.000001
Copying to tmp table  0.001763
Sorting result        0.000025
Sending data          0.000061
removing tmp table    0.000003
Sending data          0.000002
optimizing            0.000006
statistics            0.000039
preparing             0.000012
Creating tmp table    0.000028
executing             0.000001
Copying to tmp table  0.001767
Sorting result        0.000025
Sending data           0.00006
removing tmp table    0.000004
Sending data          0.000002
optimizing            0.000006
statistics             0.00004
preparing             0.000012
Creating tmp table    0.000027
executing             0.000001
Copying to tmp table  0.001751
Sorting result        0.000024
Sending data          0.000062
removing tmp table    0.000003
Sending data          0.000003
init                  0.000139
optimizing            0.000046
statistics            0.009244
preparing             0.000116
Creating tmp table    0.000033
executing             0.000002
Copying to tmp table 21.400125
Sorting result         0.00003
Sending data          0.000853
end                   0.000002
removing tmp table    0.000007
end                   0.000004
query end             0.000002
freeing items         0.000085
removing tmp table    0.000007
closing tables        0.000001
removing tmp table    0.000004
closing tables        0.000001
removing tmp table    0.000015
closing tables        0.000001
removing tmp table    0.000003
closing tables        0.000001
removing tmp table    0.000004
closing tables        0.000005
logging slow query    0.000001
logging slow query    0.000002
cleaning up           0.000008

EXPLAIN output for EX: D:
+----+-------------+------------+------+---------------------+------+---------+-------+------+----------------------------------------------+
| id | select_type | table      | type | possible_keys       | key  | key_len | ref   | rows | Extra                                        |
+----+-------------+------------+------+---------------------+------+---------+-------+------+----------------------------------------------+
|  1 | PRIMARY     | d1         | ref  | type,year,month,day | type | 1       | const | 7054 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | <derived5> | ALL  | NULL                | NULL | NULL    | NULL  |   94 | Using where; Using join buffer               |
|  1 | PRIMARY     | <derived4> | ALL  | NULL                | NULL | NULL    | NULL  |   95 | Using where; Using join buffer               |
|  1 | PRIMARY     | <derived6> | ALL  | NULL                | NULL | NULL    | NULL  |   93 | Using where; Using join buffer               |
|  1 | PRIMARY     | d5         | ref  | type,year,month,day | type | 1       | const |  562 | Using where                                  |
|  1 | PRIMARY     | d3         | ref  | type,year,month,day | type | 1       | const |  564 | Using where                                  |
|  1 | PRIMARY     | d4         | ref  | type,year,month,day | type | 1       | const |  564 | Using where                                  |
|  1 | PRIMARY     | d2         | ref  | type,year,month,day | type | 1       | const |  565 | Using where                                  |
|  1 | PRIMARY     | <derived3> | ALL  | NULL                | NULL | NULL    | NULL  |   96 | Using where; Using join buffer               |
|  1 | PRIMARY     | <derived2> | ALL  | NULL                | NULL | NULL    | NULL  | 1152 | Using where; Using join buffer               |
|  6 | DERIVED     | data       | ref  | type,year,month     | type | 1       |       |  562 | Using where; Using temporary; Using filesort |
|  5 | DERIVED     | data       | ref  | type,year,month     | type | 1       |       |  564 | Using where; Using temporary; Using filesort |
|  4 | DERIVED     | data       | ref  | type,year,month     | type | 1       |       |  564 | Using where; Using temporary; Using filesort |
|  3 | DERIVED     | data       | ref  | type,year,month     | type | 1       |       |  565 | Using where; Using temporary; Using filesort |
|  2 | DERIVED     | data       | ref  | type,year,month     | type | 1       |       | 7054 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------+---------------------+------+---------+-------+------+----------------------------------------------+



Edited 5 time(s). Last edit at 06/10/2009 10:09AM by Mike Cook.
