MySQL Forums
Forum List  »  General

Trouble Optimizing a query - ex. time seems to increase exponentially
Posted by: michael cook
Date: December 17, 2014 10:02AM

First, to explain the data I'm querying I'll describe it in terms of weather. I have 62 locations that provide a forecast every hour for the subsequent rolling 24 hour window. I also have an actual (non-forecast) value for each hour as well. I store all of this data in the following table:

CREATE TABLE `weather_data` (
  `uID` int(10) unsigned NOT NULL,
  `LocID` enum('Loc1','Loc2',...,'Loc61','Loc62') NOT NULL,
  `type` enum('FC','ACT') NOT NULL,
  `value` decimal(7,2) DEFAULT NULL,
  `pts` datetime NOT NULL,
  `ptshour` mediumint(8) unsigned NOT NULL,
  `tshour` mediumint(8) unsigned NOT NULL,
  `ts` datetime NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Where 'uID' is a unique ID for each value, 'LocID' is the location, 'type' is whether the value is a forecast or actual, 'value' is the value of the forecast, 'pts' is "Published Time Stamp" (when the value was published), and 'ts' is "Time Stamp" (the hour for which the value is applicable). Also, for the purpose of some of the queries I do (explained later), ptshour and tshour are the # of hours since an arbitrary start date for the published and effective hour.

Also, as you have probably realized, this table is quite large (currently sitting at 15.7M records) and grows at something like 1500 records per hour.

That table has the following indices:
ALTER TABLE `weather_data`
  ADD UNIQUE KEY `uID` (`uID`),
  ADD KEY `ts` (`ts`),
  ADD KEY `type` (`type`),
  ADD KEY `pts` (`pts`),
  ADD KEY `LocID` (`LocID`,`ts`),
  ADD KEY `tsptshour` (`ptshour`,`tshour`);

As alluded to above, the ptshour and tshour are incredibly useful fields for doing things like this:
SELECT ts,value FROM weather_data WHERE LocID='Loc1' AND ts>='2014-12-01' AND ts<'2014-12-02' AND tshour=ptshour+5 ORDER BY ts

which will give me the forecast for every hour of Dec 1st as published 5 hours prior (ie 00:00 published at 21:00, 01:00 published at 22:00, etc) and runs very quickly as long as I keep the filter parameters fairly small. for example I just ran a month for one location in 0.09s. I could have also added "AND type='FC'" but in this case that is implied because of the tshour=ptshour+5 constraint (ptshour=tshour+1 for all type='ACT' records)

So here's where my problem comes in. I'm looking for a very specific set of data:
for hour X:
-the forecast as determined 3 hours prior (fc3 ['fcast 3hrs prior'])
-the forecast for the subsequent 3 hours (hp1, hp2, hp3 ['hour plus 1, 2 or 3'])
-the actual value ('act')

Individually, the queries for each for Dec 1st are:
fc3:
SELECT fc3.value FROM weather_data fc3 WHERE fc3.LocID='Loc1' AND fc3.ts>='2014-12-01' AND fc3.ts<'2014-12-01' AND fc3.tshour=fc3.ptshour+3
hp1:
SELECT hp1.value FROM weather_data hp1 WHERE hp1.LocID='Loc1' AND hp1.ts>='2014-12-01 01:00:00' AND hp1.ts<'2014-12-01 01:00:00' AND hp1.tshour=hp1.ptshour+1
hp2:
SELECT hp2.value FROM weather_data hp2 WHERE hp2.LocID='Loc1' AND hp2.ts>='2014-12-01 02:00:00' AND hp2.ts<'2014-12-01 02:00:00' AND hp2.tshour=hp2.ptshour+2
hp3:
SELECT hp3.value FROM weather_data hp3 WHERE hp3.LocID='Loc1' AND hp3.ts>='2014-12-01 03:00:00' AND hp3.ts<'2014-12-01 03:00:00' AND hp3.tshour=hp3.ptshour+3
act:
SELECT act.value FROM weather_data act WHERE act.LocID='Loc1' AND act.ts>='2014-12-01' AND act.ts<'2014-12-02' AND act.tshour=act.ptshour-1

Note: 'act' could use "act.type='ACT'" instead of "act.tshour=act.ptshour-1"...
Also, for each of the hpX data sets, the data is shifted by X hours, hence the hours added to the start and end time constraints

So individually, each of these queries works great as described previously. The problem comes when I try to join them all together. The following query works - but seems to take exponentially longer the more data I pull. One day takes less than a second, but 5 days takes a minute and a half. Beyond that I haven't let it continue running long enough to find out how long it actually takes because even 1 minute is too much but the end goal is to be able to grab a month at a time.
SELECT
	CAST(fc3.ts AS DATE) AS 'date',
	DAYOFYEAR(fc3.ts)*24+HOUR(fc3.ts)-23 AS 'hrCnt',
	fc3.value as 'fc3',
	hp1.value as 'hp1',
	hp2.value as 'hp2',
	hp3.value as 'hp3',
	act.value as 'act'
FROM
	weather.weather_data fc3,
	weather.weather_data hp1,
	weather.weather_data hp2,
	weather.weather_data hp3,
	weather.weather_data act
WHERE
	fc3.LocID='Loc1' AND 
	fc3.tshour=fc3.ptshour+3 AND 
	fc3.ts>='2014-12-01' AND 
	fc3.ts<'2014-12-02' AND 
	hp1.LocID='Loc1' AND 
	hp1.tshour=hp1.ptshour+1 AND 
	hp1.ts>='2014-12-01 01:00:00' AND 
	hp1.ts<'2014-12-02 01:00:00' AND 
	hp2.LocID='Loc1' AND 
	hp2.tshour=hp2.ptshour+2 AND 
	hp2.ts>='2014-12-01 02:00:00' AND 
	hp2.ts<'2014-12-02 02:00:00' AND 
	hp3.LocID='Loc1' AND 
	hp3.tshour=hp3.ptshour+3 AND
	hp3.ts>='2014-12-01 03:00:00' AND 
	hp3.ts<'2014-12-02 03:00:00' AND 
	act.LocID='Loc1' AND 
	act.tshour=act.ptshour-1 AND 
	act.ts>='2014-12-01' AND 
	act.ts<'2014-12-02' AND 
	fc3.tshour=hp1.tshour-1 AND 
	fc3.tshour=hp2.tshour-2 AND 
	fc3.tshour=hp3.tshour-3 AND 
	fc3.tshour=act.tshour

Note that the last four constraints allow me to match the rows returned from each of the individual joins together (including a reverse offset for the hpX values)

the "EXPLAIN SELECT" for the above says:
+----+-------------+-------+-------+---------------+-------+---------+------+------+---------------------------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra                                                                     |
+----+-------------+-------+-------+---------------+-------+---------+------+------+---------------------------------------------------------------------------+
|  1 | SIMPLE      | hp1   | range | ts,LocID      | LocID | 6       | NULL |  581 | Using index condition; Using where                                        |
|  1 | SIMPLE      | hp2   | range | ts,LocID      | LocID | 6       | NULL |  584 | Using index condition; Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | fc3   | range | ts,LocID      | LocID | 6       | NULL |  579 | Using index condition; Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | act   | range | ts,LocID      | LocID | 6       | NULL |  579 | Using index condition; Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | hp3   | range | ts,LocID      | LocID | 6       | NULL |  587 | Using index condition; Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+-------+---------------+-------+---------+------+------+---------------------------------------------------------------------------+
and the "rows" column increases by about 580 per day

I'm hoping you might have some suggestions or solutions that can help. I'm open to ideas about restructuring the table itself, but would rather avoid it because of the effort involved with translating the format and adjusting all the tooling currently used to fill it.

Thank you for any help or insight you can provide!






PS: as an aside, but the end result actually uses the above as a subquery. The full version would be joined to a daily data table (as opposed to hourly) as follows:
SELECT
	sq.*,
	d.daily_data_column,
	d.other_daily_data_column
FROM	
	(<ABOVE QUERY>) AS sq,
	otherdb.dailydata d
WHERE
	d.ts=sq.date AND 
	d.ts>='2014-12-01' AND 
	d.ts<'2014-12-02'
ORDER BY
	sq.hrCnt


but let's focus on the subquery first, as it appears to be slowing the works down. Though I did notice that the EXPLAIN results are a bit scary looking (see row 2). I assume that's not a real number:
+----+-------------+------------+-------+---------------+-------------+---------+-----------------+----------------+---------------------------------------------------------------------------+
| id | select_type | table      | type  | possible_keys | key         | key_len | ref             | rows           | Extra                                                                     |
+----+-------------+------------+-------+---------------+-------------+---------+-----------------+----------------+---------------------------------------------------------------------------+
|  1 | PRIMARY     | d          | range | ts            | ts          | 3       | NULL            |              1 | Using index condition;Using temporary; Using filesort                     |
|  1 | PRIMARY     | <derived2> | ref   | <auto_key0>   | <auto_key0> | 4       | otherdb.d.ts    | 22350208070516 | NULL                                                                      |
|  2 | DERIVED     | hp3        | range | ts,type,LocID | LocID       | 6       | NULL            |           2856 | Using index condition;Using where                                         |
|  2 | DERIVED     | hp2        | range | ts,type,LocID | LocID       | 6       | NULL            |           2857 | Using index condition;Using where; Using join buffer (Block Nested Loop)  |
|  2 | DERIVED     | hp1        | range | ts,type,LocID | LocID       | 6       | NULL            |           2858 | Using index condition;Using where; Using join buffer (Block Nested Loop)  |
|  2 | DERIVED     | fc3        | range | ts,type,LocID | LocID       | 6       | NULL            |           2859 | Using index condition;Using where; Using join buffer (Block Nested Loop)  |
|  2 | DERIVED     | act        | range | ts,type,LocID | LocID       | 6       | NULL            |           2859 | Using index condition;Using where; Using join buffer (Block Nested Loop)  |
+----+-------------+------------+-------+---------------+-------------+---------+-----------------+----------------+---------------------------------------------------------------------------+

Options: ReplyQuote




Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.