Partitioning and Datawarehouse time dimension
I am having trouble with the optimizer choosing the correct partitions in my very large fact table - any help or advice would be much appreciated.
I have a large time series fact table (100+ million rows, called fact_summary) in which I store collected metrics. I have partitioned this table based upon a time key. The time key is a reference into my time dimension (dim_time). However - I am finding the format of the queries on which the optimizer is able to do partition pruning very restrictive.
Details are as follows:
My fact table is partitioned by day, using a range partition on time_key. The time key points to the dim_time table. dim_time has one row per hour and many columns, mostly standard warehouse time dimension values such as day of week, hour of day etc. My fact table holds network management data measurements and users will query it using a reporting tool, wanting answers to questions such as 'How much traffic passed over this link yesterday?'
Queries where I placed a join to the dim_time and use a daily identifier in the where clause appear not to work as I would hope (all partitions are scanned). This query has the SQL format I would like but unfortunately looks at all partitions. It seems that the lookup to de-reference the day_id into a time_key is causing the optimizer to miss out on selecting the correct partitions. It references a single day identifier which should match rows which all sit within the same partition in fact_summary. However, an explain of the query tells me it is hitting all daily partitions on fact_summary.
Query 1
select sum(packets_in) from fact_summary, dim_time where fact_summary.time_key = dim_time.time_key and dim_time.day_id = '01D20070926'
Similarly, this version of the same query (written with a sub query to return a list of possible time keys) also scans all partitions:
Query 2
select sum(packets_in) from fact_summary where fact_summary.time_key in
(select time_key from dim_time where dim_time.day_id = '01D20070925')
But - the reason Query 2 doesn't work is because the list of times returned from my time dimension is evaluated using an IN. Although it is of no use to me, I note that if I provide an explicit list of 24 values for the time_key (as I have done here in Query 3) the optimizer picks up the right partitions !
Query 3
select sum(packets_in) from fact_summary where fact_summary.time_key in (200709260000, 200709260100, 200709260200 , 200709260300, 200709260400, 200709260500, 200709260600, 200709260700, 200709260800, 200709260900, 200709261000, 200709261100, 200709261200, 200709261300, 200709261400, 200709261500, 200709261600, 200709261700, 200709261800, 200709261900, 200709262000, 200709262100, 200709262200, 200709262300)
Finally, when the where clause contains max and min sub-queries for the time_key, which are then easily checked by the optimizer against the range limits of the partitions, I do get the correct partition selection:
Query 4
explain partitions select sum(packets_in) from fact_summary, dim_time where fact_summary.time_key >= (select min(time_key) from dim_time where dim_time.day_id = '01D20070926')
and fact_summary.time_key < (select max(time_key) from dim_time where dim_time.day_id = '01D20070926')
My issue is that I have very little control over how my queries are formed. I Would have thought that Query 1 is a very standard way for tables forming a data warehouse to be used and yet the MySQL 5.1.23 database doesn't seem to be able to support partitioning in that context. Am I missing something here or where can I get further advice on why queries 1 and 2 don't work as expected ?
Thanks - Damian Horner