covariance calculation - need help optimizing query
Hi,
I have time series data for about 500 entities and I am trying to evaluate its covariance matrix. The data is in a table (Observations)with the following format
name, char(20)
observationType, enum {'daily','weekly','monthly'}
date, date
values, double
mean1, double
mean2, double
covar, double
The query is
select avg(a.values) as mean1, avg(b.values) as mean2,
avg(a.values * b.values) - avg(a.values) * avg(b.values) as covar
max(a.date) as timeStamp
from Observations a, Obseravtions b where
STRCMP(a.name,b.name)<=0 and
a.date=b.date and
a.observationType=b.ObservationType
group by a.name, b.name
With about a million records in the Observations table, it takes about
2 hours to run this query and it results in about 125,000 rows.
There are about 500 (=485) unique names for the column name. The
condition STRCMP(a.name,b.name)<=0 is used as covariance(m,n) is the
same as covariance(n,m). This is on a 1.3GHz intel machine.
I have a joint index on observationType and date.
Explain table says that it does use those indices and for table a
it gives about 200K rows and for table b it gievs about 300 rows.
I am wondering if this query could be improved ?
- My hunch is that the STRCMP function may be evaluated for each combination of data point while evaluating avg(a.values * b.values)
and thus maybe slowing down the query.
I am wondering if there are other ways to speed up this query ?
Thanks,
Harsha