Re: Adding index kills query
After several suggestions, I have now fixed two issues with my original code: (a) I added primary keys to all tables; (b) I changed the Transaction Amount variable from float to decimal.
To clarify a few issues, this database does not change - it was provided to me to analyze, so there are no writes to it. It lives on a standalone computer, I am the only user, and I have generally only run one process against the database at a time. I would like to perform a variety of analyses on the main table, which has 350 million rows, including summing by various category variables, looking for patterns within those sums, etc.
In an effort to understand the problem with the query performance, I ran a series of tests, which involved creating a table, Trans_smp_micro, consisting of a subset of the original table of various sizes, and running two queries:
select iCompanyID, sum(dcPurchaseAmt)
FROM Trans_smp_micro
GROUP BY iCompanyID
;
select iCompanyID, sum(dcPurchaseAmt)
FROM Trans_smp_micro
ignore index (idxCompBrand)
GROUP BY iCompanyID
;
I ran this query on subsets varying from 30K to 19M rows, using both the InnoDB and MyISAM engines, and found the following patterns (all performance reported as execution time seconds per million rows):
InnoDB ignoring index: 1.5 s/M for less than 1M rows, then jumping to 4.5 s/M up to 4M rows. I have never run the query on bigger sets to completion
InnoDB using index: increasing from 3.0 s/M at 30K to 3.9 s/M at 500K, then jumping to 238 s/M at 1M, 850 s/M at 2m, and 1,257 s/M at 4M. I have never run the query on bigger sets to completion.
MyISAM ignoring index: 0.9 s/M up to 17M rows, then jumping to 1.8 s/M for 19M rows
MyISAM using index: 0.9 s/M up to 17M rows, then jumping to 1,450 s/M for 19M rows
Several observations:
- With both engines, both using and ignoring the index, the performance remains relatively stable until the row count hits a threshold, at which point the performance deteriorates. I assume this is because the system runs out of a resource, probably RAM allocated to the sort function, and changes the approach.
- MyISAM is always faster than InnoDB for the same size, and the performance with MyISAM does not deteriorate until I get to much bigger sets.
- With MyISAM, the performance using and ignoring the index is roughly identical until the system hits the threshold, at which point the execution time doubles ignoring the index, but increases by a factor of more than 1000 using the index.
- with InnoDB, ignoring the index is always at least twice as fast as using the index, and the performance wall is much smaller: execution time doubles without the indexs, while it increases by factor of hundreds using the index
Which leaves me with a few questions:
- Why is using an index not helping, and sometimes badly hurting, the performance?
- Are there resource settings that I can adjust to avoid the performance threshold? Ultimately, once I'm done testing and tuning, I need to run a series of queries on the full set, with 350M rows.