Query optimisation advice
Posted by:
Paul Feely
Date: January 08, 2015 11:02PM
I've having difficulty optimising a query
I have a table of 2.5 million participants with their particpant_id, postcode, age, gender etc
Participants can be included multiple times if they were at different clubs during the year
Each sport in the table can have between 400000 and 800000 rows (Not sure if worth separating?)
I need to convert postcode to LGA (Local Government Area), which is achieved by a table with weighted estimates of how much each postcode extends into each LGA
The join to the abs_postcode_lga table appears to be performing poorly.
The following query takes about 50 seconds
SELECT lga_name,lga_code,abs_postcode_lga.postcode,ratio,
COUNT(DISTINCT participant_id),
COUNT(DISTINCT participant_id)*ratio AS postcode_total,
COUNT(DISTINCT CASE WHEN `year` = 2010 THEN participant_id END)*ratio as postcode_total_start_year,
COUNT(DISTINCT CASE WHEN `year` = 2013 THEN participant_id END)*ratio as postcode_total_end_year
FROM ssa_participant
LEFT JOIN abs_postcode_lga ON abs_postcode_lga.postcode = ssa_participant.postcode
WHERE sport_id_fk = 18
AND `year` >= 2010
AND `year` <= 2013
AND abs_postcode_lga.postcode IS NOT NULL
GROUP BY lga_name,lga_code,abs_postcode_lga.postcode,ratio
EXPLAIN
1 SIMPLE abs_postcode_lga ALL PRIMARY 4049 Using where; Using filesort
1 SIMPLE ssa_participant ref year,sport_id_fk,postcode postcode 15 sports_spatial.abs_postcode_lga.postcode 195 Using where
I've tried using index hints and I've got it down to 11 seconds, but I thought in the past I had been able to get it to perform in 2-3 seconds
With index hints (~11-17 seconds)
SELECT lga_name,lga_code,abs_postcode_lga.postcode,ratio,
COUNT(DISTINCT participant_id),
COUNT(DISTINCT participant_id)*ratio AS postcode_total,
COUNT(DISTINCT CASE WHEN `year` = 2010 THEN participant_id END)*ratio as postcode_total_start_year,
COUNT(DISTINCT CASE WHEN `year` = 2013 THEN participant_id END)*ratio as postcode_total_end_year
FROM ssa_participant USE INDEX (`year`,`sport_id_fk`)
LEFT JOIN abs_postcode_lga ON abs_postcode_lga.postcode = ssa_participant.postcode
WHERE sport_id_fk = 18
AND `year` >= 2010
AND `year` <= 2013
AND abs_postcode_lga.postcode IS NOT NULL
GROUP BY lga_name,lga_code,abs_postcode_lga.postcode,ratio
EXPLAIN
1 SIMPLE ssa_participant ref year,sport_id_fk sport_id_fk 1 const 1202214 Using where; Using temporary; Using filesort
1 SIMPLE abs_postcode_lga ref PRIMARY PRIMARY 14 sports_spatial.ssa_participant.postcode 1 Using where
Table structures
Can anyone spot some efficiency gains?
Structure
abs_post_lga
CREATE TABLE `abs_postcode_lga` (
`postcode` varchar(4) NOT NULL,
`lga_code` varchar(5) CHARACTER SET latin1 NOT NULL,
`lga_name` varchar(100) CHARACTER SET latin1 NOT NULL,
`ratio` decimal(10,8) NOT NULL,
PRIMARY KEY (`postcode`,`lga_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
4049 rows
ssa_participant
CREATE TABLE `ssa_participant` (
`sport_id_fk` tinyint(4) NOT NULL,
`year` varchar(4) NOT NULL DEFAULT '',
`participant_id` varchar(15) NOT NULL DEFAULT '',
`date_of_birth` date DEFAULT NULL,
`age` tinyint(4) DEFAULT NULL,
`sex` varchar(6) DEFAULT NULL,
`postcode` varchar(4) DEFAULT NULL,
`club` varchar(70) DEFAULT '',
`program` varchar(70) DEFAULT '',
KEY `year` (`year`) USING BTREE,
KEY `participant_id` (`participant_id`) USING BTREE,
KEY `sport_id_fk` (`sport_id_fk`) USING BTREE,
KEY `postcode` (`postcode`) USING BTREE,
KEY `club` (`club`) USING BTREE,
KEY `program` (`program`) USING BTREE,
KEY `age` (`age`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
~2.5 million rows