I have two ocean data sets (time series) in a MySQL DB:
Table A - Capture
cap_id| yr_c| sp_c| iso_c| area_c| qty_c | 5yr_avg_c |
3| 2015| TRR| 54| 8| 120 | 555.0 |
678| 2015| BOM| 62| 27| 0.0 | 30 |
20| 2015| TRR| 54| 27| 0.0 | 555.0 |
45| 2015| FRC| 7| 15| 86800 | 72000 |
Table B - Aquaculture
aqua_id| yr_a| sp_a| iso_a| area_a| qty_a | 5yr_avg_a |
78| 2015| OTS| 32| 27| 6868 | 7590
333| 2015| FRC| 7| 15| 550 | 634
789| 2015| TRR| 54| 27| 45000 | 47780
987| 2015| TRR| 32| 27| 40 | 47780
(n.b.: data for other years in the tables follows the same pattern)
I need to calculate total values and a moving average for a single species in a given year.
So first I combined Table A and B using a simulated FULL OUTER JOIN (left-right-union all) to generate Table C.
I then summed the qty_c and qty_a columns to get the global total (qty_t)
UPDATE global
SET qty_t = IFNULL(qty_c, 0) + IFNULL(qty_a, 0)
WHERE id > 0;
Table C - Global
id| cap_id| aqua_id| yr_c| yr_a| sp_c| sp_a| iso_c| iso_a|area_c|area_a| qty_c| qty_a | qty_t | 5yr_avg_t |
1 | 20| 789| 2015| 2015| TRR| TRR| 54| 54| 27| 27| 0.0| 45000 | 45000 |
2 | 45| 333| 2015| 2015| FRC| FRC| 7| 7| 15| 15| 86800| 550 | 87350 |
3 | 678| NULL| 2015| NULL| BOM| NULL| 62| NULL| 27| NULL| 0.0| NULL | 0.0 |
4 | 3| NULL| 2015| NULL| TRR| NULL| 54| NULL| 8| NULL| 120| NULL | 120 |
5 | NULL| 78| NULL| 2015| NULL| OTS| NULL| 32| NULL| 27| NULL| 6868 | 6868 |
6 | NULL| 987| NULL| 2015| NULL| TRR| NULL| 32| NULL| 27| NULL| 40 | 40 |
Where I get bogged is in calculating the moving average
A year and species value exists for every record, but when suming across five year sub-sets for a given species the effect of the variability in the data is that every species may not have a concurrent 5 year set of quantity values.
To calculate the 5-yr moving average for Table A and Table B I used:
Table A:
SELECT
x.cap_id,
x.sp_c,
x.yr_c,
AVG(y.qty_c) AS `5yr_avg_c`
FROM capture AS x
JOIN capture AS y ON x.sp_c = y.sp_c
AND x.yr_c BETWEEN y.yr_c AND y.yr_c + 5
AND yr_c > 0
AND x.sp_c <> 'NULL'
GROUP BY x.sp_c, x.yr_c DESC;
Table B:
SELECT
x.aqua_id,
x.sp_a,
x.yr_a,
AVG(y.qty_a) AS `5yr_avg_a`
FROM aquaculture AS x
JOIN aquculture AS y ON x.sp_a = y.sp_a
AND x.yr_a BETWEEN y.yr_a AND y.yr_a + 5
AND yr_a > 0
AND x.sp_a <> 'NULL'
GROUP BY x.sp_a, x.yr_a DESC;
My challenge now is that the five year moving average query I used for Tables A and B does not work for Table C
Initially I substituted qty_t for qty_c and qty_a respectively in the two queries
But the moving average calculated for the qty_t is very different for the same species in the same year depending on if I use (yr_c and sp_c) or (yr_a and sp_a) columns in Table C.
So then I tried:
SELECT
x.id,
x.sp_c,
x.yr_c,
AVG(y.qty_t) AS `5yr_avg_t`
FROM global AS x
JOIN global AS y ON x.sp_c = y.sp_c
AND x.yr_c BETWEEN y.yr_c AND y.yr_c + 5
AND yr_c > 0
AND x.sp_a <> 'NULL'
WHERE x.yr_c = x.yr_a
OR x.yr_a IS NULL
GROUP BY x.sp_c, x.yr_c DESC;
This gets me an average for every Table A record in Table C consistent with the five year average calculated in Table A
To get the average for the remaining Table B records in Table C (but avoid calculating inconsistent averages) I thought I could use the query:
SELECT
x.id,
x.sp_c,
x.yr_c,
AVG(y.qty_t) AS `5yr_avg_t`
FROM global AS x
JOIN global AS y ON x.sp_c = y.sp_c
AND x.yr_c BETWEEN y.yr_c AND y.yr_c + 5
AND yr_c > 0
AND x.sp_c <> 'NULL'
WHERE x.yr_c <> x.yr_a
AND x.yr_a IS NOT NULL
AND x.yr_c IS NULL
GROUP BY x.sp_c, x.yr_c DESC;
But really not confident I have the right solution for the combined Table C data?
Or if a different table structure is what I need?
My other challenge is that I do not exactly understand the syntax "year_c > 0" and "year_a > 0"
Is it saying:
"if there is no quantity record for a given species in a given year in the five year sequence assume the catch value = 0.
Or is it saying:
"assume there is no catch value"?
I need to understand this because the demoninator matters.
If the average operation is dividing by 5 irrespective of whether there are 5 sequential years - the average will be lower.
If it is dividing by the number of years in the five year sequence where there is a year value (e.g. only 3 of the 5 years in a five year sequence have a quantity value) the 5 year average is actually a 3 year one - in which case the average will be higher.
Help very welcomed, Thanks