Thanks. To clarify, income# comes from data1 and geo# comes from data2. And the subsample contains no id code, just companyid, since there could be several identical companyid variables (and therefore multiple id codes) within the "data#" tables. data# tables all have identical primary keys "id" and a companyid column, which is indexed. This last bit is redundant, obviously, but it's done for usability reasons.
table#.id <--1:1--> table#.id with the same number of rows
subsample.companyid <--1:n--> table#.companyid with subsample never containing a companyid that doesn't exist in table#.
To fill in your example:
SELECT data1.id, data1.companyid, data1.date, data1.income6,
data2.geo1,
data2.geo2
FROM subsample s
JOIN data1 ON s.companyid = data1.companyid
JOIN data2 ON data1.id=data2.id
So this optimal? With each data# table sometimes containing over 1000 variables, I don't really know what's best. One additional consideration is that I'd like to be able to create output from 3 or 4 data tables, so there would be a couple more joins at the end.
Someone suggested the following, but I don't know enough about the mechanics of what MySQL is doing to know. Is there any benefit to joining subsample to data1 and data2 separately and then joining the two outcomes?
SELECT * FROM
(SELECT id, subsample.companyid, date, income6
FROM subsample JOIN data1
ON subsample.companyid = data1.companyid) A
JOIN
(SELECT id as id2, geo1, geo2
FROM subsample JOIN data2
ON subsample.companyid = data2.companyid) B
ON A.id=B.id2
This may be pointless, but It's been difficult to benchmark.
Thanks for your help.