MySQL Forums
Forum List  »  Perl

counting and subselects
Posted by: Joseph Kalfsbeek
Date: September 19, 2007 04:48PM

Not quite sure which forum to post in, but this looks like a good start.

I created a thread in a different forum at: http://forums.devshed.com/mysql-help-4/obtaining-matching-rows-only-if-at-least-10-rows-match-475232.html#post1897816

You can find some table structures in that thread if you want to see them.

Here is the basic issue
1) Need to find sites that have X number of samples with medium_cd = H
2) Need to find sites that have X number of results (parameters) in the group_cd of C

A sample can have multiple results.
I can easily find sites meeting the first criteria. The problem I run into is that by implementing the second criteria, that can remove matching samples because the sample doesn't have enough results, which in turn can cause a site to not match the first criteria because it no longer has X samples.

I have tried doing this query in two separate queries, and by doing one large subselect query. In both cases it is not working.

If I cannot find some way to do this in a query, I am looking at trying some type of perl coding to first get the sites and samples meeting criteria 1, then doing the same for criteria 2, then do a comparison in Perl to see what sites will meet both criteria.


Here is the query presently. Its a 2 parter in that the first query finds all of the sites that have at least X number of samples with a medium code of H. Those sites are then used in the second query to find the sites that have X number of results with a parameter code in the group of C.

Right off, I have found a flaw in this logic in that the first query does return a valid list of sites, but when they go into the second query, the resultant set of sites returned is no longer really bound by the first query of needing X number of samples with a medium of H. For example: One site has 10 samples matching the full criteria, another 14, and another 17. However, by just looking at the partial criteria of X number of samples with medium code of H, those same 3 sites have 48, 87 and 89 samples matching, so they are always returned by the first query.



First SQL:
SELECT qwsam.site_id, qwsam.sample_id FROM QW_SAMPLE qwsam, SITEFILE s
WHERE ( s.site_id = qwsam.site_id )
AND ( qwsam.sample_web_cd = 'Y' )
AND (s.state_cd = '12' OR s.district_cd = '122' OR s.district_cd = '123' OR s.district_cd = '124' OR s.district_cd = '125' )
AND (s.site_web_cd = 'Y')
AND (qwsam.sample_start_dt >= '2000-01-01')
AND (qwsam.sample_start_dt <= '2007-09-14')
AND (qwsam.medium_cd = 'H')
GROUP BY site_id HAVING count(*) >= 10


Second SQL:
SELECT DISTINCT SQL_CALC_FOUND_ROWS s.*, p.rt_bol,
p.discharge_begin_date, p.discharge_end_date, p.discharge_count_nu, p.peak_begin_date, p.peak_end_date,
p.peak_count_nu, p.qw_begin_date, p.qw_end_date, p.qw_count_nu, p.gw_begin_date, p.gw_end_date, p.gw_count_nu,
p.meas_q_count_nu, p.meas_q_begin_date, p.meas_q_end_date, p.por_md
FROM POR p
LEFT JOIN SITEFILE s using (site_id)LEFT JOIN QW_POR qwpor using (site_id)
LEFT JOIN QW_SAMPLE qwsam using (site_id) LEFT JOIN QW_RESULT qwres using (sample_id) JOIN QW_PARAM_GROUP qpg
WHERE (s.state_cd = '12' OR s.district_cd = '122' OR s.district_cd = '123' OR s.district_cd = '124' OR s.district_cd = '125' )
AND (s.site_web_cd = 'Y')
AND (qwres.site_id in (331856,332142,334550))
AND ( qpg.parameter_cd = qwres.parameter_cd)
AND ( qpg.group_cd in ('C'))
AND ( qwres.dqi_cd in ('A','S','R'))
AND ( qwres.result_web_cd = 'Y' )
GROUP BY s.site_id HAVING count(*) >= 10 LIMIT 0, 20000


A few notes. We have tried to use a subselect to join both QW_SAMPLE and QW_RESULT and do this in one query, but the performance for that was not acceptable, nor did it return the corect results anyways.
Here is the subselect query we tried:

SELECT DISTINCT SQL_CALC_FOUND_ROWS s.site_id, s.site_no , p.rt_bol, p.qw_begin_date, p.qw_end_date,
p.qw_count_nu, p.por_md, qwsam.medium_cd, qwsam.sample_start_dt, count(*) as myCount
FROM POR p
LEFT JOIN SITEFILE s using (site_id)
LEFT JOIN QW_SAMPLE qwsam using (site_id)
LEFT JOIN QW_POR qwpor using (site_id)
WHERE (s.site_web_cd = 'Y')
AND qwpor.date_count_nu >= 10
AND ( qwsam.sample_web_cd = 'Y' )
AND (s.state_cd = '12' OR s.district_cd = '122' OR s.district_cd = '123' OR s.district_cd = '124' OR s.district_cd = '125' )
AND (p.qw_count_nu >= 10)
AND (p.qw_end_date >= '2000-01-01'
AND p.qw_begin_date <= '2007-01-01')
AND qwsam.medium_cd in ('H')
AND qwsam.sample_start_dt >='2000-01-01'
AND qwsam.sample_start_dt <= '2007-07-25'
AND qwsam.sample_id in (select qwres.sample_id from QW_RESULT qwres, QW_PARAM_GROUP qpg
WHERE (qpg.parameter_cd = qwres.parameter_cd)
AND ( qpg.group_cd in ('C'))
AND ( qwres.result_web_cd = 'Y' )
AND ( qwres.dqi_cd in ('A','S','R')) )

group by s.site_id having myCount > 10

We did find an web page that seemed to indicate subselects do not use indexes .. which if true could point to why they perform so slow
http://allgeekallthetime.blogspot.com/2006/08/mysql-5-views-vs-subselects.html

Right now we are also looking at creating a view to do some of this kind of processing, but I am not sure that is the way to go yet.

Hopefully this makes some sense .. basically I want to develop a query that returns me the sites that have X number of samples with a medium_cd = H, X number of results of a parameter_cd which has a group_cd = C and is within the date range specified (the other stuff like *_web_cd and dqi are just constraints on if that data should be displayed or not).

If you look at the thread I posted at the top I have simplified the query in the hopes that would make it easier to comprehend.

If anyone has thoughts or things to explore, please let me know.

Thanks



Edited 1 time(s). Last edit at 09/20/2007 12:24PM by Joseph Kalfsbeek.

Options: ReplyQuote


Subject
Written By
Posted
counting and subselects
September 19, 2007 04:48PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.