Ken,
For your performance issue, it would be interesting to take out the distinct on your subquery and see if that has any effect. You definitely want distinct on the outer query to reduce the results, but on the subquery version, it may be preventing the query optimizer from taking some other path that may improve performance.
If I were doing this, I'd probably write it using a HAVING clause, which allows you to apply a condition against the value of one of the aggregate functions such as MIN and MAX. With that, you should be able to use:
SELECT institution FROM renewals GROUP BY institution HAVING MIN(snap_date) >= '2011-07-01'
The "group by" reduces things to unique values, so no distinct is needed. If you have any values before that date, they will be the minimum value, and those should screen right out.
Chris Zagar
Librarian
Estrella Mountain Community College
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ken Irwin
Sent: Wednesday, September 28, 2011 9:41 AM
To: [log in to unmask]
Subject: [CODE4LIB] mysql subquery response time
Hi all,
I've not done much with MySQL subqueries, and I'm trying right now with what I find to be surprising results. I wonder if someone can help me understand.
I have a pile of data that with columns for "institution" and "date". Institution gets repeated a lot, with many different dates. I want to select all the institutions that *only* have dates after July 1 and don't appear in the table before that. My solution was to do a first query for all the institutions that DO have dates before July 1
SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01'
And then to do a SELECT query on all the institutions:
SELECT distinct institution from renewals
And then try to do a NOT IN subquery subtracting the smaller query from the larger one:
SELECT distinct institution from renewals
WHERE institution not in
(SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01')
...only it doesn't seem to work. Or rather, the query has been running for several minutes and never comes back with an answer. Each of these two queries takes just a few milliseconds to run on its own.
Can someone tell me (a) am I just formatting the query wrong, (b) do subqueries like this just take forever, and/or (c) is there a better way to do this? (I don't really understand about JOIN queries, but from what I can tell they are only for mixing the results of two different tables so I think they might not apply here.)
Any advice would be most welcome.
Thanks
Ken
|