At Wed, 28 Sep 2011 12:41:26 -0400, Ken Irwin wrote: > > Hi all, > > I've not done much with MySQL subqueries, and I'm trying right now with what I find to be surprising results. I wonder if someone can help me understand. > > I have a pile of data that with columns for "institution" and "date". Institution gets repeated a lot, with many different dates. I want to select all the institutions that *only* have dates after July 1 and don't appear in the table before that. My solution was to do a first query for all the institutions that DO have dates before July 1 > SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01' > > And then to do a SELECT query on all the institutions: > SELECT distinct institution from renewals > > And then try to do a NOT IN subquery subtracting the smaller query from the larger one: > > SELECT distinct institution from renewals > WHERE institution not in > (SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01') > > ...only it doesn't seem to work. Or rather, the query has been running for several minutes and never comes back with an answer. Each of these two queries takes just a few milliseconds to run on its own. > > Can someone tell me (a) am I just formatting the query wrong, (b) do subqueries like this just take forever, and/or (c) is there a better way to do this? (I don't really understand about JOIN queries, but from what I can tell they are only for mixing the results of two different tables so I think they might not apply here.) > > Any advice would be most welcome. Not sure about the length of the combined query. One reason might be that if m denotes the total number of rows in `renewals' and n denotes the number of rows in the subquery, then AFAIK the MySQL needs to perform m x n comparisions. If I am not mistaken there might be room for simplifying the entire query. For the sake of simplicity first drop the DISTINCT constraint and consider `snap_date' to not contain NULL values. Now we define: - M is the set of all rows in `renewals' - A is the subset of all rows in `renewals' created by the subquery's WHERE claus We know some things about A and M: A is a subset of M. The WHERE clause divides M into two subsets: One set of rows that fullfill the WHERE clause (set A) and one set of rows that fullfill the complement of set A's WHERE clause (set B). The intersection of A and B is the empty set: Either a `snap_date' is before 2011-07-01 or it is equal to or after 2011-07-01. This translates to: Each row in `renewals' is either in set A or in set B. A row that is not in set A is in set B and vice versa. If this holds true, you can simplify the entire query from: SELECT institution from renewals WHERE institution *not in A* to: SELECT institution from renewals WHERE institution *in B* Second step. Which `institution' fields of the entire table (outer query) are in set B (inner query)? Well: The `institution' fields of all rows in set B. Or to put it in another way: If you take each `institution' from the entire table and check if it is in the subset of rows created by the inner query you will find that every `institution' whose `snap_date' fullfills the inner query's WHERE claus is in the set created by the inner query. I.e.: You can drop the outer query: SELECT institution FROM renewals WHERE `snap_date' >= '2011-07-01' Finally put back the DISTINCT constraint. SELECT DISTINCT institution FROM renewals WHERE `snap_date' >= '2011-07-01' Does this make sense? Btw. Stanford started an online course on "Introduction to Databases" at http://www.db-class.org -- you can access the online tutorials without beeing rolled-in at http://www.db-class.org/course/video/preview_list HTH, -- David -- David Maus Projekt HAB 2.0 Herzog August Bibliothek - D-38299 Wolfenbuettel Phone: +49-5331-808-379 Email: [log in to unmask] Github: http://github.com/dmj