LISTSERV 16.5 - CODE4LIB Archives

At Wed, 28 Sep 2011 12:41:26 -0400,
Ken Irwin wrote:
>
> Hi all,
>
> I've not done much with MySQL subqueries, and I'm trying right now with what I find to be surprising results. I wonder if someone can help me understand.
>
> I have a pile of data that with columns for "institution" and "date". Institution gets repeated a lot, with many different dates. I want to select all the institutions that *only* have dates after July 1 and don't appear in the table before that. My solution was to do a first query for all the institutions that DO have dates before July 1
> SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01'
>
> And then to do a SELECT query on all the institutions:
> SELECT distinct institution from renewals
>
> And then try to do a NOT IN subquery subtracting the smaller query from the larger one:
>
> SELECT distinct institution from renewals
> WHERE institution not in
> (SELECT distinct institution FROM `renewals` WHERE snap_date < '2011-07-01')
>
> ...only it doesn't seem to work. Or rather, the query has been running for several minutes and never comes back with an answer. Each of these two queries takes just a few milliseconds to run on its own.
>
> Can someone tell me (a) am I just formatting the query wrong, (b) do subqueries like this just take forever, and/or (c) is there a better way to do this? (I don't really understand about JOIN queries, but from what I can tell they are only for mixing the results of two different tables so I think they might not apply here.)
>
> Any advice would be most welcome.


Not sure about the length of the combined query. One reason might be
that if m denotes the total number of rows in `renewals' and n denotes
the number of rows in the subquery, then AFAIK the MySQL needs to
perform m x n comparisions.

If I am not mistaken there might be room for simplifying the entire
query.

For the sake of simplicity first drop the DISTINCT constraint and
consider `snap_date' to not contain NULL values. Now we define:

- M is the set of all rows in `renewals'
- A is the subset of all rows in `renewals' created by the subquery's
  WHERE claus

We know some things about A and M:

A is a subset of M.

The WHERE clause divides M into two subsets: One set of rows that
fullfill the WHERE clause (set A) and one set of rows that fullfill
the complement of set A's WHERE clause (set B).

The intersection of A and B is the empty set: Either a `snap_date' is
before 2011-07-01 or it is equal to or after 2011-07-01. This
translates to: Each row in `renewals' is either in set A or in set
B. A row that is not in set A is in set B and vice versa.

If this holds true, you can simplify the entire query from:

SELECT institution from renewals WHERE institution *not in A*

to:

SELECT institution from renewals WHERE institution *in B*

Second step. Which `institution' fields of the entire table (outer
query) are in set B (inner query)? Well: The `institution' fields of
all rows in set B.

Or to put it in another way: If you take each `institution' from the
entire table and check if it is in the subset of rows created by the
inner query you will find that every `institution' whose `snap_date'
fullfills the inner query's WHERE claus is in the set created by the
inner query.

I.e.: You can drop the outer query:

SELECT institution FROM renewals WHERE `snap_date' >= '2011-07-01'

Finally put back the DISTINCT constraint.

SELECT DISTINCT institution FROM renewals WHERE `snap_date' >= '2011-07-01'

Does this make sense?

Btw. Stanford started an online course on "Introduction to Databases"
at http://www.db-class.org -- you can access the online tutorials
without beeing rolled-in at http://www.db-class.org/course/video/preview_list

HTH,
  -- David
--
David Maus
Projekt HAB 2.0
Herzog August Bibliothek - D-38299 Wolfenbuettel
Phone: +49-5331-808-379
Email: [log in to unmask]
Github: http://github.com/dmj