There's another source of data for training library relevance ranking that I don't think has been exploited much yet. (This is for academic libraries.)

Searches against catalogs are usually intended to locate material to fill a specific information need, and often this information seeking results in circulation events. Many systems can identify the person who conducted a search session, so comparing the search results to actual checkout events might be fruitful. For example, if a search for certain keywords resulted in checkout events for items other than those listed, but within shelf browsing distance of them, there may be a strong relationship between the words and the information need that those items satisfied.

Incidentally, this is the kind of association that would be much easier to find if the LCSH hierarchy hadn't been so badly mangled by computer. If the hierarchy were intact, it would be possible to aggregate subjects to deal with the sparseness of the circulation events.

Note that getting hold of this data may require working with central IT (e.g. if the library only has IP addresses, and the holder of an IP address at a given time is known only via DHCP logs; or, if the computers in the library require login, those logs may not be accessible to library systems staff directly).

This kind of work should also go through the IRB, even if their approval is not explicitly required. They may have good ideas for avoiding possible privacy violations.

Simon

p.s. If the physical layout of the library is known, you could also estimate the scan radius. You could also calculate, based on checkouts of items seemingly unrelated to the search, from shelves passed on the way to the elevator, how to generate artificial serendipity by randomly throwing a few such items into the search results.

Simon
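
To make the search-versus-checkout comparison above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any real ILS or its API. The record types `Search` and `Checkout`, the `shelf_pos` mapping (an item's position in shelf order), the four-hour time window, and the `radius` of 20 shelf positions are all hypothetical, and the sample data is made up. Patron identifiers are assumed to be pseudonymous tokens.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Search:
    patron: str      # pseudonymous patron token (assumption)
    time: datetime
    query: str
    results: tuple   # item ids shown in the result list


@dataclass(frozen=True)
class Checkout:
    patron: str
    time: datetime
    item: str


def nearby_unlisted_checkouts(searches, checkouts, shelf_pos,
                              window=timedelta(hours=4), radius=20):
    """Count (keyword, item) pairs where a patron checked out an item that
    was NOT in their search results but is shelved within `radius`
    positions of an item that was, within `window` after the search."""
    pairs = Counter()
    for s in searches:
        listed = [shelf_pos[i] for i in s.results if i in shelf_pos]
        for c in checkouts:
            if c.patron != s.patron:
                continue
            if not (timedelta(0) <= c.time - s.time <= window):
                continue
            if c.item in s.results or c.item not in shelf_pos:
                continue
            pos = shelf_pos[c.item]
            if any(abs(pos - p) <= radius for p in listed):
                for word in s.query.lower().split():
                    pairs[(word, c.item)] += 1
    return pairs


# Made-up sample data: b3 sits near the listed items, b9 is far away.
shelf_pos = {"b1": 100, "b2": 105, "b3": 110, "b9": 900}
searches = [Search("p1", datetime(2024, 1, 1, 10), "medieval trade", ("b1", "b2"))]
checkouts = [
    Checkout("p1", datetime(2024, 1, 1, 11), "b3"),  # unlisted, nearby
    Checkout("p1", datetime(2024, 1, 1, 11), "b9"),  # unlisted, far away
    Checkout("p2", datetime(2024, 1, 1, 11), "b3"),  # different patron
]
pairs = nearby_unlisted_checkouts(searches, checkouts, shelf_pos)
```

With real circulation data the counts per item would be sparse, which is where aggregating items up to broader subject headings would help. The nested loop is fine for a sketch, but a real data set would want the checkouts indexed by patron first.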