There's another source of training data for library relevance ranking, at
least in academic libraries, that I don't think has been exploited much yet.
Searches against catalogs are usually intended to locate material to fill a
specific information need.
Often this information seeking results in circulation events.
Many systems can identify the person who conducted a search session.
Comparing the search results to actual checkout events might be fruitful.
For example, if a search for certain keywords resulted in checkout events
for items other than those listed, but shelved within browsing distance of
them, there may be a strong relationship between those words and the
information need that the checked-out items satisfied.
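A minimal sketch of that comparison, assuming each session record carries the
query, the item ids shown in the results, and the item ids later checked out
by the same person. The field names, the integer shelf_pos map (an item's
position in shelf order) and the max_distance threshold are all hypothetical
stand-ins for whatever the local system actually provides.

```python
from collections import Counter

def nearby_unlisted_checkouts(result_ids, checkout_ids, shelf_pos, max_distance):
    """Checked-out items that were not in the results but are shelved within
    max_distance positions of at least one listed item."""
    listed = set(result_ids)
    listed_positions = [shelf_pos[i] for i in result_ids if i in shelf_pos]
    nearby = []
    for item in checkout_ids:
        if item in listed or item not in shelf_pos:
            continue
        if any(abs(shelf_pos[item] - p) <= max_distance for p in listed_positions):
            nearby.append(item)
    return nearby

def keyword_item_counts(sessions, shelf_pos, max_distance=20):
    """Count (query word, item) pairs over all sessions (hypothetical schema)."""
    counts = Counter()
    for s in sessions:
        hits = nearby_unlisted_checkouts(
            s["results"], s["checkouts"], shelf_pos, max_distance
        )
        for item in hits:
            for word in s["query"].lower().split():
                counts[(word, item)] += 1
    return counts
```

Pairs that accumulate high counts across many sessions are the candidate
associations; a single session on its own says very little.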
Incidentally, this is the kind of association that would be much easier to
find if the LCSH hierarchy hadn't been so badly mangled by computer. If the
hierarchy were intact it would be possible to aggregate subjects to deal
with the sparseness of the circulation events.
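To illustrate the aggregation idea: given checkout counts per subject heading
and a map from each heading to its broader heading, sparse counts can be
rolled up one or more levels. This is only a sketch. The broader map is
hypothetical, and it assumes at most one broader term per heading, which is a
simplification.

```python
from collections import Counter

def roll_up(counts, broader, levels=1):
    """Move each subject's count up to its ancestor `levels` steps above it,
    stopping early at headings that have no broader term."""
    rolled = Counter()
    for subject, n in counts.items():
        for _ in range(levels):
            parent = broader.get(subject)
            if parent is None:
                break
            subject = parent
        rolled[subject] += n
    return rolled

# Toy labels for illustration only, not actual authority records.
broader = {"Bees": "Insects", "Wasps": "Insects", "Insects": "Animals"}
counts = {"Bees": 1, "Wasps": 2, "Animals": 1}
by_parent = roll_up(counts, broader, levels=1)
```

Here the single checkouts under the narrow headings pool at their shared
parent, where there may be enough events to say something.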
Note that getting hold of this data may require working with central IT.
For example, if the library only has IP addresses, the holder of an address
at a given time may be known only via DHCP logs; or, if the computers in the
library require login, those logs may not be directly accessible to library
systems staff. This kind of work should also go through the IRB, even if
its approval is not explicitly required. The board may have good ideas for
avoiding possible privacy violations.
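A sketch of the lookup this implies, assuming central IT can export lease
records as (ip, start, end, holder) rows. That layout, the function names and
the secret key are all assumptions, not any particular DHCP server's log
format. The holder is replaced with a keyed hash before it is joined to
anything else, which is one common way to reduce exposure but not a
substitute for review.

```python
import hashlib
import hmac
from datetime import datetime

def holder_at(leases, ip, when):
    """Return the holder of `ip` at time `when`, or None if no lease matches."""
    for lease in leases:
        if lease["ip"] == ip and lease["start"] <= when < lease["end"]:
            return lease["holder"]
    return None

def pseudonymize(holder, key):
    """Replace an identifier with a stable keyed hash (HMAC-SHA256)."""
    return hmac.new(key, holder.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical lease export.
leases = [
    {"ip": "10.0.0.5", "start": datetime(2024, 1, 1, 9, 0),
     "end": datetime(2024, 1, 1, 11, 0), "holder": "user1"},
    {"ip": "10.0.0.5", "start": datetime(2024, 1, 1, 11, 0),
     "end": datetime(2024, 1, 1, 13, 0), "holder": "user2"},
]
who = holder_at(leases, "10.0.0.5", datetime(2024, 1, 1, 11, 30))
token = pseudonymize(who, key=b"example-key-kept-by-central-IT")
```

The same token can then be used to match search sessions to checkout events
without the analysis side ever seeing the raw identifier.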
If the physical layout of the library is known, you could also estimate scan
radius (how far from a located item patrons tend to browse). Checkouts of
items seemingly unrelated to the search, taken from shelves passed on the way
to the elevator, could also be used to generate artificial serendipity:
randomly throw a few such items into the search results.
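One way the serendipity step might look, as a sketch only. It assumes a
hypothetical en_route_counts map giving, for each item, how many times it was
checked out by someone whose search did not list it. Items are drawn without
replacement, weighted by that count, and dropped into random positions.

```python
import random

def add_serendipity(results, en_route_counts, k=2, rng=None):
    """Return a copy of `results` with up to k extra items mixed in, sampled
    in proportion to their en-route checkout counts."""
    rng = rng or random.Random()
    listed = set(results)
    pool = {item: n for item, n in en_route_counts.items()
            if item not in listed and n > 0}
    chosen = []
    while pool and len(chosen) < k:
        items = list(pool)
        pick = rng.choices(items, weights=[pool[i] for i in items], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    mixed = list(results)
    for item in chosen:
        # randint is inclusive, so the item may also land at the very end
        mixed.insert(rng.randint(0, len(mixed)), item)
    return mixed

mixed = add_serendipity(
    ["r1", "r2", "r3"],
    {"x": 5, "y": 1, "r1": 9, "z": 0},
    k=2,
    rng=random.Random(0),
)
```

The original results keep their relative order; how many extras to add, and
whether they should be marked as such, is a design choice left open here.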