> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
>    small enough that you could index it in a few minutes on a typical
>    desktop...
> 5. contain a distribution of typical errors one might encounter with
>    marc records in the wild

This is much harder to do than it might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to create
a set that really targets the problem areas, you will inevitably miss
important stuff. At the end of the day, it's the screwball stuff you
didn't think about that always causes the most problems.

I think data sets of that size are useful for testing interfaces, but
not for determining catalog behavior and setup. Despite the indexing
time, I believe in testing with much larger sets. There are certain very
important things that just can't be examined with small sets.

For example, one huge problem with catalog data is that completeness and
quality are highly variable. When we were experimenting some time back,
we found that how you normalize the data and how you weight terms as
well as documents has an enormous impact on search results. Unless you
do some tuning, you will inevitably find a lot of garbage too close to
the top, with a bunch of good stuff ranked so low it isn't found.

kyle
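For what it's worth, one way to make a small set actually target the problem areas is to seed a sample with known error types on purpose rather than hope a random draw catches them. A rough sketch in Python -- the record shape and the error list here are invented for illustration, not real MARC structures:

```python
import random

# Hypothetical error injectors -- each takes a record (a plain dict
# standing in for a parsed MARC record) and returns a damaged copy.
TYPICAL_ERRORS = [
    lambda r: {**r, "245": None},                    # missing title field
    lambda r: {**r, "008": r["008"][:20]},           # truncated fixed field
    lambda r: {**r, "leader": r["leader"].lower()},  # mangled leader
]

def build_test_set(records, size=10_000, error_rate=0.1, seed=42):
    """Sample `size` records and deliberately damage a fraction of them."""
    rng = random.Random(seed)  # fixed seed keeps the set reproducible
    sample = rng.sample(records, min(size, len(records)))
    out = []
    for rec in sample:
        if rng.random() < error_rate:
            rec = rng.choice(TYPICAL_ERRORS)(rec)
        out.append(rec)
    return out
```

The point is only that the errors are chosen and countable up front, so you know exactly which problem areas the 10K set does and does not cover.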
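To make the weighting point concrete, here's a toy Python sketch (the records and field names are invented): with flat weights, a junky record that repeats the query terms in a notes field outranks a good record that carries them in the title, and boosting fields flips the order.

```python
from collections import Counter

def score(query, record, boosts=None):
    """Sum raw term frequency per field, optionally scaled by a field boost."""
    boosts = boosts or {}
    total = 0.0
    for field, text in record.items():
        tf = Counter(text.lower().split())
        hits = sum(tf[t] for t in query.lower().split())
        total += hits * boosts.get(field, 1.0)
    return total

good = {"title": "moby dick", "notes": "a novel of the sea"}
junk = {"title": "untitled", "notes": "moby dick moby dick moby dick"}

# Flat weights: the junk record's repeated notes terms win (6.0 vs 2.0).
flat_good, flat_junk = score("moby dick", good), score("moby dick", junk)

# Boosting title over notes puts the good record back on top.
boosts = {"title": 5.0, "notes": 0.5}
tuned_good, tuned_junk = score("moby dick", good, boosts), score("moby dick", junk, boosts)
```

Real engines are far more sophisticated than raw term frequency, but the failure mode is the same: without tuning, garbage records with inflated or duplicated fields crowd the top of the results.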