>This is much harder to do than might appear on the surface. 10K is a
>really small set, and the issue is that unless people know how to
>create a set that really targets the problem areas, you will
>inevitably miss important stuff. At the end of the day, it's the
>screwball stuff you didn't think about that always causes the most
>problems. I think such data sizes are useful for testing interfaces,
>but not for determining catalog behavior and setup.

Sounds like you have some experience with this, Kyle! Do you have a
list of "the screwball stuff"? Even an offhand one would be
interesting...

>Despite the indexing time, I believe in testing with much larger sets.
>There are certain very important things that just can't be examined
>with small sets. For example, one huge problem with catalog data is
>that the completeness and quality are highly variable. When we were
>experimenting some time back, we found that how you normalize the data
>and how you weight terms as well as documents has an enormous impact
>on search results, and that unless you do some tuning, you will
>inevitably find a lot of garbage too close to the top with a bunch of
>good stuff ranked so low it isn't found.

What sorts of normalizations do you do? I'm starting to look for
standard measures of data quality, or data validation/normalization
routines for MARC. I ask because in my experiments with the FRBR
Display Tool I've found the sorts of variations you describe, and I'd
like to experiment with more data validation & normalization. I'm very
new at working with MARC data, so even pointers to standard stuff
would be really helpful!

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076
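
P.S. To make the weighting point concrete, here is a minimal sketch in
plain Python. Kyle doesn't say what scheme his group actually used, so
this just shows one classic choice (tf-idf with document-length
normalization) on two invented toy records:

import math
from collections import Counter

def score(query_terms, doc_terms, doc_freq, n_docs, normalize=True):
    # tf-idf score of one document for a bag-of-words query
    counts = Counter(doc_terms)
    s = 0.0
    for t in query_terms:
        if counts[t] == 0 or doc_freq.get(t, 0) == 0:
            continue
        tf = 1.0 + math.log(counts[t])              # dampened term frequency
        idf = math.log(1.0 + n_docs / doc_freq[t])  # rarer terms weigh more
        s += tf * idf
    if normalize:
        s /= math.sqrt(len(doc_terms))              # penalize long records
    return s

# Toy records: a short focused one vs. a long sparse one
docs = {
    "short": "hamlet shakespeare".split(),
    "long": ("hamlet hamlet hamlet " + "criticism essays " * 30).split(),
}
df = Counter(t for d in docs.values() for t in set(d))
for norm in (False, True):
    ranked = sorted(docs, reverse=True,
                    key=lambda name: score(["hamlet"], docs[name],
                                           df, len(docs), norm))
    print("normalize =", norm, "->", ranked)

Without normalization the long, sparse record wins on raw term
frequency alone; with it, the short focused record comes out on top.
Real tuning is messier, of course, but that's the kind of knob being
described.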
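
P.P.S. And for the kind of MARC normalization I'd like to try: a rough
sketch using pymarc (the Python MARC library). The file name is a
placeholder, and the cleanup rules (whitespace, trailing ISBD
punctuation, case) are just examples, not any standard routine:

import re
from collections import Counter
from pymarc import MARCReader

def normalize_title(raw):
    # collapse whitespace, strip trailing ISBD punctuation, case-fold
    t = re.sub(r"\s+", " ", raw).strip()
    t = re.sub(r"\s*[/:;,.]+$", "", t)
    return t.casefold()

variants = Counter()
with open("records.mrc", "rb") as fh:   # placeholder file name
    for record in MARCReader(fh):
        if record is None:              # pymarc yields None for bad records
            continue
        for field in record.get_fields("245"):
            for title in field.get_subfields("a"):
                variants[normalize_title(title)] += 1

for title, n in variants.most_common(10):
    print(n, title)

Normalized titles that occur more than once are candidates for records
that differ only in punctuation, spacing, or case, which is the kind of
variation I've been seeing in the FRBR Display Tool output.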