> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop...
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild
This is much harder to do than it might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think data sets of that size are useful for testing
interfaces, but not for determining catalog behavior and setup.
Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that completeness and quality are highly variable. When we were
experimenting some time back, we found that how you normalize the data,
and how you weight terms as well as documents, has an enormous impact
on search results. Unless you do some tuning, you will inevitably find
a lot of garbage too close to the top, with a bunch of good stuff
ranked so low it isn't found.
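To make the weighting point concrete, here is a small, purely illustrative
sketch (not from our actual experiments; the records and values are invented)
of the kind of effect I mean: a record padded with repetitive or boilerplate
text can out-rank a concise, relevant record on raw term counts, and even a
crude length normalization flips the ordering. Real engines such as
Lucene/Solr use far more sophisticated scoring; this only shows why tuning
matters.

    # Toy TF-IDF ranker; records and names are invented for illustration.
    import math
    from collections import Counter

    records = {
        "good":   "a concise history of science",
        "padded": ("history of science " * 4
                   + "unrelated boilerplate notes and subject headings " * 6),
    }

    docs = {rid: Counter(text.lower().split()) for rid, text in records.items()}
    n_docs = len(docs)

    def idf(term):
        # Smoothed inverse document frequency.
        df = sum(1 for tf in docs.values() if term in tf)
        return math.log((n_docs + 1) / (df + 1)) + 1

    def rank(query, normalize_length):
        scores = {}
        for rid, tf in docs.items():
            s = sum(tf[t] * idf(t) for t in query.lower().split())
            if normalize_length:
                s /= sum(tf.values())   # crude per-token length normalization
            scores[rid] = s
        return sorted(scores, key=scores.get, reverse=True)

    print(rank("history of science", normalize_length=False))  # ['padded', 'good']
    print(rank("history of science", normalize_length=True))   # ['good', 'padded']

With a 10K sample you may never hit enough padded or inconsistent records to
notice this; with a larger, messier set the effect shows up immediately.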
kyle