LISTSERV 16.5 - CODE4LIB Archives

I think you start with a smaller set, but then when you find
idiosyncratic records that were NOT represented in your smaller set, you
add representative samples to the sample set. The sample set organically
grows.

Certainly at some point you've got to test on a larger set as well. But
I think there's a lot of value in having a small test set too. Of course,
it is something of a challenge even to come up with a reasonably
representative small set. But it doesn't need to be absolutely
representative: when you find examples not represented, you add them.
It grows.
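
A minimal sketch of that grow-as-you-go idea, in Python. Everything in it
is an assumption made for illustration: records are plain dicts mapping a
MARC tag to a list of field values, and "represented" is approximated by
the set of tags a record contains. The names grow_sample and tag_signature
are hypothetical and do not come from any real MARC library.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List

# Hypothetical record shape: MARC tag -> list of field values.
Record = Dict[str, List[str]]


def tag_signature(record: Record) -> FrozenSet[str]:
    """Assumed notion of a record's 'shape': the set of tags it contains."""
    return frozenset(record)


def grow_sample(
    sample: List[Record],
    candidates: Iterable[Record],
    signature: Callable[[Record], Hashable] = tag_signature,
) -> List[Record]:
    """Append to `sample` each candidate whose signature is not yet
    represented, and return the list of records that were added."""
    seen = {signature(r) for r in sample}
    added = []
    for record in candidates:
        sig = signature(record)
        if sig not in seen:
            # First record of this shape: keep it as the representative.
            seen.add(sig)
            sample.append(record)
            added.append(record)
    return added


if __name__ == "__main__":
    sample = [{"100": ["Author"], "245": ["Title"]}]
    incoming = [
        {"245": ["Another title"], "100": ["Another author"]},  # same shape
        {"245": ["Title with linked field"], "880": ["x"]},     # new shape
    ]
    added = grow_sample(sample, incoming)
    print(len(added), "record(s) added;", len(sample), "in the sample set")
```

The signature function is the part you would swap out in practice. A set
of tags is a crude stand-in; whatever quirk broke your indexer last time
(a missing field, an odd encoding, an overlong value) is a better thing to
key on.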

Jonathan

Kyle Banerjee wrote:
>> According to the combined brainstorming of Jonathan Rochkind and
>> myself, the ideal record set should:
>>
>> 1. contain about 10k records, enough to really see the features, but
>> small enough that you could index it in a few minutes on a typical
>> desktop...
>>
>
>
>> 5. contain a distribution of typical errors one might encounter with
>> MARC records in the wild
>>
>
> This is much harder to do than might appear on the surface. 10K is a
> really small set, and the issue is that unless people know how to
> create a set that really targets the problem areas, you will
> inevitably miss important stuff. At the end of the day, it's the
> screwball stuff you didn't think about that always causes the most
> problems. I think such data sizes are useful for testing interfaces,
> but not for determining catalog behavior and setup.
>
> Despite the indexing time, I believe in testing with much larger sets.
> There are certain very important things that just can't be examined
> with small sets. For example, one huge problem with catalog data is
> that the completeness and quality are highly variable. When we were
> experimenting some time back, we found that how you normalize the data
> and how you weight terms as well as documents has an enormous impact
> on search results and that unless you do some tuning, you will
> inevitably find a lot of garbage too close to the top with a bunch of
> good stuff ranked so low it isn't found.
>
> kyle
>
>

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu