>This is much harder to do than might appear on the surface. 10K is a
>really small set, and the issue is that unless people know how to
>create a set that really targets the problem areas, you will
>inevitably miss important stuff. At the end of the day, it's the
>screwball stuff you didn't think about that always causes the most
>problems. I think such data sizes are useful for testing interfaces,
>but not for determining catalog behavior and setup.

Sounds like you have some experience of this, Kyle!
Do you have a list of "the screwball stuff"? Even an offhand one would
be interesting...

>Despite the indexing time, I believe in testing with much larger sets.
>There are certain very important things that just can't be examined
>with small sets. For example, one huge problem with catalog data is
>that the completeness and quality is highly variable. When we were
>experimenting some time back, we found that how you normalize the data
>and how you weight terms as well as documents has an enormous impact
>on search results and that unless you do some tuning, you will
>inevitably find a lot of garbage too close to the top with a bunch of
>good stuff ranked so low it isn't found.
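
(To make sure I follow the weighting point, here's a toy sketch of what I
think you mean -- the field weights and the two fake records are entirely
made up, just to show how an untuned scheme lets incidental mentions score
almost as high as a real match:)

    from collections import Counter

    # Hypothetical field weights -- a title hit counts more than a note hit.
    FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}

    records = [
        {"title": "introduction to cataloging", "subject": "cataloging",
         "notes": ""},
        {"title": "gardening basics", "subject": "horticulture",
         "notes": "notes on cataloging, cataloging history, cataloging "
                  "codes, and cataloging tools"},
    ]

    def score(record, query_terms):
        # Sum of field-weighted term counts; deliberately no tuning or
        # length normalization, so a notes field full of incidental
        # mentions scores nearly as high as a genuine title match.
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            counts = Counter(record[field].split())
            for term in query_terms:
                total += weight * counts[term]
        return total

    for rec in records:
        print(rec["title"], score(rec, ["cataloging"]))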

What sorts of normalizations do you do? I'm starting to look for
standard measures of data quality or data validation/normalization
routines for MARC.

I ask because in my experiments with the FRBR Display Tool, I've found
the sorts of variations you describe, and I'd like to experiment with more
data validation & normalization. I'm very new at working with MARC data,
so even pointers to standard stuff would be really helpful!
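
For concreteness, here's the rough kind of title normalization I've been
sketching, using pymarc -- the filename and the particular cleanup steps
are just my guesses at what a matching key might need, not any standard
routine:

    import re
    import unicodedata
    from pymarc import MARCReader

    def normalize(text):
        # Lowercase, strip diacritics, drop punctuation, collapse whitespace.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    with open("records.mrc", "rb") as fh:            # placeholder filename
        for record in MARCReader(fh):
            if record is None:                       # skip unparsable records
                continue
            for field in record.get_fields("245"):   # title statement
                title = " ".join(field.get_subfields("a", "b"))
                print(normalize(title))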

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076