> Sounds like you have some experience of this, Kyle!
> Do you have a list of "the screwball stuff"? Even an offhand one would
> be interesting...
I don't have the list with me, but just to rattle a few things off:
some extra-short records rank high because a search term matches so
much of the whole document. Some records contain fields that have been
repeated many times, which artificially boosts them. You'll see
nonstandard use of fields as well as foreign character sets. There are
a number of ways URLs are displayed. Deriving format can be
problematic because the encoded material type and what you're
providing access to are different. Some records contain lots of added
entries, while many important ones are fairly minimalist. There are
records converted on the fly, purchased record sets, automatically
generated ones, full-level ones, and ones that automatically have some
subject heading added that contains a common search term. There are a
zillion other things, but you get the idea.
> What sorts of normalizations do you do? I'm starting to look for
> standard measures of data quality or data validation/normalization
> routines for MARC.
Index terms only once. This helps deal with repeated terms in
repetitive subject headings and added entries. Look at the presence of
fields to assign additional material types (particularly useful for
electronic resources since these typically have paper records -- but
don't be fooled by links to TOC and stuff that's not full text). Give
serials special handling. Keywords need to be weighted differently
depending on where they're from (e.g. title worth more than subject).
We also assigned a rough "record quality" score based on the
presence/absence of fields so that longer, more complete records don't
become less important simply because a search term matches less of
them than of a short record. Give a bit more weight to a true full
text retrieval.
The number of libraries holding the item is considered. When indexing,
650|a is more important than |x, |y, or |z. Don't treat 650|z the same
way as 651|a. Recognize that 650|v is a form, and that some common
650|x fields should be treated the same way (neither should just be a
regular index term). The only thing we didn't use that a lot of places
put a lot of weight on is date -- this is good for retrieving popular
fiction, but you have to be really careful with it in academic
collections because it can hide classic stuff that's been around a
long time. I can't remember everything off the top of my head, but
there's a lot and it makes a big difference.
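
A minimal sketch of a few of the normalizations above (index terms
once, weight by field, keep 650|v and form-like 650|x out of the
regular index, and a rough quality score), in Python. This is an
illustration only, not the system described here: the weights, the
list of tags counted toward quality, the form-like 650|x values, and
the (tag, subfield, value) record layout are all assumptions.

```python
import re

# Hypothetical weights per (tag, subfield). The numbers are invented; only
# the ordering (title > 650|a > 650|x/|y/|z) follows the text above.
FIELD_WEIGHTS = {
    ("245", "a"): 3.0,
    ("650", "a"): 2.0,
    ("651", "a"): 2.0,
    ("650", "x"): 1.0,
    ("650", "y"): 1.0,
    ("650", "z"): 1.0,
}
DEFAULT_WEIGHT = 0.5

# Hypothetical examples of common 650|x values that behave like forms.
FORM_LIKE_650X = {"periodicals", "bibliography"}

# Hypothetical set of tags whose presence counts toward "record quality".
QUALITY_TAGS = ("100", "245", "260", "300", "520", "650")


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_form(tag, code, value):
    """True for 650|v and for form-like 650|x values."""
    if (tag, code) == ("650", "v"):
        return True
    return (tag, code) == ("650", "x") and value.strip().lower() in FORM_LIKE_650X


def index_terms(record):
    """Map each term to one weight: the highest weight of any field it
    appears in. A term repeated across fields is indexed only once."""
    terms = {}
    for tag, code, value in record:
        if is_form(tag, code, value):
            continue  # forms are not regular index terms
        weight = FIELD_WEIGHTS.get((tag, code), DEFAULT_WEIGHT)
        for term in tokenize(value):
            terms[term] = max(terms.get(term, 0.0), weight)
    return terms


def quality_score(record):
    """Fraction of QUALITY_TAGS present in the record (0.0 to 1.0)."""
    present = {tag for tag, _, _ in record}
    return sum(tag in present for tag in QUALITY_TAGS) / len(QUALITY_TAGS)


def score(record, query):
    """Sum of matched term weights, boosted by record quality."""
    terms = index_terms(record)
    matched = sum(terms.get(term, 0.0) for term in set(tokenize(query)))
    return matched * (1.0 + quality_score(record))


record = [
    ("245", "a", "Whales of the Pacific"),
    ("650", "a", "Whales"),
    ("650", "z", "Pacific Ocean"),
    ("650", "a", "Whales"),  # repeated heading adds nothing
    ("650", "x", "Migration"),
    ("650", "v", "Pictorial works"),  # form, not indexed
]
print(index_terms(record))
print(score(record, "whales migration"))
```

Because each term keeps a single weight, repeating a heading many
times cannot push a record up the ranking, and the quality multiplier
works against the short-record effect. Holdings counts, material type
and serials handling are left out of this sketch.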
kyle