On Wed, Oct 25, 2017 at 8:57 AM, Eric Lease Morgan <[log in to unmask]> wrote:
> ...My bibliographic data is fraught with inconsistencies. For example, a
> publisher’s name may be recorded one way, another way, or a third way. The
> same goes for things like publisher place: South Bend; South Bend, IN;
> South Bend, Ind. And then there is the ISBD punctuation that is sometimes
> applied and sometimes not. All of these inconsistencies make indexing &
> faceted browsing more difficult than it needs to be.
Effective normalizing is about understanding patterns that represent the
same thing and being aware of the patterns associated with specific types
of data. For example, in your publisher example, detecting geographic
entities and normalizing the states would be easy enough. You'll also see
variation in how the publisher names themselves are expressed, but the vast
majority of variations follow a small number of patterns.
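For instance, a place normalizer might strip trailing punctuation and map the handful of ways a state gets written to one canonical form. A minimal sketch (the mapping table and the "City, State" layout are my own assumptions, not drawn from any particular record source):

```python
# Hypothetical sketch: normalize U.S. state spellings in a publisher-place
# string. STATE_FORMS is an invented, partial mapping for illustration.
STATE_FORMS = {
    "in": "IN", "ind": "IN", "indiana": "IN",
    "ny": "NY", "n y": "NY", "new york": "NY",
}

def normalize_place(place):
    """Split 'City, State' and map the state part to a canonical form."""
    # Drop trailing ISBD-style punctuation such as ' :' or '.'
    place = place.strip().rstrip(" :;,.")
    parts = [p.strip() for p in place.split(",")]
    if len(parts) == 2:
        key = parts[1].lower().replace(".", " ").strip()
        state = STATE_FORMS.get(key)
        if state:
            return f"{parts[0]}, {state}"
    return place

for p in ["South Bend", "South Bend, IN", "South Bend, Ind."]:
    print(normalize_place(p))
```

All three inputs above come out as either "South Bend" (no state present) or "South Bend, IN", which is the kind of collapse that makes faceting behave.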
Don't be afraid to use multifield logic to normalize one or multiple
fields. To return to your publisher example, fragments from the publisher
name and place may be used to normalize both fields individually and
collectively more accurately than attempting to normalize each field in
isolation.
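One cheap way to express that multifield logic is a lookup keyed on fragments of both fields, so each field helps disambiguate the other. A sketch, with entirely invented table entries:

```python
# Hypothetical multifield normalization: fragments of publisher name AND
# place must both match before we rewrite either field. The KNOWN table
# is invented for illustration.
KNOWN = {
    # (name fragment, place fragment) -> (canonical name, canonical place)
    ("notre dame", "notre dame"): ("University of Notre Dame Press", "Notre Dame, IN"),
    ("norton", "new york"): ("W. W. Norton", "New York, NY"),
}

def normalize_pair(name, place):
    """Normalize publisher name and place together, or leave both alone."""
    n, p = name.lower(), place.lower()
    for (nfrag, pfrag), canonical in KNOWN.items():
        if nfrag in n and pfrag in p:
            return canonical
    return (name, place)
```

Requiring agreement between the two fields means a stray "Norton" in the name field won't get rewritten unless the place corroborates it.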
What is the source of your bib data -- or are there many? You may be able
to use info such as byte 18 (descriptive cataloging form) in the Leader or
even the cat date to figure out cataloging rules that would have been in
play that drive patterns specific to those records. If you have multiple
sources of records, the patterns will most likely vary with the source,
e.g. there are multiple ways personal names can be expressed, but the
number of variations is small.
Depending on what you're working with, other clustering tools may be
helpful. However, you may get better and more predictable results from a
method tuned to the data you have than from a much more sophisticated
mechanism created for other uses.
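As a baseline before reaching for anything fancy, key-collision clustering (similar in spirit to OpenRefine's default "fingerprint" method) is easy to implement and very predictable: strings that reduce to the same normalized token set land in the same cluster.

```python
# Minimal key-collision clustering sketch: lowercase, strip punctuation,
# sort the unique tokens, and group values sharing the same key.
import re
from collections import defaultdict

def fingerprint(s):
    tokens = re.split(r"\W+", s.lower().strip())
    return " ".join(sorted(set(t for t in tokens if t)))

def cluster(values):
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

print(cluster(["Penguin Books", "penguin books.", "Books, Penguin", "Knopf"]))
```

You can then review each cluster and pick (or hand-edit) the canonical form, which keeps a human in the loop where the data is messiest.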