I'd be interested in collecting anyone's enum-chron parsers they're willing
to make available; there's got to be a lot of duplicated effort out there
and between everyone, there's gotta be *something* that'll work
halfway-decently.

Anyone? Anyone? Direct to me at [log in to unmask]; I'll post back to the list
when I get something worth posting.
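
To make the ask concrete, here is a rough sketch of the heuristic David
describes below. It's plain Python over (code, value) subfield pairs rather
than any particular MARC library, and the function name and data layout are
my guesses, not his actual code:

# Sketch: does an 853 caption/pattern field's "enumeration" actually hold
# chronology, per the heuristic quoted below? Assumes the caller has already
# extracted the field's subfields as (code, value) pairs.
def enumeration_is_really_chronology(subfields):
    codes = {code for code, value in subfields}
    has_enum_captions  = bool(codes & set("abcdef"))  # $a-$f present
    has_chron_captions = bool(codes & set("ijk"))     # $i-$k present
    has_continuity     = bool(codes & set("uv"))      # per-level $u/$v
    has_frequency      = "w" in codes                 # publication frequency
    has_calendar       = "x" in codes                 # calendar-change point
    return (has_enum_captions
            and not has_chron_captions
            and not has_continuity
            and has_frequency
            and not has_calendar)

If that returns True, the $a-$f subfields should be read as chronology
rather than enumeration.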

On Sat, Jan 28, 2012 at 12:22 PM, David Fiander <[log in to unmask]> wrote:

> Stephen, regarding the question of ambiguity about chronology vs
> enumeration, this is what I did with my parser:
>
> # If items are identified by chronology only, with no separate
> # enumeration (eg, a newspaper issue), then the chronology is
> # recorded in the enumeration subfields $a - $f.  We can tell
> # that this is the case if there are $a - $f subfields and no
> # chronology subfields ($i-$k), and none of the $a-$f subfields
> # have associated $u or $v subfields, but there's a $w and no $x
>
> So, if there are ONLY enumeration fields, and none of the enumeration
> fields have corresponding frequency or continuity indicators, AND there's a
> publication frequency but no indication of when in the calendar the highest
> level of enumeration changes, THEN the enumerations are really chronology.
>
> Of course, this will still get certain patterns wrong, but it's the best
> one can do.
>
>
> On Sat, Jan 28, 2012 at 11:37, Stephen Meyer <[log in to unmask]> wrote:
>
> > War is hell, right? Lately we have been dealing with a particular
> > combination of two circles of the metadata Inferno: the first (limbo) and
> > sixth (heresy):
> >
> > The limbo I'll define as a poorly designed metadata spec: the MARC
> > holdings standard. The poor design in question is the ambiguity of
> > enumeration/chronology subfield assignment, specifically this rule:
> >
> >  When only chronology is used on an item (that is, the item
> >  carries no enumeration), the chronology is contained in the
> >  relevant enumeration subfield ($a-$h) instead of the chronology
> >  subfields ($i-$m).
> >  http://www.loc.gov/marc/holdings/hd863865.html
> >
> > This means that as a programmer trying to parse enumeration and chronology
> > data from our holdings data *that uses a standard* I cannot reliably know
> > that a subfield which has been defined as containing "First level of
> > enumeration" will in fact contain enumeration rather than chronology.
> > What's a programmer to do? Limbo, limbo.
> >
> > Others in this thread have already described the common heresy involved in
> > MARC cataloging: embedding data in a record intended for a single
> > institution, or worse, a specific OPAC.
> >
> > Due to the ambiguity in the spec and the desire to just make it look the
> > way I want it to look in my OPAC, the temptation is simply too great. In
> > the end, we have data that couldn't possibly meet the standard as it is
> > described, which means we spend more time than we expected parsing it in
> > the next system.
> >
> > In our case we work through these issues with an army of code tests. Our
> > catalogers and reference staff find broken examples of MARC holdings data
> > parsing in our newest discovery system; we gather the real-world MARC
> > records as a test data set and then write a bunch of RSpec tests so we
> > don't undo previous bug fixes as we deal with the current ones. The
> > challenge is coming up with a fast and responsive mechanism/process for
> > adding a record to the test set once identified.
> >
> > -Steve
> >
> > Bess Sadler wrote, On 1/27/12 8:26 PM:
> >
> >> I remember the "required field" operation of... aught six? aught seven?
> >> It all runs together at my age. Turns out, for years people had been making
> >> shell catalog records for items in the collection that needed to be checked
> >> out but hadn't yet been barcoded. Some percentage of these people opted not
> >> to record any information about the item other than the barcode it left the
> >> building under, presumably because they were "in a hurry". If there was
> >> such a thing as a metadata crime, that'd be it.
> >>
> >> We were young and naive, we thought "why not just index all our catalog
> >> records into solr?" Little did we know what unholy abominations we would
> >> uncover. Out of nowhere, we were surrounded by zombie marc records,
> >> horrible half-created things, never meant to roam the earth or even to
> >> exist in a sane mind. They could tell us nothing about who they were, what
> >> book they had once tried to describe, they could only stare blankly and
> >> repeat in mangled agony "required field!" "required field!" "required
> >> field!" over and over…
> >>
> >> It took us weeks to put them all out of their misery.
> >>
> >> This is the first time I've ever spoken of this publicly. The support
> >> group is helping with the nightmares, but sometimes still, I wake in a cold
> >> sweat, wondering… did we really find them all?????
> >>
> >>
> >> On Jan 27, 2012, at 4:28 PM, Ethan Gruber wrote:
> >>
> >>  EDIT ME!!!!
> >>>
> >>> http://ead.lib.virginia.edu/vivaxtf/view?docId=uva-sc/viu00888.xml;query=;brand=default#adminlink
> >>>
> >>> On Fri, Jan 27, 2012 at 6:26 PM, Roy Tennant <[log in to unmask]> wrote:
> >>>
> >>>  Oh, I should have also mentioned that some of the worst problems occur
> >>>> when people treat their metadata like it will never leave their
> >>>> institution. When that happens you get all kinds of crazy cruft in a
> >>>> record. For example, just off the top of my head:
> >>>>
> >>>> * Embedded HTML markup (one of my favorites is an <img> tag)
> >>>> * URLs to remote resources that are hard-coded to go through a
> >>>> particular institution's proxy
> >>>> * Notes that only have meaning for that institution
> >>>> * Text that is meant to display to the end-user but may only do so in
> >>>> certain systems; e.g., "Click here" in a particular subfield.
> >>>>
> >>>> Sigh...
> >>>> Roy
> >>>>
> >>>> On Fri, Jan 27, 2012 at 4:17 PM, Roy Tennant <[log in to unmask]> wrote:
> >>>>
> >>>>> Thanks a lot for the kind shout-out Leslie. I have been pondering what
> >>>>> I might propose to discuss at this event, since there is certainly
> >>>>> plenty of fodder. Recently we (OCLC Research) did an investigation of
> >>>>> 856 fields in WorldCat (some 40 million of them) and that might prove
> >>>>> interesting. By the time ALA rolls around there may be something else
> >>>>> entirely I could talk about.
> >>>>>
> >>>>> That's one of the wonderful things about having 250 million MARC
> >>>>> records sitting out on a 32-node cluster. There are any number of
> >>>>> potentially interesting investigations one could do.
> >>>>> Roy
> >>>>>
> >>>>> On Thu, Jan 26, 2012 at 2:10 PM, Johnston, Leslie <[log in to unmask]> wrote:
> >>>>>
> >>>>>> Roy's fabulous "Bitter Harvest" paper:
> >>>>>> http://roytennant.com/bitter_harvest.html
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> >>>>>> Of Walter Lewis
> >>>>> Sent: Wednesday, January 25, 2012 1:38 PM
> >>>>>> To: [log in to unmask]
> >>>>>> Subject: Re: [CODE4LIB] Metadata war stories...
> >>>>>>
> >>>>>> On 2012-01-25, at 10:06 AM, Becky Yoose wrote:
> >>>>>>
> >>>>>>  - Dirty data issues when switching discovery layers or using
> >>>>>>> legacy/vendor metadata (ex. HathiTrust)
> >>>>>>>
> >>>>>>
> >>>>>> I have a sharp recollection of a slide in a presentation Roy Tennant
> >>>>>> offered up at Access (at Halifax, maybe), where he offered up a range of
> >>>>>> dates extracted from an array of OAI harvested records.  The good, the bad,
> >>>>>> the incomprehensible, the useless-without-context (01/02/03 anyone?) and on
> >>>>>> and on.  In my years of migrating data, I've seen most of those variants
> >>>>>> (except ones *intended* to be BCE).
> >>>>
> >>>>>
> >>>>>> Then there are the fielded data sets without authority control.  My
> >>>>>> favourite example comes from staff who nominally worked for me, so I'm not
> >>>>>> telling tales out of school.  The classic Dynix product had a Newspaper
> >>>>>> index module that we used before migrating it (PICK migrations; such a
> >>>>>> joy).  One title had twenty variations on "Georgetown Independent" (I wish
> >>>>>> I was kidding) and the dates ranged from the early ninth century until
> >>>>>> nearly the 3rd millennium. (apparently there hasn't been much change in
> >>>>>> local council over the centuries).
> >>>>
> >>>>>
> >>>>>> I've come to the point where I hand-walk the spatial metadata to links
> >>>>>> to geonames.org for the linked open data. Never had to do it for a
> >>>>>> set with more than 40,000 entries though.  The good news is that it isn't
> >>>>>> hard to establish a valid additional entry when one is required.
> >>>>
> >>>>>
> >>>>>> Walter
> >>>>>>
> >>>>>
> >>>>
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library