I'd be interested in collecting anyone's enum-chron parsers they're willing to make available; there's got to be a lot of duplicated effort out there, and between everyone, there's gotta be *something* that'll work halfway-decently. Anyone? Anyone? Direct to me at [log in to unmask]; I'll post back to the list when I get something worth posting.

On Sat, Jan 28, 2012 at 12:22 PM, David Fiander <[log in to unmask]> wrote:

> Stephen, regarding the question of ambiguity about chronology vs enumeration, this is what I did with my parser:
>
> # If items are identified by chronology only, with no separate
> # enumeration (eg, a newspaper issue), then the chronology is
> # recorded in the enumeration subfields $a - $f. We can tell
> # that this is the case if there are $a - $f subfields and no
> # chronology subfields ($i-$k), and none of the $a-$f subfields
> # have associated $u or $v subfields, but there's a $w and no $x
>
> So, if there are ONLY enumeration fields, and none of the enumeration fields have corresponding frequency or continuity indicators, AND there's a publication frequency but no indication of when in the calendar the highest level of enumeration changes, THEN the enumerations are really chronology.
>
> Of course, this will still get certain patterns wrong, but it's the best one can do.
>
> On Sat, Jan 28, 2012 at 11:37, Stephen Meyer <[log in to unmask]> wrote:
>
> > War is hell, right? Lately we have been dealing with a particular combination of two circles of the metadata Inferno: the first (limbo) and sixth (heresy):
> >
> > The limbo I'll define as a poorly designed metadata spec: the MARC holdings standard.
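For anyone looking for a concrete starting point, here's a rough Python sketch of the heuristic David describes above. It's my own rendering, not his parser: subfield handling is simplified to a list of (code, value) pairs, and the function name is made up.

```python
def enumeration_is_really_chronology(subfields):
    """Heuristic per David's description: an 863 field's $a-$f values are
    really chronology if there are enumeration subfields ($a-$f) but no
    chronology subfields ($i-$k), none of the caption-linkage subfields
    $u (units per next level) or $v (numbering continuity) are present,
    and there's a $w (frequency) but no $x (calendar change)."""
    codes = {code for code, _ in subfields}
    has_enum = bool(codes & set("abcdef"))
    has_chron = bool(codes & set("ijk"))
    no_linkage = not (codes & {"u", "v"})
    return (has_enum and not has_chron and no_linkage
            and "w" in codes and "x" not in codes)
```

As David says, this will still get certain patterns wrong; it's a best-effort guess, not a guarantee.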
> > The poor design in question is the ambiguity of enumeration/chronology subfield assignment, specifically this rule:
> >
> >     When only chronology is used on an item (that is, the item
> >     carries no enumeration), the chronology is contained in the
> >     relevant enumeration subfield ($a-$h) instead of the chronology
> >     subfields ($i-$m).
> >     http://www.loc.gov/marc/holdings/hd863865.html
> >
> > This means that as a programmer trying to parse enumeration and chronology data from our holdings data *that uses a standard*, I cannot reliably know that a subfield which has been defined as containing "First level of enumeration" will in fact contain enumeration rather than chronology. What's a programmer to do? Limbo, limbo.
> >
> > Others in this thread have already described the common heresy involved in MARC cataloging: embedding data in a record intended for a single institution, or worse, a specific OPAC.
> >
> > Due to the ambiguity in the spec and the desire to just make it look the way I want it to look in my OPAC, the temptation is simply too great. In the end, we have data that couldn't possibly meet the standard as it is described, which means we spend more time than we expected parsing it in the next system.
> >
> > In our case we work through these issues with an army of code tests. Our catalogers and reference staff find broken examples of MARC holdings data parsing in our newest discovery system, we gather the real-world MARC records as a test data set, and then we write a bunch of RSpec tests so we don't undo previous bug fixes as we deal with the current ones. The challenge is coming up with a fast and responsive mechanism/process for adding a record to the test set once identified.
> >
> > -Steve
> >
> > Bess Sadler wrote, On 1/27/12 8:26 PM:
> >
> >> I remember the "required field" operation of... aught six? aught seven?
> >> It all runs together at my age. Turns out, for years people had been making shell catalog records for items in the collection that needed to be checked out but hadn't yet been barcoded. Some percentage of these people opted not to record any information about the item other than the barcode it left the building under, presumably because they were "in a hurry". If there was such a thing as a metadata crime, that'd be it.
> >>
> >> We were young and naive; we thought "why not just index all our catalog records into solr?" Little did we know what unholy abominations we would uncover. Out of nowhere, we were surrounded by zombie marc records, horrible half-created things, never meant to roam the earth or even to exist in a sane mind. They could tell us nothing about who they were, what book they had once tried to describe; they could only stare blankly and repeat in mangled agony "required field!" "required field!" "required field!" over and over…
> >>
> >> It took us weeks to put them all out of their misery.
> >>
> >> This is the first time I've ever spoken of this publicly. The support group is helping with the nightmares, but sometimes still, I wake in a cold sweat, wondering… did we really find them all?????
> >>
> >> On Jan 27, 2012, at 4:28 PM, Ethan Gruber wrote:
> >>
> >>> EDIT ME!!!!
> >>>
> >>> http://ead.lib.virginia.edu/vivaxtf/view?docId=uva-sc/viu00888.xml;query=;brand=default#adminlink
> >>>
> >>> On Fri, Jan 27, 2012 at 6:26 PM, Roy Tennant <[log in to unmask]> wrote:
> >>>
> >>>> Oh, I should have also mentioned that some of the worst problems occur when people treat their metadata like it will never leave their institution. When that happens you get all kinds of crazy cruft in a record.
> >>>> For example, just off the top of my head:
> >>>>
> >>>> * Embedded HTML markup (one of my favorites is an <img> tag)
> >>>> * URLs to remote resources that are hard-coded to go through a particular institution's proxy
> >>>> * Notes that only have meaning for that institution
> >>>> * Text that is meant to display to the end-user but may only do so in certain systems; e.g., "Click here" in a particular subfield.
> >>>>
> >>>> Sigh...
> >>>> Roy
> >>>>
> >>>> On Fri, Jan 27, 2012 at 4:17 PM, Roy Tennant <[log in to unmask]> wrote:
> >>>>
> >>>>> Thanks a lot for the kind shout-out, Leslie. I have been pondering what I might propose to discuss at this event, since there is certainly plenty of fodder. Recently we (OCLC Research) did an investigation of 856 fields in WorldCat (some 40 million of them) and that might prove interesting. By the time ALA rolls around there may be something else entirely I could talk about.
> >>>>>
> >>>>> That's one of the wonderful things about having 250 million MARC records sitting out on a 32-node cluster. There are any number of potentially interesting investigations one could do.
> >>>>> Roy
> >>>>>
> >>>>> On Thu, Jan 26, 2012 at 2:10 PM, Johnston, Leslie <[log in to unmask]> wrote:
> >>>>>
> >>>>>> Roy's fabulous "Bitter Harvest" paper:
> >>>>>> http://roytennant.com/bitter_harvest.html
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Walter Lewis
> >>>>>> Sent: Wednesday, January 25, 2012 1:38 PM
> >>>>>> To: [log in to unmask]
> >>>>>> Subject: Re: [CODE4LIB] Metadata war stories...
> >>>>>> On 2012-01-25, at 10:06 AM, Becky Yoose wrote:
> >>>>>>
> >>>>>>> - Dirty data issues when switching discovery layers or using legacy/vendor metadata (ex. HathiTrust)
> >>>>>>
> >>>>>> I have a sharp recollection of a slide in a presentation Roy Tennant offered up at Access (at Halifax, maybe), where he offered up a range of dates extracted from an array of OAI harvested records. The good, the bad, the incomprehensible, the useless-without-context (01/02/03, anyone?) and on and on. In my years of migrating data, I've seen most of those variants (except ones *intended* to be BCE).
> >>>>>>
> >>>>>> Then there are the fielded data sets without authority control. My favourite example comes from staff who nominally worked for me, so I'm not telling tales out of school. The classic Dynix product had a Newspaper index module that we used before migrating it (PICK migrations; such a joy). One title had twenty variations on "Georgetown Independent" (I wish I was kidding) and the dates ranged from the early ninth century until nearly the 3rd millennium. (Apparently there hasn't been much change in local council over the centuries.)
> >>>>>>
> >>>>>> I've come to the point where I hand-walk the spatial metadata to links to geonames.org for the linked open data. Never had to do it for a set with more than 40,000 entries, though. The good news is that it isn't hard to establish a valid additional entry when one is required.
> >>>>>>
> >>>>>> Walter

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library
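P.S. Walter's "01/02/03, anyone?" is easy to make concrete. A quick Python illustration (stdlib only; the convention labels are mine) of one string parsing to three different, equally plausible dates:

```python
from datetime import datetime

raw = "01/02/03"
# Three common conventions for a two-digit-everything date string.
candidates = {
    "US (m/d/y)": "%m/%d/%y",
    "EU (d/m/y)": "%d/%m/%y",
    "y/m/d":      "%y/%m/%d",
}
parsed = {label: datetime.strptime(raw, fmt).date()
          for label, fmt in candidates.items()}
for label, d in parsed.items():
    print(f"{label}: {d.isoformat()}")
```

All three parses succeed, and all three disagree; without context from the harvested record there's no way to pick a winner.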