Actually, I'd disagree, because that is a very narrow view of the specification. When validating MARC, I'd validate structure first (which then lets you read any MARC format), and use a separate process to validate the content of fields, which in my opinion is more open to interpretation depending on how a system uses the data. For example, leader positions 22 and 23 are undefined values that local systems may very well have a practical need to define and use, given that there are only so many values in the leader. This is why I sometimes see additional values in leader position 09 (which should be a or blank) to define different character set types, or additional elements added to other fields. If I want to validate the content of those fields, I'd do it through a different process, kept separate from the validation of the structure, because the two are not exclusive.
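
To make the split concrete, here is a minimal Python sketch of the two passes. It is only an illustration, not code from any real tool: the function names are made up, the structural pass assumes the usual 12-byte directory entries, and the content pass checks only the two leader areas mentioned above.

```python
from typing import List

FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def structure_errors(record: bytes) -> List[str]:
    """Structural pass: can the record be walked at all?"""
    if len(record) < 25:
        return ["record shorter than leader plus terminator"]
    errors = []
    leader = record[:24]
    if not leader[0:5].isdigit() or int(leader[0:5]) != len(record):
        errors.append("leader 00-04 does not match record length")
    if not leader[12:17].isdigit():
        errors.append("leader 12-16 (base address) is not numeric")
        return errors
    base = int(leader[12:17])
    if base < 25 or base > len(record) or record[base - 1:base] != FT:
        errors.append("directory is not closed by a field terminator")
    elif len(record[24:base - 1]) % 12 != 0:
        # Assumption: 12-byte entries (3 tag + 4 length + 5 start)
        errors.append("directory is not a multiple of 12 bytes")
    if not record.endswith(RT):
        errors.append("record terminator missing")
    return errors


def content_warnings(record: bytes) -> List[str]:
    """Content pass: local policy, kept apart from the structural pass."""
    leader = record[:24].decode("ascii", errors="replace")
    warnings = []
    if leader[9] not in " a":
        warnings.append("leader 09 is %r, expected blank or 'a'" % leader[9])
    if leader[20:24] != "4500":
        warnings.append("leader 20-23 is %r, expected '4500'" % leader[20:24])
    return warnings
```

A record with "45x0" in leader 20-23 passes the first function and only draws a warning from the second, which is the distinction I'm arguing for.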
--TR
> -----Original Message-----
> From: Jonathan Rochkind [mailto:[log in to unmask]]
> Sent: Wednesday, April 06, 2011 9:59 AM
> To: Code for Libraries
> Cc: Reese, Terry
> Subject: Re: [CODE4LIB] MARC magic for file
>
> I'm not sure what you mean Terry. Maybe we have different understandings
> of "valid".
>
> If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not a
> "valid" Marc21 file. It violates the Marc21 specification.
>
> Now, they may still be _usable_, by software that ignores these bytes
> anyway or works around them. We definitely have a lot of software that
> does that.
>
> Which can end up causing problems that remind me of very analogous
> problems caused by the early days of web browsers that felt like being
> 'tolerant' of bad data. "My html works in every web browser BUT this one,
> why not? Oh, because that's the only one that actually followed the
> standard, oops."
>
> I actually ran into an example of that problem with this exact issue.
> MOST software just ignores marc leader bytes 20-23, and assumes the
> semantics of "4500"---the only legal semantics for Marc21. But Marc4j
> actually _respected_ them, apparently the author thought that some marc in
> the wild might intentionally set different bytes here (no idea if that's true or
> not). So if the leader bytes 20-23 were "invalid"
> (according to the spec), Marc4j would suddenly decide that the "length of
> field portion" was NOT 4, but actually BELIEVE whatever was in leader byte
> 20, causing the record to be parsed improperly. And I had records like that
> coming out of my ILS (not even a vendor database). That was an unfun
> couple days of debugging to figure out what was going on.
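
For anyone following along, the failure described above can be reproduced with a small Python sketch. This is not Marc4j's code, just an assumed illustration of a directory reader that either trusts leader 20-21 or hard-codes the MARC 21 widths; the function name and the trust_entry_map flag are made up, and the implementation-defined portion (leader 22) is ignored.

```python
from typing import List, Tuple


def directory_entries(record: bytes, trust_entry_map: bool) -> List[Tuple[str, int, int]]:
    """Return (tag, field_length, start_offset) for each directory entry."""
    leader = record[:24]
    base = int(leader[12:17])  # base address of data
    if trust_entry_map and leader[20:22].isdigit():
        len_width = int(leader[20:21])    # width of length-of-field portion
        start_width = int(leader[21:22])  # width of starting-position portion
    else:
        len_width, start_width = 4, 5     # the fixed MARC 21 values ("45")
    entry_size = 3 + len_width + start_width
    directory = record[24:base - 1]       # excludes the field terminator
    entries = []
    for i in range(0, len(directory) - entry_size + 1, entry_size):
        entry = directory[i:i + entry_size].decode("ascii")
        tag = entry[:3]
        length = int(entry[3:3 + len_width])
        start = int(entry[3 + len_width:])
        entries.append((tag, length, start))
    return entries
```

With a leader that says "3500" instead of "4500", the trusting reader slices every entry one byte short, so field lengths and offsets come out wrong even though the directory itself is fine.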
>
> On 4/6/2011 12:52 PM, Reese, Terry wrote:
> > Actually, you can have records that are MARC21 coming out of vendor
> > databases (who sometimes embed control characters into the leader) and still
> > be valid. Once you stop looking at just your ILS or OCLC, you probably
> > wouldn't be surprised to know that records start looking very different.
> >
> > --TR
> >
> >
> > ********************************
> > Terry Reese, Associate Professor
> > Gray Family Chair
> > for Innovative Library Services
> > 121 Valley Libraries
> > Corvallis, Or 97331
> > tel: 541.737.6384
> > ********************************
> >
> >
> >
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> >> Of Jonathan Rochkind
> >> Sent: Wednesday, April 06, 2011 9:44 AM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] MARC magic for file
> >>
> >> Can't you have a legal "MARC" file that does NOT have 4500 in those
> >> leader positions? It's just not legal "Marc21", right? Other marc
> >> formats may specify or even allow flexibility in the things these
> >> bytes specify:
> >>
> >> * Length of the length-of-field portion
> >> * Number of characters in the starting-character-position portion of
> >> a Directory entry
> >> * Number of characters in the implementation-defined portion of a
> >> Directory entry
> >>
> >> Or, um, 23, which I guess is left to the specific Marc implementation
> >> (ie, Marc21 is one such) to use for its own purposes.
> >>
> >> I have no idea how that should inform the 'marc magic'.
> >>
> >> Is mime-type application/marc defined as specifically Marc21, or as
> >> any Marc?
> >>
> >> Jonathan
> >>
> >> On 4/6/2011 12:28 PM, Ford, Kevin wrote:
> >>> Well, this brings us right up against the issue of files that adhere
> >>> to their specifications versus forgiving applications. Think of
> >>> browsers and HTML. Suffice it to say, MARC applications are quite
> >>> likely to be forgiving of leader positions 20-23. In my non-conforming
> >>> MARC file and in Bill's, the leader positions 20-21 ("45") seemed
> >>> constant, but things could fall apart for positions 22-23. So...
> >>>
> >>> I present the following (in-line and attached, to preserve tabs) in
> >>> an attempt to straddle the two sides of this issue: applications
> >>> forgiving of non-conforming files. Should the two characters
> >>> following 45 (at position 20) *not* be 00, then the identification
> >>> will be noted as "non-conforming." We could classify this as
> >>> reasonable identification but hardly ironclad (indeed, simply checking
> >>> to confirm that part of the first 24 positions matches the
> >>> specification hardly constitutes a robust identification, but it's
> >>> something).
> >>>
> >>> It will also give you a mimetype now.
> >>>
> >>> Would anyone like to test it out more fully on their own files?
> >>>
> >>>
> >>> #--------------------------------------------
> >>> # MARC 21 Magic (Third cut)
> >>>
> >>> # Set at position 0
> >>> 0 byte x
> >>>
> >>> # leader position 20-21 must be 45
> >>> >20 string 45
> >>> # leader starts with 5 digits, followed by codes specific to MARC format
> >>> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
> >>> !:mime application/marc
> >>> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> >>> !:mime application/marc
> >>> >>0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
> >>> !:mime application/marc
> >>> >>0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
> >>> !:mime application/marc
> >>> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
> >>> !:mime application/marc
> >>>
> >>> # leader position 22-23, should be "00" but is it?
> >>> >>0 regex/1 (^.{21})([^0]{2}) (non-conforming)
> >>> !:mime application/marc
> >>>
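
For testing outside of file(1), the same decisions can be sketched in Python. This is a rough equivalent under stated assumptions, not a port of the magic engine: the identify function and FORMATS table are names invented here, the leader 05/06 character classes are copied from the rules above, and the non-conforming test compares leader 22-23 with "00" directly (the intent stated in the message) rather than reproducing the regex.

```python
import re
from typing import Optional

# Leader 05 (record status) and 06 (type of record) classes, copied from
# the magic rules quoted above.
FORMATS = [
    (re.compile(rb"[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"[0-9]{5}[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"[0-9]{5}[acdn]w"), "MARC Classification"),
    (re.compile(rb"[0-9]{5}[cdn]q"), "MARC Community"),
]


def identify(head: bytes) -> Optional[str]:
    """Classify a file from its first 24 bytes, or return None."""
    leader = head[:24]
    if len(leader) < 24 or leader[20:22] != b"45":
        return None  # leader 20-21 must be "45"
    for pattern, label in FORMATS:
        if pattern.match(leader):
            if leader[22:24] != b"00":
                label += " (non-conforming)"
            return label
    return None
```

Feeding it the first 24 bytes of each file (open(path, "rb").read(24)) should give a quick cross-check against what file -m reports.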
> >>>
> >>> If this works, I'll see about submitting this copy. Thanks for all
> >>> your efforts already.
> >>> Warmly,
> >>>
> >>> Kevin
> >>>
> >>> --
> >>> Library of Congress
> >>> Network Development and MARC Standards Office
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ________________________________________
> >>> From: Code for Libraries [[log in to unmask]] On Behalf Of
> >>> Simon Spero [[log in to unmask]]
> >>> Sent: Sunday, April 03, 2011 14:01
> >>> To: [log in to unmask]
> >>> Subject: Re: [CODE4LIB] MARC magic for file
> >>>
> >>> I am pretty sure that the marc4j standard reader ignores them; the
> >>> tolerant reader definitely does. Otherwise JHU might have about two
> >>> parseable records based on the mangled leaders that J-Rock gets
> >>> stuck with :-)
> >>>
> >>> An analysis of the ~7M LC bib records from the scriblio.net data
> >>> files (~ Dec 2006) indicated that leader has less than 8 bits of
> >>> information in it (shannon-weaver definition). This excludes the
> >>> initial length value, which is redundant given the end of record marker.
> >>>
> >>>
> >>> The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC
> >>> leader.
> >>> The final characters of the leader are "450".
> >>>
> >>> Also, I object to the phrase "decent MARC tool". Any tool capable
> >>> of dealing with MARC as it exists cannot afford the luxury of
> >>> decency :-)
> >>>
> >>> [ HA: "A clear conscience?"
> >>> BW: "Yes, Sir Humphrey."
> >>> HA: "When did you acquire this taste for luxuries?"]
> >>>
> >>> Simon
> >>>
> >>> On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens <[log in to unmask]> wrote:
> >>>> "I'm sure any decent MARC tool can deal with them, since decent
> >>>> MARC tools are certainly going to be forgiving enough to deal with
> >>>> four characters that apparently don't even really matter."
> >>>>
> >>>> You say that, but I'm pretty sure Marc4J throws errors on MARC
> >>>> records where these characters are incorrect
> >>>>
> >>>> Owen
> >>>>
> >>>> On Fri, Apr 1, 2011 at 3:51 AM, William Denton <[log in to unmask]> wrote:
> >>>>
> >>>>> On 28 March 2011, Ford, Kevin wrote:
> >>>>>
> >>>>>> I couldn't get Simon's MARC 21 Magic file to work. Among other
> >>>>>> issues, I received "line too long" errors. But, since I've been
> >>>>>> curious about this for some time, I figured I'd take a whack at it
> >>>>>> myself. Try this:
> >>>>>>
> >>>>> This is very nice! Thanks. I tried it on a bunch of MARC files I
> >>>>> have, and it recognized almost all of them. A few it didn't, so I
> >>>>> had a closer look, and they're invalid.
> >>>>>
> >>>>> For example, the Internet Archive's Binghamton catalogue dump:
> >>>>>
> >>>>> http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
> >>>>>
> >>>>> $ file -m marc.magic bgm*mrc
> >>>>> bgm_openlib_final_0-5.mrc: data
> >>>>> bgm_openlib_final_10-15.mrc: MARC Bibliographic
> >>>>> bgm_openlib_final_15-18.mrc: data
> >>>>> bgm_openlib_final_5-10.mrc: MARC Bibliographic
> >>>>>
> >>>>> But why? Aha:
> >>>>>
> >>>>> $ head -c 25 bgm_openlib_final_*mrc
> >>>>> ==> bgm_openlib_final_0-5.mrc<==
> >>>>> 01812cas 2200457 45x00
> >>>>> ==> bgm_openlib_final_10-15.mrc<==
> >>>>> 01008nam 2200289ua 45000
> >>>>> ==> bgm_openlib_final_15-18.mrc<==
> >>>>> 01614cam 00385 45 0
> >>>>> ==> bgm_openlib_final_5-10.mrc<==
> >>>>> 00887nam 2200265v 45000
> >>>>>
> >>>>> As you say, the leader should end with 4500 (as defined at
> >>>>> http://www.loc.gov/marc/authority/adleader.html) but two of those
> >>>>> files don't. So they're not valid MARC. I'm sure any decent MARC
> >>>>> tool can deal with them, since decent MARC tools are certainly going
> >>>>> to be forgiving enough to deal with four characters that apparently
> >>>>> don't even really matter.
> >>>>>
> >>>>> So on the one hand they're usable MARC but file wouldn't say so,
> >>>>> and on the other that's a good indication that the files have failed
> >>>>> a basic validity test. I wonder if there are similar situations for
> >>>>> JPEGs or MP3s.
> >>>>>
> >>>>> I think you should definitely submit this for inclusion in the
> >>>>> magic file. It would be very useful for us all!
> >>>>>
> >>>>> Bill
> >>>>>
> >>>>> P.S. I'd never used head -c (to show a fixed number of bytes) before.
> >>>>> Always nice to find a new useful option to an old command.
> >>>>>
> >>>>>
> >>>>>> #--------------------------------------------
> >>>>>> # MARC 21 Magic (Second cut)
> >>>>>>
> >>>>>> # Set at position 0
> >>>>>> 0 short >0x0000
> >>>>>>
> >>>>>> # leader ends with 4500
> >>>>>> >20 string 4500
> >>>>>>
> >>>>>> # leader starts with 5 digits, followed by codes specific to MARC format
> >>>>>> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
> >>>>>> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> >>>>>> >>0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
> >>>>>> >>0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
> >>>>>> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
> >>>>>>
> >>>>> --
> >>>>> William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
> >>>>>
> >>>>
> >>>> --
> >>>> Owen Stephens
> >>>> Owen Stephens Consulting
> >>>> Web: http://www.ostephens.com
> >>>> Email: [log in to unmask]
> >>>>