LISTSERV 16.5 - CODE4LIB Archives

This is definitely a pretty typical experience, alas.

Despite all of the library community's voiced obsession with doing 
things 'by the book' according to standards, anyone who has actually 
tried to work with an actually existing large corpus of MARC data 
finds that it is all over the place, and non-compliant in many ways.

One of the most annoying things to deal with is that encoding issue. A 
US-MARC/MARC21 record can be in MARC-8 encoding OR in UTF-8, and there 
is a fixed position in the leader (position 09) that declares which 
encoding is used. But in actually existing MARC records, it is not 
uncommon for a record to declare itself as being in one encoding while 
actually being in the other. This makes MARC records very difficult to 
deal with.
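
To make the mismatch concrete, here is a rough sketch in Python (standard 
library only) of how one might compare what a record declares with what 
its bytes look like. The function names are made up for illustration, and 
the check is only a heuristic: it assumes each record is available as raw 
bytes with its 24-byte leader, treats leader/09 = "a" as UTF-8 and blank 
as MARC-8, and it cannot prove a record really is valid MARC-8.

```python
# Sketch only: compare the encoding a MARC record declares in its leader
# with what the record's bytes look like. All names here are hypothetical.

LEADER_LENGTH = 24
CODING_SCHEME_POS = 9  # leader/09: blank = MARC-8, "a" = UCS/Unicode


def declared_encoding(record: bytes) -> str:
    """Return 'utf-8' or 'marc-8' according to leader position 09."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC leader")
    flag = record[CODING_SCHEME_POS:CODING_SCHEME_POS + 1]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    raise ValueError(f"unexpected leader/09 value: {flag!r}")


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def encoding_mismatch(record: bytes) -> bool:
    """Heuristic: does the declared encoding disagree with the bytes?"""
    valid = is_valid_utf8(record)
    if declared_encoding(record) == "utf-8":
        # Declared UTF-8 but not decodable as UTF-8: almost certainly wrong.
        return not valid
    # Declared MARC-8: non-ASCII bytes that happen to form valid UTF-8
    # suggest the record is really UTF-8. Pure ASCII is fine either way.
    has_high_bytes = any(b > 0x7F for b in record)
    return has_high_bytes and valid
```

A record flagged "a" that contains the MARC-8 bytes for an umlauted word 
(the combining diacritic 0xE8 followed by a plain letter) fails the UTF-8 
check, so it is reported as a mismatch.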

Jonathan

Eric Lease Morgan wrote:
> On 1/23/09 4:39 AM, "Brown, Alan" wrote:
>
>   
>>> Does anybody here know the difference between MARC21 and USMARC?
>>>
>>> I am munging sets of MARC bibliographic data from a III catalog with
>>> holdings data from the same. I am using MARC::Batch to read my bib'
>>> data (with both strict and warnings turned off), insert 853 and 863
>>> fields, and writing the data using the as_usmarc method. Therefore, I
>>> think I am creating USMARC files. I can then use marcdump to... dump
>>> the records. It returns 0 errors.
>>>       
>> Eric, This isn't an encoding thing is it? I know that a number of III
>> catalogues still encode their diacritics using the MARC8 version of
>> USMARC. We have changed ours to Unicode now, but we did have an issue of
>> the catalogue outputting unicode records that weren't tagged as such in
>> the leader and so couldn't be identified as proper MARC21 (current
>> version of USMARC). III have solved this with their latest release. This
>> issue had me scratching my head with a lot of my MARC::Record scripts,
>> but generally they failed quite spectacularly.
>>     
>
>
> Actually, I believe I am suffering from a number of different types of
> errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
> syntactical errors (lack of periods, invalid choices of indicators, etc.),
> 3) incorrect data types (strings entered into fields denoted for integers,
> etc.) Just about the only thing I haven't encountered are structural errors
> such as invalid leader, and this doesn't even take into account possible
> data entry errors (author is Franklin when Twain was entered).
>
> Yes, I do have an encoding issue. All of my incoming records are in MARC8.
> I'm not sure, but I think the Primo tool expects UTF-8. I can easily update
> the encoding bit (change leader position 09 from blank to a), but this does
> not change any actual encoding in the bibliographic section of my data.
> Consequently, after updating the encoding bit and looping through my munged
> data, MARC::Record chokes with the following error on records where UTF-8 is
> denoted but which include MARC8 characters:
>
>   utf8 "\xE8" does not map to Unicode at
>   /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
>
> Upon looking at the raw MARC I see that the offending record includes the word
> Münich. What can I do to transform MARC8 data into UTF-8? What can I do to
> trap the error above, and skip these invalid records?
>
>   
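
On the "trap the error and skip" part of the question, the general shape 
is to attempt a strict decode per record and catch the failure rather than 
let it kill the whole run. Below is a minimal sketch in Python (standard 
library only), not the MARC::Record way of doing it. The names are 
hypothetical, and splitting on the record terminator byte (0x1D) is a 
simplification; a real reader would use the record length in the leader. 
It does not attempt the actual MARC-8 to UTF-8 conversion, which needs a 
character mapping table.

```python
# Sketch only: keep records that decode as strict UTF-8, skip the rest.
# Names are hypothetical; splitting on 0x1D is a simplification.
from typing import Iterator, List, Tuple

RECORD_TERMINATOR = b"\x1d"


def split_records(data: bytes) -> Iterator[bytes]:
    """Yield each non-empty record, with its terminator restored."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk:
            yield chunk + RECORD_TERMINATOR


def partition_by_utf8(data: bytes) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Return (decoded records, [(record index, error message), ...])."""
    good: List[str] = []
    skipped: List[Tuple[int, str]] = []
    for index, raw in enumerate(split_records(data)):
        try:
            good.append(raw.decode("utf-8", errors="strict"))
        except UnicodeDecodeError as err:
            # Trap the failure, remember which record it was, and move on.
            skipped.append((index, str(err)))
    return good, skipped
```

The skipped list gives you the positions of the records that still need a 
real MARC-8 conversion, so they can be handled separately instead of being 
lost.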

-- 
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886 
rochkind (at) jhu.edu