LISTSERV 16.5 - CODE4LIB Archives

On Jan 23, 2009, at 5:52 AM, Eric Lease Morgan wrote:

> On 1/23/09 4:39 AM, "Brown, Alan" <[log in to unmask]> wrote:
>
>>> Does anybody here know the difference between MARC21 and USMARC?
>>>
>>> I am munging sets of MARC bibliographic data from a III catalog with
>>> holdings data from the same. I am using MARC::Batch to read my bib'
>>> data (with both strict and warnings turned off), insert 853 and 863
>>> fields, and write the data using the as_usmarc method. Therefore, I
>>> think I am creating USMARC files. I can then use marcdump to... dump
>>> the records. It returns 0 errors.
>>
>> Eric, This isn't an encoding thing is it? I know that a number of III
>> catalogues still encode their diacritics using the MARC8 version of
>> USMARC. We have changed ours to Unicode now, but we did have an issue of
>> the catalogue outputting unicode records that weren't tagged as such in
>> the leader and so couldn't be identified as proper MARC21 (current
>> version of USMARC). III have solved this with their latest release. This
>> issue had me scratching my head with a lot of my MARC::Record scripts,
>> but generally they failed quite spectacularly.
>
>
> Actually, I believe I am suffering from a number of different types of
> errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
> syntactical errors (lack of periods, invalid choices of indicators, etc.),
> 3) incorrect data types (strings entered into fields denoted for integers,
> etc.) Just about the only thing I haven't encountered are structural errors
> such as invalid leader, and this doesn't even take into account possible
> data entry errors (author is Franklin when Twain was entered).
>
> Yes, I do have an encoding issue. All of my incoming records are in MARC8.
> I'm not sure, but I think the Primo tool expects UTF-8. I can easily update
> the encoding bit (change leader position 09 from blank to a), but this does
> not change any actual encoding in the bibliographic section of my data.
> Consequently, after updating the encoding bit and looping through my munged
> data, MARC::Record chokes with the following error on records that are
> flagged as UTF-8 but still contain MARC8 characters:
>
>  utf8 "\xE8" does not map to Unicode at
>  /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
>
> Upon looking at the raw MARC, I see the offending record includes the word
> Münich. What can I do to transform MARC8 data into UTF-8? What can I do to
> trap the error above, and skip these invalid records?


We've had good luck with the yaz-marcdump utility that's included with
the YAZ toolkit. We're using it to convert our exported Horizon
records from MARC8 to UTF-8 before we import into AquaBrowser. The
tool is easy to compile, blindingly fast, forgiving of common MARC
errors, and changes the coding correctly. It's been serving us well.

-Tod

Tod Olson <[log in to unmask]>
Systems Librarian
University of Chicago Library
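
On the second question in the quoted message (trapping and skipping
records that are flagged as UTF-8 but still hold MARC8 bytes), here is a
minimal sketch in Python, standard library only. It is not from the
thread and is not how MARC::Record or yaz-marcdump work internally. It
assumes the input is raw ISO 2709 MARC, where each record ends with the
byte 0x1D and leader position 09 is "a" for Unicode records. The function
names are made up for illustration. It only sorts records into usable and
mislabeled; it does not convert MARC8 to UTF-8.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_records(raw: bytes):
    """Yield each raw record, with its terminator put back on the end."""
    for chunk in raw.split(RECORD_TERMINATOR):
        if chunk:
            yield chunk + RECORD_TERMINATOR


def claims_utf8(record: bytes) -> bool:
    """True when leader position 09 is 'a' (record declared as Unicode)."""
    return record[9:10] == b"a"


def is_valid_utf8(record: bytes) -> bool:
    """True when every byte sequence in the record decodes as UTF-8."""
    try:
        record.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def partition(raw: bytes):
    """Return (good, bad); 'bad' records claim UTF-8 but fail to decode."""
    good, bad = [], []
    for record in split_records(raw):
        if claims_utf8(record) and not is_valid_utf8(record):
            bad.append(record)
        else:
            good.append(record)
    return good, bad
```

A call such as `good, bad = partition(open("munged.mrc", "rb").read())`
(the file name is a placeholder) leaves the mislabeled records in `bad`,
where they can be logged or sent through a real MARC8 converter rather
than crashing the loop.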