LISTSERV 16.5 - CODE4LIB Archives

On 1/23/09 4:39 AM, "Brown, Alan" <[log in to unmask]> wrote:

>> Does anybody here know the difference between MARC21 and USMARC?
>> 
>> I am munging sets of MARC bibliographic data from a III catalog with
>> holdings data from the same. I am using MARC::Batch to read my bib'
>> data (with both strict and warnings turned off), insert 853 and 863
>> fields, and writing the data using the as_usmarc method. Therefore, I
>> think I am creating USMARC files. I can then use marcdump to... dump
>> the records. It returns 0 errors.
> 
> Eric, This isn't an encoding thing is it? I know that a number of III
> catalogues still encode their diacritics using the MARC8 version of
> USMARC. We have changed ours to Unicode now, but we did have an issue of
> the catalogue outputting unicode records that weren't tagged as such in
> the leader and so couldn't be identified as proper MARC21 (current
> version of USMARC). III have solved this with their latest release. This
> issue had me scratching my head with a lot of my MARC::Record scripts,
> but generally they failed quite spectacularly.


Actually, I believe I am suffering from several different types of
errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
syntactical errors (lack of periods, invalid choices of indicators, etc.),
and 3) incorrect data types (strings entered into fields denoted for
integers, etc.). Just about the only errors I haven't encountered are
structural ones, such as an invalid leader, and this doesn't even take into
account possible data entry errors (the author is Franklin when Twain was
entered).

Yes, I do have an encoding issue. All of my incoming records are in MARC8.
I'm not sure, but I think the Primo tool expects UTF-8. I can easily update
the encoding bit (change leader position 09 from blank to a), but this does
not change any actual encoding in the bibliographic section of my data.
Consequently, after updating the encoding bit and looping through my munged
data, MARC::Record chokes with the following error on records that are
denoted as UTF-8 but still include MARC8 characters:

  utf8 "\xE8" does not map to Unicode at
  /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
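
To make the failure concrete, here is a minimal sketch in Python (standard
library only, not the Perl used in my scripts). The leader and the data
fragment are made-up illustrations, not a complete record. It assumes the
offending bytes are MARC8, where 0xE8 is the combining umlaut and comes
before the letter it modifies. Flipping leader position 09 only relabels the
record; the data bytes stay MARC8, so a strict UTF-8 decode fails.

```python
# Hypothetical 24-byte leader and a MARC8 data fragment (not a full record).
LEADER = b"00000nam  2200000   4500"
FIELD_DATA = b"M\xe8unich"  # MARC8: 0xE8 (combining umlaut) precedes "u"


def mark_as_utf8(leader: bytes) -> bytes:
    """Set leader position 09 to 'a'; nothing else is changed."""
    return leader[:9] + b"a" + leader[10:]


def try_decode(data: bytes) -> str:
    """Attempt a strict UTF-8 decode and report failure instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"decode failed: {err.reason} at byte {err.start}"


new_leader = mark_as_utf8(LEADER)   # the record now claims to be UTF-8
result = try_decode(FIELD_DATA)     # but the data bytes are unchanged

print(new_leader)
print(result)
```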

Upon looking at the raw MARC, I see that the offending record includes the
word Münich. What can I do to transform MARC8 data into UTF-8? What can I do
to trap the error above and skip these invalid records?
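
One possible shape of an answer, sketched in Python with only the standard
library. This is a toy, not a real converter: the table covers a handful of
MARC8 (ANSEL) combining diacritics, and escape sequences and the other
character sets are not handled. The function names and sample data are
hypothetical. The idea is to move each combining mark after its base letter
(MARC8 puts it before, Unicode puts it after), normalize, and trap anything
the converter cannot handle so the record is skipped rather than fatal. In
Perl, I believe MARC::Charset's marc8_to_utf8 does the real conversion, and
wrapping the per-record work in an eval block would play the role of the
try/except here.

```python
import unicodedata

# Partial table of MARC8 (ANSEL) combining diacritics. A real converter
# needs the full Library of Congress code tables.
COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
    0xF0: "\u0327",  # cedilla
}


def marc8_to_unicode(data: bytes) -> str:
    """Convert the small MARC8 subset above to an NFC Unicode string."""
    out, pending = [], []
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])   # hold until the base letter
        elif byte < 0x80 and byte != 0x1B:    # plain ASCII, no escapes
            out.append(chr(byte))
            out.extend(pending)               # marks follow the base letter
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} not handled by this sketch")
    if pending:
        raise ValueError("combining mark with no base letter")
    return unicodedata.normalize("NFC", "".join(out))


def convert_all(fields):
    """Convert each item; collect (index, reason) for the ones skipped."""
    converted, skipped = [], []
    for i, raw in enumerate(fields):
        try:
            converted.append(marc8_to_unicode(raw))
        except ValueError as err:
            skipped.append((i, str(err)))
    return converted, skipped


# The third sample contains a byte outside the toy table, so it is skipped.
converted, skipped = convert_all([b"M\xe8unich", b"Twain", b"bad \xc7 byte"])
print(converted)
print(skipped)
```

Once converted, the text can be written out with .encode("utf-8"), and only
then does it make sense to set leader position 09 to "a".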

-- 
Eric Lease Morgan