On Jan 23, 2009, at 5:52 AM, Eric Lease Morgan wrote:

> On 1/23/09 4:39 AM, "Brown, Alan" <[log in to unmask]> wrote:
>
>>> Does anybody here know the difference between MARC21 and USMARC?
>>>
>>> I am munging sets of MARC bibliographic data from a III catalog with
>>> holdings data from the same. I am using MARC::Batch to read my bib'
>>> data (with both strict and warnings turned off), insert 853 and 863
>>> fields, and writing the data using the as_usmarc method. Therefore, I
>>> think I am creating USMARC files. I can then use marcdump to... dump
>>> the records. It returns 0 errors.
>>
>> Eric, This isn't an encoding thing is it? I know that a number of III
>> catalogues still encode their diacritics using the MARC8 version of
>> USMARC. We have changed ours to Unicode now, but we did have an issue of
>> the catalogue outputting unicode records that weren't tagged as such in
>> the leader and so couldn't be identified as proper MARC21 (current
>> version of USMARC). III have solved this with their latest release. This
>> issue had me scratching my head with a lot of my MARC::Record scripts,
>> but generally they failed quite spectacularly.
>
> Actually, I believe I am suffering from a number of different types of
> errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
> syntactical errors (lack of periods, invalid choices of indicators, etc.),
> 3) incorrect data types (strings entered into fields denoted for integers,
> etc.) Just about the only thing I haven't encountered are structural errors
> such as invalid leader, and this doesn't even take into account possible
> data entry errors (author is Franklin when Twain was entered).
>
> Yes, I do have an encoding issue. All of my incoming records are in MARC8.
> I'm not sure, but I think the Primo tool expects UTF-8.
> I can easily update the encoding bit (change leader position 09 from
> blank to a), but this does not change any actual encoding in the
> bibliographic section of my data. Consequently, after updating the
> encoding bit and looping through my munged data, MARC::Record chokes with
> the following error on records where UTF-8 is denoted but that include
> MARC8 characters:
>
>   utf8 "\xE8" does not map to Unicode at
>   /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
>
> Upon looking at the raw MARC I see the offending record includes the word
> Münich. What can I do to transform MARC8 data into UTF-8? What can I do to
> trap the error above, and skip these invalid records?

We've had good luck with the yaz-marcdump utility that's included with the
YAZ toolkit. We're using it to convert our exported Horizon records from
MARC8 to UTF-8 before we import into AquaBrowser. The tool is easy to
compile, blindingly fast, forgiving of common MARC errors, and changes the
encoding correctly. It's been serving us well.

-Tod

Tod Olson <[log in to unmask]>
Systems Librarian
University of Chicago Library
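
P.S. On the "trap the error and skip" question, here is a rough sketch of the idea. It is in Python rather than Perl purely for illustration (in Perl the equivalent trap would be an eval block around the decode), and it is not how MARC::Record or yaz-marcdump work internally. The helper names are made up, and the tiny converter assumes that the only non-ASCII byte is 0xE8, which in MARC8 is the combining umlaut written before its base letter. Real records need a complete MARC8 table, which is what a tool like yaz-marcdump provides.

```python
import unicodedata

# Assumption: 0xE8 is the MARC-8 (ANSEL) combining umlaut, which is written
# BEFORE the letter it modifies. All helper names here are hypothetical.
MARC8_UMLAUT = 0xE8
COMBINING_DIAERESIS = "\u0308"  # Unicode combining mark, written AFTER the letter


def partition_by_utf8(raw_fields):
    """Split byte strings into (decoded, rejected) instead of dying on the first bad one."""
    decoded, rejected = [], []
    for raw in raw_fields:
        try:
            decoded.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Flagged as UTF-8 but not valid UTF-8: set aside for conversion.
            rejected.append(raw)
    return decoded, rejected


def toy_marc8_to_unicode(raw):
    """Convert ASCII plus the 0xE8 umlaut only; anything else raises ValueError."""
    pieces = []
    pending = ""  # diacritics seen but not yet attached to a base letter
    for byte in raw:
        if byte == MARC8_UMLAUT:
            pending += COMBINING_DIAERESIS
        elif byte < 0x80:
            pieces.append(chr(byte) + pending)  # move the mark after its letter
            pending = ""
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this toy table")
    if pending:
        raise ValueError("diacritic with no base letter")
    # Compose "u" + combining diaeresis into the single character "ü".
    return unicodedata.normalize("NFC", "".join(pieces))


good, bad = partition_by_utf8([b"Chicago", b"M\xe8unich"])
print(good, bad)
print(toy_marc8_to_unicode(bad[0]))
```

The first helper shows why flipping leader position 09 alone is not enough: the bytes are still MARC8, so a strict UTF-8 decode rejects them. The second shows the kind of reordering a real converter has to do for every diacritic.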