Eric-
I don't know the original source of those MARC files, but I've worked with files from an III system where diacritics had to be entered as character code escapes like "Muse{226}e du Louvre" (where 226 is the ANSEL code for a combining acute accent). So if somebody made a typo and entered something like "Muse{22}6e du Louvre" instead, you'd get some bogus invalid character. I was working with MARCXML files in Java, so I wrote a FilterReader class that removed any characters that were invalid in UTF-8 XML. I assume you could do something similar in Perl (probably with a fancy one-line regex).
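
For illustration, a filter of that kind might look roughly like the sketch below. This is not the original class: the name XmlCharFilterReader and the helper clean() are made up here, and it assumes "invalid" means outside the XML 1.0 Char production and that surrogates arrive in well-formed pairs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: drops characters that XML 1.0 does not allow.
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 allows tab, LF, CR, U+0020..U+D7FF and U+E000..U+FFFD.
    // Surrogates are passed through, assuming they come in valid pairs.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || Character.isSurrogate(c)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip disallowed characters until a good one or end of stream.
        while ((c = in.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream
            }
            // Compact the allowed characters to the front of the range.
            int kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Everything in this chunk was filtered out; read again.
        }
    }

    // Convenience helper (also hypothetical) for cleaning a whole string.
    static String clean(String s) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }
}
```

If a mistyped escape such as {22} ended up as the control character U+0016, wrapping the input Reader in this filter before handing it to the XML parser would silently drop it. Note that this only removes the bogus character; it does not restore the accent the cataloger meant to enter.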
-Esme
--
Esme Cowles
"We've all heard that a million monkeys banging on a million typewriters
will eventually reproduce the works of Shakespeare. Now, thanks to the
Internet, we know this is not true." -- Robert Wilensky
On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:
> How do I trap for unwanted (bogus) characters in MARC records?
>
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>
> # process each identifier
> my $ua = LWP::UserAgent->new( agent => AGENT );
> while ( <DATA> ) {
>
>   # get the identifier
>   chop;
>   my $identifier = $_;
>   print $identifier, "\n";
>
>   # get its corresponding MARC record
>   my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>   if ( ! $response->is_success ) {
>
>     warn $response->status_line;
>     next;
>
>   }
>
>   # save it
>   open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>   binmode MARC, ":utf8";
>   print MARC $response->content;
>   close MARC;
>
> }
>
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>
> bienfaitsducatho00pina.mrc
> utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>
> --
> Eric Lease Morgan