


I don't know the original source of those MARC files, but I've worked with files from an III system where diacritics had to be entered as character code escapes like "Muse{226}e du Louvre" (where 226 is the decimal ANSEL code for a combining acute accent). So if somebody made a typo and entered something like "Muse{22}6e du Louvre" instead, you'd end up with a bogus character in the record. I was working with MARCXML files in Java, so I wrote a FilterReader subclass that stripped out any characters that aren't valid in UTF-8 XML. I assume you could do something similar in Perl (probably with a fancy one-line regex).
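
Something along these lines would do it. This is a rough sketch rather than the original class: the name XmlCharFilterReader and the clean() helper are made up for illustration, and it assumes the character ranges permitted by XML 1.0 (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, plus surrogate pairs for the supplementary planes).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch: drops characters that XML 1.0 does not allow. */
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** True for UTF-16 code units that may appear in an XML 1.0 document. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || Character.isSurrogate(c); // assumption: surrogates arrive in valid pairs
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip disallowed characters until we hit a good one or end of stream.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n;
        // Keep reading so a chunk made up entirely of bad characters
        // is never reported as "0 characters read".
        while ((n = super.read(buf, off, len)) != -1) {
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact the good characters in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
        }
        return -1;
    }

    /** Convenience helper (illustrative only) for trying the filter on a string. */
    static String clean(String s) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In practice you would wrap whatever Reader you hand to the XML parser in this class, so the parser never sees the stray control characters. Note that it only removes the bad character; it cannot restore the diacritic the cataloger meant to enter.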

On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:

> How do I trap for unwanted (bogus) characters in MARC records?
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>  # process each identifier
>  my $ua = LWP::UserAgent->new( agent => AGENT );
>  while ( <DATA> ) {
>    # get the identifier
>    chop;
>    my $identifier = $_;
>    print $identifier, "\n";
>    # get its corresponding MARC record
>    my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>    if ( ! $response->is_success ) {
>      warn $response->status_line;
>      next;
>    }
>    # save it
>    open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>    binmode MARC, ":utf8";
>    print MARC $response->content;
>    close MARC;
>  }
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>  bienfaitsducatho00pina.mrc
>  utf8 "\xC3" does not map to Unicode at /System/Library/
>  Perl/5.10.0/darwin-thread-multi-2level/ line 162.
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
