Eric-

I don't know the original source of those MARC files, but I've worked with files from an III system where diacritics had to be entered as character code escapes like "Muse{226}e du Louvre" (where 226 is the ANSEL code for a combining acute accent). So if somebody made a typo and entered something like "Muse{22}6e du Louvre" instead, you'd get a bogus, invalid character.

I was working with MARCXML files in Java, so I wrote a FilterReader class that removed any characters that were invalid in UTF-8 XML. I assume you could do something similar in Perl (probably with a fancy one-line regex).

-Esme
--
Esme Cowles

"We've all heard that a million monkeys banging on a million typewriters will eventually reproduce the works of Shakespeare. Now, thanks to the Internet, we know this is not true." -- Robert Wilensky

On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:

> How do I trap for unwanted (bogus) characters in MARC records?
>
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>
>   # process each identifier
>   my $ua = LWP::UserAgent->new( agent => AGENT );
>   while ( <DATA> ) {
>
>     # get the identifier
>     chop;
>     my $identifier = $_;
>     print $identifier, "\n";
>
>     # get its corresponding MARC record
>     my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>     if ( ! $response->is_success ) {
>
>       warn $response->status_line;
>       next;
>
>     }
>
>     # save it
>     open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>     binmode MARC, ":utf8";
>     print MARC $response->content;
>     close MARC;
>
>   }
>
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>
>   bienfaitsducatho00pina.mrc
>   utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>
> --
> Eric Lease Morgan
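
For anyone who wants to try the FilterReader idea from the reply above, here is a minimal sketch. It is not the class Esme actually wrote. The class name XmlCharFilterReader and the helper filter() are hypothetical, and it assumes "invalid" means "outside the Char production of XML 1.0" (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, plus supplementary characters encoded as valid surrogate pairs). It only drops characters after decoding; it does not repair a MARC record whose bytes were written with the wrong encoding.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: drops characters that XML 1.0 does not allow.
class XmlCharFilterReader extends FilterReader {
    private static final int EMPTY = -2;
    private int lookahead = EMPTY; // char read ahead that still needs filtering
    private int pendingLow = -1;   // low half of a valid surrogate pair, emitted as-is

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // True for the BMP part of the XML 1.0 Char production (surrogates excluded).
    static boolean isXmlBmpChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    private int nextRaw() throws IOException {
        if (lookahead != EMPTY) {
            int c = lookahead;
            lookahead = EMPTY;
            return c;
        }
        return in.read();
    }

    @Override
    public int read() throws IOException {
        if (pendingLow != -1) {
            int low = pendingLow;
            pendingLow = -1;
            return low;
        }
        while (true) {
            int c = nextRaw();
            if (c == -1) return -1;
            if (isXmlBmpChar(c)) return c;
            if (Character.isHighSurrogate((char) c)) {
                int next = nextRaw();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    pendingLow = next; // valid pair: keep both halves
                    return c;
                }
                lookahead = next;      // lone high surrogate: drop it, re-check next
            }
            // anything else (control chars, lone low surrogates, U+FFFE/U+FFFF) is dropped
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Convenience helper (hypothetical) for filtering an in-memory string.
    static String filter(String s) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            int c;
            while ((c = r.read()) != -1) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In use, you would wrap whatever Reader feeds the XML parser, for example new XmlCharFilterReader(new InputStreamReader(stream, StandardCharsets.UTF_8)). With the typo from the reply, where {22} becomes the control character U+0016, XmlCharFilterReader.filter("Muse\u00166e du Louvre") drops the control character and leaves "Muse6e du Louvre". The sketch leaves skip() and mark() delegating to the wrapped reader, so it is only suitable for plain sequential reads.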