In Perl, something like this might do the trick:

# Fix non-UTF-8 characters with two highest bits set (we assume they are actually ISO-8859-1)
# Rule: there can't be a single byte with the high bits set followed by a byte in range 00-7F or C0-FF
$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;

No wrapping there to keep it single-line. :)

--Ere

On 7.10.2010 14:56, Cowles, Esme wrote:
> Eric-
>
> I don't know the original source of those MARC files, but I've worked
> with files from an III system where diacritics had to be entered as
> character code escapes like "Muse{226}e du Louvre" (where 226 is the
> ANSEL code for a combining acute accent). So if somebody made a typo
> and entered something like "Muse{22}6e du Louvre" instead, you'd get
> some bogus invalid character. I was working with MARCXML files in
> Java, so I wrote a FilterReader class that removed any characters
> that were invalid in UTF-8 XML. I assume you could do something
> similar in Perl (probably with a fancy one-line regex).
>
> -Esme
> --
> Esme Cowles
>
> "We've all heard that a million monkeys banging on a million
> typewriters will eventually reproduce the works of Shakespeare. Now,
> thanks to the Internet, we know this is not true." -- Robert
> Wilensky
>
> On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:
>
>> How do I trap for unwanted (bogus) characters in MARC records?
>>
>> I have a set of Internet Archive identifiers, and have written the
>> following Perl loop to get the MARC records associated with each
>> one:
>>
>>   # process each identifier
>>   my $ua = LWP::UserAgent->new( agent => AGENT );
>>   while ( <DATA> ) {
>>
>>     # get the identifier
>>     chop;
>>     my $identifier = $_;
>>     print $identifier, "\n";
>>
>>     # get its corresponding MARC record
>>     my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>>     if ( ! $response->is_success ) {
>>
>>       warn $response->status_line;
>>       next;
>>
>>     }
>>
>>     # save it
>>     open MARC, "> $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>     binmode MARC, ":utf8";
>>     print MARC $response->content;
>>     close MARC;
>>
>>   }
>>
>> I then use the venerable marcdump to see the fruits of my labors:
>> marcdump *.mrc. Unfortunately, marcdump returns the following error
>> against (at least) one of my files:
>>
>>   bienfaitsducatho00pina.mrc
>>   utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>>
>> What is going on here? Am I saving my files incorrectly? Is the
>> original MARC data inherently incorrect? Is there some way I can
>> fix the MARC record in question?
>>
>> --
>> Eric Lease Morgan
>

--
Ere Maijala
Kansalliskirjasto
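
For anyone who wants to try the substitution from the top of this thread on a sample string, here is a minimal sketch. The helper name fix_latin1_bytes and the test string are made up for illustration and are not from the thread. It keeps the same assumption as the one-liner (a byte in C0-FF that is not followed by a continuation byte is really ISO-8859-1) and adds one thing the one-liner does not have: a \z alternative in the lookahead, so that such a byte at the very end of the string is also converted.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative): re-encode stray ISO-8859-1
# bytes in the range C0-FF as two-byte UTF-8 sequences.
sub fix_latin1_bytes {
    my ($bytes) = @_;

    # Assumption: a byte in C0-FF that is NOT followed by a UTF-8
    # continuation byte (80-BF) is really ISO-8859-1. The \z alternative
    # (an addition to the original one-liner) also matches such a byte
    # when it is the last byte of the string.
    $bytes =~ s/([\xC0-\xFF])(?=[\x00-\x7F\xC0-\xFF]|\z)/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/eg;

    return $bytes;
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# "Mus\xE9e" holds a lone ISO-8859-1 e-acute (E9).
print hex_dump( fix_latin1_bytes("Mus\xE9e") ), "\n";    # 4D 75 73 C3 A9 65

# Already valid UTF-8 passes through unchanged.
print hex_dump( fix_latin1_bytes("Mus\xC3\xA9e") ), "\n"; # 4D 75 73 C3 A9 65
```

This only repairs bytes in C0-FF. A stray byte in 80-BF is left alone, and an ISO-8859-1 byte that happens to be followed by a byte in 80-BF will be mistaken for the start of a valid UTF-8 sequence, so treat it as a heuristic and not as a validator. It also does nothing for records that are really in MARC-8.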