In Perl, something like this might do the trick:
# Fix non-UTF-8 characters with two highest bits set (we assume they are
# actually ISO-8859-1)
# Rule: there can't be a single byte with the high bits set followed by
# a byte in range 00-7F or C0-FF
$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
No wrapping there to keep it single-line. :)
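To see what the substitution does, here is a small sketch that wraps it in a sub and runs it on two sample byte strings. The sub name and the sample strings are made up for illustration; only the regex comes from the message above. Note that the lookahead needs a following byte, so a stray byte at the very end of the string would be left alone.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one-liner quoted above.
sub latin1_strays_to_utf8 {
    my ($str) = @_;
    # A byte in C0-FF that is not followed by a continuation byte (80-BF)
    # cannot start a UTF-8 sequence, so treat it as ISO-8859-1 and
    # re-encode it as the equivalent two-byte UTF-8 sequence.
    $str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
    return $str;
}

# First sample has a bare ISO-8859-1 e-acute (E9); second is already UTF-8 (C3 A9).
for my $bytes ( "Mus\xE9e du Louvre", "Mus\xC3\xA9e du Louvre" ) {
    my $fixed = latin1_strays_to_utf8($bytes);
    # Hex dumps make the byte-level change visible.
    printf "%s -> %s\n", unpack( "H*", $bytes ), unpack( "H*", $fixed );
}
```

The already-valid string should pass through untouched, since C3 is followed by a continuation byte and so never matches.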
--Ere
On 7.10.2010 14:56, Cowles, Esme wrote:
> Eric-
>
> I don't know the original source of those MARC files, but I've worked
> with files from an III system where diacritics had to be entered as
> character code escapes like "Muse{226}e du Louvre" (where 226 is the
> ANSEL code for a combining acute accent). So if somebody made a typo
> and entered something like "Muse{22}6e du Louvre" instead, you'd get
> some bogus invalid character. I was working with MARCXML files in
> Java, so I wrote a FilterReader class that removed any characters
> that were invalid in UTF-8 XML. I assume you could do something
> similar in Perl (probably with a fancy one-line regex).
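A rough Perl take on that filtering idea might look like the sketch below. It only checks UTF-8 well-formedness (not the extra XML character restrictions), the sub name is hypothetical, and it is not a port of the Java FilterReader, which isn't shown here. It keeps every well-formed UTF-8 sequence and silently drops any other byte.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character, following the byte ranges
# given in the Unicode standard.
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: keep well-formed sequences, drop every other byte.
sub strip_invalid_utf8 {
    my ($bytes) = @_;
    my $clean = '';
    # Walk the string from left to right; $1 is only set when a
    # well-formed character matched, otherwise "." consumed one bad byte.
    while ( $bytes =~ /\G(?:($utf8_char)|.)/gs ) {
        $clean .= $1 if defined $1;
    }
    return $clean;
}

print strip_invalid_utf8("Mus\xC3e du Louvre"), "\n";
```

Dropping bytes from a raw .mrc file would throw off the record length and directory offsets, so something like this is probably only safe on individual field values after the record has been parsed.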
>
> -Esme -- Esme Cowles
>
> "We've all heard that a million monkeys banging on a million
> typewriters will eventually reproduce the works of Shakespeare. Now,
> thanks to the Internet, we know this is not true." -- Robert
> Wilensky
>
> On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:
>
>> How do I trap for unwanted (bogus) characters in MARC records?
>>
>> I have a set of Internet Archive identifiers, and have written the
>> following Perl loop to get the MARC records associated with each
>> one:
>>
>> # process each identifier
>> my $ua = LWP::UserAgent->new( agent => AGENT );
>> while ( <DATA> ) {
>>
>>   # get the identifier
>>   chop;
>>   my $identifier = $_;
>>   print $identifier, "\n";
>>
>>   # get its corresponding MARC record
>>   my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>>   if ( ! $response->is_success ) {
>>
>>     warn $response->status_line;
>>     next;
>>
>>   }
>>
>>   # save it
>>   open MARC, "> $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>   binmode MARC, ":utf8";
>>   print MARC $response->content;
>>   close MARC;
>>
>> }
>>
>> I then use the venerable marcdump to see the fruits of my labors:
>> marcdump *.mrc. Unfortunately, marcdump returns the following error
>> against (at least) one of my files:
>>
>> bienfaitsducatho00pina.mrc
>> utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>>
>> What is going on here? Am I saving my files incorrectly? Is the
>> original MARC data inherently incorrect? Is there some way I can
>> fix the MARC record in question?
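As for trapping the problem before marcdump trips over it, one possibility is to attempt a strict UTF-8 decode inside an eval and flag whatever fails. This is only a sketch: it assumes the data is supposed to be UTF-8, and the sub name is hypothetical. Encode::decode with FB_CROAK dies on malformed input with the same kind of "does not map to Unicode" message shown above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes decodes as strict UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting a replacement character; LEAVE_SRC stops decode()
    # from modifying its input. eval traps the die.
    my $ok = eval {
        Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
        1;
    };
    return $ok ? 1 : 0;
}

# Sample byte strings: the first is valid UTF-8, the second has a lone C3.
for my $bytes ( "Mus\xC3\xA9e", "Mus\xC3e" ) {
    printf "%s: %s\n", unpack( "H*", $bytes ),
        is_valid_utf8($bytes) ? "ok" : "bogus bytes";
}
```

Running a check like this on each downloaded record would at least tell you which files need attention, without changing any of them.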
>>
>> -- Eric Lease Morgan
>
--
Ere Maijala
Kansalliskirjasto