Something about this is tugging at my memory, and it hints that the
problem might not be quite what the error message says, i.e. an invalid
Unicode character.
I guess my first couple of questions:
1) What identifiers/records are you pulling? I didn't see any actual
examples in your email. Can you construct the URL that the Perl
script is requesting and give it to us?
I'd guess it's very likely the original marc record is goofed up due
to some transforms. I've seen it from people doing really weird
things to records as part of the submit process to IA.
2) You're sure that is a Unicode MARC record and not MARC-8, right?
3) What version is your MARC::Record module? Might want to upgrade if
it's old; there have been some bug fixes.
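
For questions 2 and 3, something like the rough sketch below might help.
It looks at position 09 of the leader in the first record of each file
('a' means UCS/Unicode, a blank means MARC-8) and reports the installed
MARC::Record version, if any. The sub names and the script name are
made up for illustration, and it only reports what the leader claims,
not what the bytes in the record actually are.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Leader position 09 (zero-based) declares the character coding scheme:
# 'a' means UCS/Unicode (UTF-8), a blank means MARC-8.
sub leader_encoding {
    my ($raw) = @_;
    return 'too short' if length($raw) < 24;    # a MARC leader is 24 bytes
    my $flag = substr( $raw, 9, 1 );
    return $flag eq 'a' ? 'UTF-8'
         : $flag eq ' ' ? 'MARC-8'
         :                "unknown ('$flag')";
}

# Return the installed MARC::Record version, or undef if it is not installed.
sub marc_record_version {
    return eval { require MARC::Record; MARC::Record->VERSION };
}

# Usage: perl check-marc.pl *.mrc   (check-marc.pl is a hypothetical name)
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or do { warn "Can't open $file: $!\n"; next };
    local $/ = "\x1D";    # MARC record terminator
    my $raw = <$fh>;      # read the first record only, as raw bytes
    close $fh;
    printf "%s: leader/09 says %s\n", $file,
        defined $raw ? leader_encoding($raw) : 'empty file';
}

my $version = marc_record_version();
print 'MARC::Record version: ',
    ( defined $version ? $version : 'not installed' ), "\n";
```

If the leader says UTF-8 but marcdump still complains, that would at
least tell us the leader and the data disagree.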
Jon Gorman
On Thu, Oct 7, 2010 at 5:51 AM, Eric Lease Morgan wrote:
> How do I trap for unwanted (bogus) characters in MARC records?
>
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>
> # process each identifier
> my $ua = LWP::UserAgent->new( agent => AGENT );
> while ( <DATA> ) {
>
>     # get the identifier
>     chop;
>     my $identifier = $_;
>     print $identifier, "\n";
>
>     # get its corresponding MARC record
>     my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>     if ( ! $response->is_success ) {
>
>         warn $response->status_line;
>         next;
>
>     }
>
>     # save it
>     open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>     binmode MARC, ":utf8";
>     print MARC $response->content;
>     close MARC;
>
> }
>
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>
> bienfaitsducatho00pina.mrc
> utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>
> --
> Eric Lease Morgan
>