Eric, is this your source file?
http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc
I don't have much to offer with regard to MARC.pm and its
ilk, but I thought this might help people track down your problem.
FWIW, yaz-marcdump spits out this on that record:
$ yaz-marcdump bienfait.mrc
00795cam a2200277 a 4500
001 1556719
003 CaOTULAS
005 19931129144435.0
008 780210s1842 fr fre d
035 $a (Sirsi) AZF-9578
040 $a NUC $c NUC $d otsm
049 $a otstm $b eng
050 04 $a BX946 $b .P5
055 3 $a BX946 $b .P55 1842
090 8 $a BX 946 .P55 1842 $b SMRS
100 10 $a Pinard, Clovis, $d d.1865.
245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
(No separator at end of field length=71)
260 na $d .
(Separator but not at end of field length=26)
300 18 $2 .
(Separator but not at end of field length=11)
490 00 $p .
(Separator but not at end of field length=45)
(Bad indicator data. Skipping 2 bytes)
596 ?t $e nn
(No separator at end of field length=7)
610 ne
(Separator but not at end of field length=30)
948 xH $s tory.
(Separator but not at end of field length=27)
039 0/ $6 /199
(No separator at end of field length=9)
(Bad indicator data. Skipping 1 bytes)
093 0 $f mcsk
(Separator but not at end of field length=21)
926 12 $1 44434
(Separator but not at end of field length=48)
The diacritics definitely look pretty sketchy there.
In fact, I just tried this with every encoding in yaz-marcdump, and
the diacritics never properly converted to UTF-8.
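For reference, the leader in the dump above has 'a' in position 09, which in MARC 21 means the record declares itself as Unicode (UTF-8); a blank there would mean MARC-8. Below is a rough Perl sketch of pulling those leader values out of the raw bytes and comparing the declared record length with the actual one. The name marc_leader_info is hypothetical (it is not part of MARC.pm or MARC::Record), and the sketch assumes the string holds exactly one record as undecoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise the 24-byte leader of one raw MARC record.
# Assumes $bytes is a byte string (not decoded) containing a single record.
sub marc_leader_info {
    my ($bytes) = @_;
    return unless defined $bytes && length($bytes) >= 24;

    my $leader = substr( $bytes, 0, 24 );

    # Leader/00-04 = record length, /09 = character coding scheme,
    # /12-16 = base address of data.
    return unless $leader =~ /^(\d{5}).{4}(.).{2}(\d{5})/s;
    my ( $declared, $coding, $base ) = ( $1, $2, $3 );

    return {
        declared_length => $declared + 0,
        actual_length   => length($bytes),
        base_address    => $base + 0,
        encoding        => $coding eq 'a' ? 'UTF-8'
                         : $coding eq ' ' ? 'MARC-8'
                         :                  'unknown',
        length_matches  => ( $declared == length($bytes) ) ? 1 : 0,
    };
}
```

If the declared length and the size of the saved file disagree, the bytes were probably altered somewhere between the source and disk, or the lengths in the record were wrong to begin with.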
The diacritics do look OK in the MARCXML version, though:
http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml
so you might want to grab both the binary MARC and the MARCXML, and
fall back to the latter in case of encoding errors.
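A minimal sketch of that fallback in Perl, under some assumptions: is_valid_utf8 and fetch_record are hypothetical names, $fetch is a caller-supplied callback that takes a URL and returns the raw bytes (or undef on failure), and "encoding error" is taken to mean only that the binary record is not well-formed UTF-8. That check is a heuristic: a MARC-8 record containing nothing but ASCII would pass it.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() dies on malformed input when FB_CROAK is set;
    # LEAVE_SRC keeps it from modifying $bytes.
    return eval { decode( 'UTF-8', $bytes, FB_CROAK | LEAVE_SRC ); 1 } ? 1 : 0;
}

# Try the binary MARC first; fall back to MARCXML if the binary
# record is missing or is not well-formed UTF-8.
# Returns ( 'marc' | 'marcxml', $bytes ), or an empty list if both fail.
sub fetch_record {
    my ( $fetch, $base ) = @_;

    my $marc = $fetch->( $base . '_meta.mrc' );
    return ( 'marc', $marc ) if defined $marc && is_valid_utf8($marc);

    my $xml = $fetch->( $base . '_marc.xml' );
    return ( 'marcxml', $xml ) if defined $xml;

    return;
}
```

With LWP::UserAgent, the callback could be something like sub { my $r = $ua->get(shift); $r->is_success ? $r->content : undef }. Whichever version is kept, writing it out with binmode set to ':raw' leaves the bytes unchanged, which matters for binary MARC because the directory stores byte offsets and lengths.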
-Ross.
On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan wrote:
> How do I trap for unwanted (bogus) characters in MARC records?
>
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>
> # process each identifier
> my $ua = LWP::UserAgent->new( agent => AGENT );
> while ( <DATA> ) {
>
> # get the identifier
> chomp;
> my $identifier = $_;
> print $identifier, "\n";
>
> # get its corresponding MARC record
> my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
> if ( ! $response->is_success ) {
>
> warn $response->status_line;
> next;
>
> }
>
> # save it
> open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
> binmode MARC, ":utf8";
> print MARC $response->content;
> close MARC;
>
> }
>
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>
> bienfaitsducatho00pina.mrc
> utf8 "\xC3" does not map to Unicode at /System/Library/
> Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>
> --
> Eric Lease Morgan
>