Eric, is this your source file?

http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc

I don't have much to offer with regard to MARC.pm and its
ilk, but I thought this might help people track down your problem.

FWIW, yaz-marcdump spits out this on that record:

$ yaz-marcdump bienfait.mrc
00795cam a2200277 a 4500
001 1556719
003 CaOTULAS
005 19931129144435.0
008 780210s1842    fr                  fre d
035    $a (Sirsi) AZF-9578
040    $a NUC $c NUC $d otsm
049    $a otstm $b eng
050 04 $a BX946 $b .P5
055  3 $a BX946 $b .P55 1842
090  8 $a BX 946 .P55 1842 $b SMRS
100 10 $a Pinard, Clovis, $d d.1865.
245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
(No separator at end of field length=71)
260 na $d .
(Separator but not at end of field length=26)
300 18 $2 .
(Separator but not at end of field length=11)
490 00 $p .
(Separator but not at end of field length=45)
(Bad indicator data. Skipping 2 bytes)
596 ?t $e nn
(No separator at end of field length=7)
610 ne
(Separator but not at end of field length=30)
948 xH $s tory.
(Separator but not at end of field length=27)
039 0/ $6 /199
(No separator at end of field length=9)
(Bad indicator data. Skipping 1 bytes)
093 0  $f mcsk
(Separator but not at end of field length=21)
926 12 $1 44434
(Separator but not at end of field length=48)

The diacritics definitely look pretty sketchy there.

In fact, I just tried this with every encoding in yaz-marcdump, and
the diacritics never properly converted to UTF-8.
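
Two cheap checks on the raw bytes might help narrow this down: whether
the file is still as long as its leader says (00795 in the dump above),
and whether it decodes as strict UTF-8, which the 'a' in leader position
09 claims it should. Here is a minimal sketch, assuming the file holds a
single record and using only the core Encode module. The name
inspect_marc_bytes is a hypothetical helper of mine, not part of MARC.pm
or MARC::Record:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report basic consistency facts about the raw
# bytes of ONE binary MARC record. Returns a hash reference.
sub inspect_marc_bytes {
    my ($bytes) = @_;
    my %report;

    # Leader positions 00-04 hold the record length as five ASCII digits.
    my $declared = substr( $bytes, 0, 5 );
    $report{declared_length} = ( $declared =~ /\A\d{5}\z/ ) ? $declared + 0 : undef;
    $report{actual_length}   = length $bytes;
    $report{length_ok} =
        ( defined( $report{declared_length} )
            && $report{declared_length} == $report{actual_length} ) ? 1 : 0;

    # Leader position 09: 'a' means UCS/Unicode, blank means MARC-8.
    $report{claims_unicode} =
        ( length($bytes) > 9 && substr( $bytes, 9, 1 ) eq 'a' ) ? 1 : 0;

    # Strict decode of a copy (decode may modify its argument when a
    # CHECK value is given); FB_CROAK dies on malformed UTF-8.
    my $copy = $bytes;
    $report{valid_utf8} =
        eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 } ? 1 : 0;

    return \%report;
}

# Usage: read the file in raw mode so no encoding layer touches the bytes.
# open my $fh, '<:raw', 'bienfaitsducatho00pina.mrc' or die "open: $!\n";
# my $bytes = do { local $/; <$fh> };
# my $r = inspect_marc_bytes($bytes);
# print "length ok: $r->{length_ok}, valid UTF-8: $r->{valid_utf8}\n";
```

If the actual length comes out larger than the declared one, that would
suggest the bytes were re-encoded somewhere between the server and the
disk, rather than the record being wrong at the source.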

They seem ok here:

http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml

though, so you might want to grab both binary marc and marcxml and
fall back to the latter in case of encoding errors.
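
Something along these lines might work as a starting point. It is only a
sketch under a few assumptions: save_record, save_raw and is_valid_utf8
are hypothetical names, the "does it decode as strict UTF-8" test is a
stand-in for whatever validity check you prefer (it would also reject a
legitimate MARC-8 record with diacritics), and the fetching is passed in
as a code reference so the logic can be tried without the network. Note
that it writes in raw mode: printing already-encoded bytes through a
:utf8 layer re-encodes every byte above 0x7F.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decode cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its argument
    return eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 } ? 1 : 0;
}

# Hypothetical helper: write bytes to $path with no encoding layer.
sub save_raw {
    my ( $path, $bytes ) = @_;
    open my $fh, '>:raw', $path or die "Can't open $path: $!\n";
    print {$fh} $bytes;
    close $fh or die "Can't close $path: $!\n";
    return;
}

# Try the binary MARC first; fall back to MARCXML if the binary record
# is missing or fails the validity check.
# $fetch is a code reference: takes a URL, returns raw bytes or undef.
# Returns 'mrc', 'xml', or undef if neither could be saved.
sub save_record {
    my ( $identifier, $root, $fetch, $dir ) = @_;
    my $base = "$root$identifier/$identifier";

    my $mrc = $fetch->("${base}_meta.mrc");
    if ( defined $mrc && is_valid_utf8($mrc) ) {
        save_raw( "$dir/$identifier.mrc", $mrc );
        return 'mrc';
    }

    my $xml = $fetch->("${base}_marc.xml");
    if ( defined $xml ) {
        save_raw( "$dir/$identifier.xml", $xml );
        return 'xml';
    }
    return;
}

# Usage with LWP (content returns the raw bytes of the response body;
# $root stands for whatever ROOT is in the original script):
# use LWP::UserAgent;
# my $ua    = LWP::UserAgent->new;
# my $fetch = sub { my $r = $ua->get(shift); $r->is_success ? $r->content : undef };
# my $kind  = save_record( 'bienfaitsducatho00pina', $root, $fetch, '.' );
```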

-Ross.

On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan <[log in to unmask]> wrote:
> How do I trap for unwanted (bogus) characters in MARC records?
>
> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>
>  # process each identifier
>  my $ua = LWP::UserAgent->new( agent => AGENT );
>  while ( <DATA> ) {
>
>    # get the identifier
>    chomp;
>    my $identifier = $_;
>    print $identifier, "\n";
>
>    # get its corresponding MARC record
>    my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>    if ( ! $response->is_success ) {
>
>      warn $response->status_line;
>      next;
>
>    }
>
>    # save it
>    open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>    binmode MARC, ":utf8";
>    print MARC $response->content;
>    close MARC;
>
>  }
>
> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>
>  bienfaitsducatho00pina.mrc
>  utf8 "\xC3" does not map to Unicode at /System/Library/
>  Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>
> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>
> --
> Eric Lease Morgan
>