The marcxml version of the record looks fine:

http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml

Perhaps there was a problem when the record was converted from marcxml to marc at some point.
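
Ross's suggestion (quoted below) of falling back to the MARCXML copy needs some test for when the binary record is bad. Here is a minimal sketch of one, assuming the record is supposed to be UTF-8. It walks the ISO 2709 directory and checks that every field ends with the field terminator and decodes cleanly. The name marc_looks_sane is made up for illustration; it is not part of MARC::Record or yaz.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if a raw ISO 2709 (binary MARC) record is
# structurally consistent and every field is valid UTF-8.
sub marc_looks_sane {
    my ($raw) = @_;
    return 0 if length($raw) < 26;

    # Leader bytes 12-16 hold the base address of the field data.
    my $base = substr($raw, 12, 5);
    return 0 unless $base =~ /^\d{5}$/;

    # The directory runs from byte 24 to the first field terminator (0x1E).
    my $dir_end = index($raw, "\x1E", 24);
    return 0 if $dir_end < 24;
    return 0 unless $dir_end == $base - 1;
    my $dir = substr($raw, 24, $dir_end - 24);
    return 0 if length($dir) % 12;

    # Each 12-byte entry is: tag (3), field length (4), start offset (5).
    for (my $i = 0; $i < length($dir); $i += 12) {
        my $entry = substr($dir, $i, 12);
        return 0 unless $entry =~ /^.{3}(\d{4})(\d{5})$/s;
        my ($len, $start) = ($1, $2);
        return 0 if $len < 1 or $base + $start + $len > length($raw);

        # Lengths and offsets count bytes, so the field must end in 0x1E.
        my $field = substr($raw, $base + $start, $len);
        return 0 unless substr($field, -1) eq "\x1E";

        # FB_CROAK makes decode() die on malformed UTF-8.
        return 0 unless eval { decode('UTF-8', $field, FB_CROAK); 1 };
    }
    return 1;
}
```

If this returns false for the bytes of the _meta.mrc file, fetch the _marc.xml file instead. A directory length that counts characters rather than bytes would cut a field off in the middle of a multibyte sequence, which would be consistent with the "\xC3" error Eric reports, though that is only a guess about this particular record.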

-Tim

Tim Prettyman
University of Michigan/LIT



On 10/7/10 9:58 AM, "Cowles, Esme" <[log in to unmask]> wrote:

This record has the classic signs of Unicode treated as Latin-1 by mistake.  The multibyte characters often show up as Ã followed by some other random character.  This actually happened to my conference badge in Asheville, which read "EsmÃ© Cowles".
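
To make the mechanism concrete, here is a small sketch using only the core Encode module. The variable names and the hex_bytes helper are just for illustration; it shows how the two UTF-8 bytes for an accented letter turn into four when they are misread as Latin-1 and written out again.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative helper: show a byte string as space-separated hex.
sub hex_bytes {
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]);
}

# The name as UTF-8 bytes: the e-acute (U+00E9) becomes the two bytes C3 A9.
my $utf8 = encode('UTF-8', "Esm\x{E9}");

# Misread those bytes as Latin-1: C3 becomes U+00C3 and A9 becomes U+00A9.
my $misread = decode('ISO-8859-1', $utf8);

# Write the misread text out as UTF-8 and each original byte is encoded again.
my $double = encode('UTF-8', $misread);

print hex_bytes($utf8),   "\n";   # 45 73 6D C3 A9
print hex_bytes($double), "\n";   # 45 73 6D C3 83 C2 A9

# Reversing the two steps recovers the original bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));
print hex_bytes($repaired), "\n"; # 45 73 6D C3 A9
```

Perl performs the same upgrade when a byte string is printed to a filehandle that has the :utf8 layer, so raw MARC bytes from an HTTP response are probably safer written with binmode set to ':raw'.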

-Esmé
--
Esme Cowles <[log in to unmask]>

"Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves." -- William Pitt, 1783

On Oct 7, 2010, at 9:39 AM, Ross Singer wrote:

> Eric, is this your source file?
>
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc
>
> I have nothing really much to offer with regard to MARC.pm and its
> ilk, but I thought it might help people track down your problem.
>
> FWIW, yaz-marcdump spits out this on that record:
>
> $ yaz-marcdump bienfait.mrc
> 00795cam a2200277 a 4500
> 001 1556719
> 003 CaOTULAS
> 005 19931129144435.0
> 008 780210s1842    fr                  fre d
> 035    $a (Sirsi) AZF-9578
> 040    $a NUC $c NUC $d otsm
> 049    $a otstm $b eng
> 050 04 $a BX946 $b .P5
> 055  3 $a BX946 $b .P55 1842
> 090  8 $a BX 946 .P55 1842 $b SMRS
> 100 10 $a Pinard, Clovis, $d d.1865.
> 245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
> (No separator at end of field length=71)
> 260 na $d .
> (Separator but not at end of field length=26)
> 300 18 $2 .
> (Separator but not at end of field length=11)
> 490 00 $p .
> (Separator but not at end of field length=45)
> (Bad indicator data. Skipping 2 bytes)
> 596 ?t $e nn
> (No separator at end of field length=7)
> 610 ne
> (Separator but not at end of field length=30)
> 948 xH $s tory.
> (Separator but not at end of field length=27)
> 039 0/ $6 /199
> (No separator at end of field length=9)
> (Bad indicator data. Skipping 1 bytes)
> 093 0  $f mcsk
> (Separator but not at end of field length=21)
> 926 12 $1 44434
> (Separator but not at end of field length=48)
>
> The diacritics definitely look pretty sketchy there.
>
> In fact, I just tried this with every encoding in yaz-marcdump, and
> the diacritics never properly converted to UTF-8.
>
> They seem ok here:
>
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml
>
> though, so you might want to grab both binary marc and marcxml and
> fall back to the latter in case of encoding errors.
>
> -Ross.
>
> On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>> How do I trap for unwanted (bogus) characters in MARC records?
>>
>> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>>
>>  # process each identifier
>>  my $ua = LWP::UserAgent->new( agent => AGENT );
>>  while ( <DATA> ) {
>>
>>    # get the identifier
>>    chomp;
>>    my $identifier = $_;
>>    print $identifier, "\n";
>>
>>    # get its corresponding MARC record
>>    my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>>    if ( ! $response->is_success ) {
>>
>>      warn $response->status_line;
>>      next;
>>
>>    }
>>
>>    # save it
>>    open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>    binmode MARC, ":utf8";
>>    print MARC $response->content;
>>    close MARC;
>>
>>  }
>>
>> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>>
>>  bienfaitsducatho00pina.mrc
>>  utf8 "\xC3" does not map to Unicode at /System/Library/
>>  Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>>
>> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>>
>> --
>> Eric Lease Morgan
>>