This record has the classic signs of UTF-8 being treated as Latin-1 by mistake. Each multibyte character shows up as Ã followed by another seemingly random character. This actually happened to my conference badge in Asheville, which read "EsmÃ© Cowles".
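
To make the effect concrete, here is a minimal sketch in Perl using the core Encode module. The sample string and variable names are my own illustration, not anything taken from the record in question; it only shows how a single accented character turns into two when UTF-8 bytes are read back as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample data: "Esmé" as a Perl character string.
my $name = "Esm\x{E9}";

# Encoding to UTF-8 turns the single character é into two bytes: 0xC3 0xA9.
my $bytes = encode('UTF-8', $name);

# The mistake: decoding those bytes as Latin-1 maps each byte to its own
# character, so 0xC3 becomes Ã and 0xA9 becomes ©.
my $wrong = decode('ISO-8859-1', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$wrong\n";    # prints "EsmÃ©"
```

The 0xC3 byte is the same one marcdump complains about in the error message quoted below.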

-Esmé
--
Esme Cowles

"Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves." -- William Pitt, 1783

On Oct 7, 2010, at 9:39 AM, Ross Singer wrote:

> Eric, is this your source file?
> 
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc
> 
> I have nothing really much to offer with regard to MARC.pm and its
> ilk, but I thought it might help people track down your problem.
> 
> FWIW, yaz-marcdump spits out this on that record:
> 
> $ yaz-marcdump bienfait.mrc
> 00795cam a2200277 a 4500
> 001 1556719
> 003 CaOTULAS
> 005 19931129144435.0
> 008 780210s1842    fr                  fre d
> 035    $a (Sirsi) AZF-9578
> 040    $a NUC $c NUC $d otsm
> 049    $a otstm $b eng
> 050 04 $a BX946 $b .P5
> 055  3 $a BX946 $b .P55 1842
> 090  8 $a BX 946 .P55 1842 $b SMRS
> 100 10 $a Pinard, Clovis, $d d.1865.
> 245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
> (No separator at end of field length=71)
> 260 na $d .
> (Separator but not at end of field length=26)
> 300 18 $2 .
> (Separator but not at end of field length=11)
> 490 00 $p .
> (Separator but not at end of field length=45)
> (Bad indicator data. Skipping 2 bytes)
> 596 ?t $e nn
> (No separator at end of field length=7)
> 610 ne
> (Separator but not at end of field length=30)
> 948 xH $s tory.
> (Separator but not at end of field length=27)
> 039 0/ $6 /199
> (No separator at end of field length=9)
> (Bad indicator data. Skipping 1 bytes)
> 093 0  $f mcsk
> (Separator but not at end of field length=21)
> 926 12 $1 44434
> (Separator but not at end of field length=48)
> 
> The diacritics definitely look pretty sketchy there.
> 
> In fact, I just tried this with every encoding in yaz-marcdump, and
> the diacritics never properly converted to UTF-8.
> 
> They seem ok here:
> 
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml
> 
> though, so you might want to grab both binary marc and marcxml and
> fall back to the latter in case of encoding errors.
> 
> -Ross.
> 
> On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan wrote:
>> How do I trap for unwanted (bogus) characters in MARC records?
>> 
>> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>> 
>>  # process each identifier
>>  my $ua = LWP::UserAgent->new( agent => AGENT );
>>  while ( <DATA> ) {
>> 
>>    # get the identifier
>>    chop;
>>    my $identifier = $_;
>>    print $identifier, "\n";
>> 
>>    # get its corresponding MARC record
>>    my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>>    if ( ! $response->is_success ) {
>> 
>>      warn $response->status_line;
>>      next;
>> 
>>    }
>> 
>>    # save it
>>    open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>    binmode MARC, ":utf8";
>>    print MARC $response->content;
>>    close MARC;
>> 
>>  }
>> 
>> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>> 
>>  bienfaitsducatho00pina.mrc
>>  utf8 "\xC3" does not map to Unicode at /System/Library/
>>  Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>> 
>> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>> 
>> --
>> Eric Lease Morgan
>>