This record has the classic signs of Unicode treated as Latin-1 by mistake. The multibyte characters often show up as Ã followed by some other random character. This actually happened to my conference badge in Asheville, which read "EsmÃ© Cowles".

-Esmé

--
Esme Cowles <[log in to unmask]>

"Necessity is the plea for every infringement of human freedom. It is
 the argument of tyrants; it is the creed of slaves."
 -- William Pitt, 1783

On Oct 7, 2010, at 9:39 AM, Ross Singer wrote:

> Eric, is this your source file?
>
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc
>
> I have nothing really much to offer with regard to MARC.pm and its
> ilk, but I thought it might help people track down your problem.
>
> FWIW, yaz-marcdump spits out this on that record:
>
> $ yaz-marcdump bienfait.mrc
> 00795cam a2200277 a 4500
> 001 1556719
> 003 CaOTULAS
> 005 19931129144435.0
> 008 780210s1842 fr fre d
> 035    $a (Sirsi) AZF-9578
> 040    $a NUC $c NUC $d otsm
> 049    $a otstm $b eng
> 050 04 $a BX946 $b .P5
> 055  3 $a BX946 $b .P55 1842
> 090  8 $a BX 946 .P55 1842 $b SMRS
> 100 10 $a Pinard, Clovis, $d d.1865.
> 245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
> (No separator at end of field length=71)
> 260 na $d .
> (Separator but not at end of field length=26)
> 300 18 $2 .
> (Separator but not at end of field length=11)
> 490 00 $p .
> (Separator but not at end of field length=45)
> (Bad indicator data. Skipping 2 bytes)
> 596 ?t $e nn
> (No separator at end of field length=7)
> 610 ne
> (Separator but not at end of field length=30)
> 948 xH $s tory.
> (Separator but not at end of field length=27)
> 039 0/ $6 /199
> (No separator at end of field length=9)
> (Bad indicator data. Skipping 1 bytes)
> 093 0  $f mcsk
> (Separator but not at end of field length=21)
> 926 12 $1 44434
> (Separator but not at end of field length=48)
>
> The diacritics definitely look pretty sketchy there.
>
> In fact, I just tried this with every encoding in yaz-marcdump, and
> the diacritics never properly converted to UTF-8.
>
> They seem ok here:
>
> http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml
>
> though, so you might want to grab both binary marc and marcxml and
> fall back to the latter in case of encoding errors.
>
> -Ross.
>
> On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>> How do I trap for unwanted (bogus) characters in MARC records?
>>
>> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>>
>>   # process each identifier
>>   my $ua = LWP::UserAgent->new( agent => AGENT );
>>   while ( <DATA> ) {
>>
>>     # get the identifier
>>     chop;
>>     my $identifier = $_;
>>     print $identifier, "\n";
>>
>>     # get its corresponding MARC record
>>     my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
>>     if ( ! $response->is_success ) {
>>
>>       warn $response->status_line;
>>       next;
>>
>>     }
>>
>>     # save it
>>     open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>     binmode MARC, ":utf8";
>>     print MARC $response->content;
>>     close MARC;
>>
>>   }
>>
>> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>>
>>   bienfaitsducatho00pina.mrc
>>   utf8 "\xC3" does not map to Unicode at /System/Library/
>>   Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>>
>> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>>
>> --
>> Eric Lease Morgan
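
[Editor's note: the mojibake pattern Esmé describes at the top of the thread — UTF-8 bytes mistakenly decoded as Latin-1 — can be reproduced in a few lines. The thread's own code is Perl; Python is used here purely for illustration.]

```python
# "é" is U+00E9. Encoded as UTF-8 it becomes the two bytes 0xC3 0xA9.
# Decoding those bytes as Latin-1 yields one character per byte,
# producing the classic "Ã" followed by a seemingly random character.
name = "Esmé"
utf8_bytes = name.encode("utf-8")        # b'Esm\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")  # "EsmÃ©"
print(mojibake)
```

This also explains the Encode.pm error Eric saw: a 0xC3 byte that is not followed by a valid UTF-8 continuation byte cannot be decoded as UTF-8 at all.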
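
[Editor's note: Ross's suggested strategy — try the binary MARC first and fall back to the MARCXML when the bytes fail to decode — might be sketched as below. The helper name is hypothetical, not from the thread; Python is used for illustration.]

```python
def utf8_or_fallback(raw: bytes):
    """Return (text, ok). ok is False when raw is not valid UTF-8,
    signalling the caller to fetch the _marc.xml version instead."""
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so decoding never
        # fails; the text is at least inspectable for diagnosis.
        return raw.decode("latin-1"), False

# A lone 0xE9 byte ("é" in Latin-1) is invalid UTF-8:
text, ok = utf8_or_fallback(b"Esm\xe9")
print(ok)  # False
```

Decoding strictly and treating failure as a signal, rather than silently substituting replacement characters, is what lets the caller know when to reach for the cleaner MARCXML copy.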