Hi,
A few days ago, I showed pymarc to a group of technical librarians to
demonstrate how easily certain tasks can be scripted or automated.
Unfortunately, it blew up on me when I tried to write a record:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
ordinal not in range(128)
Investigation revealed this culprit:
=LDR 00916nam a2200241I 4500
=001 ocm10685946
=005 19880203211447.0
=007 cr\bn||||||abp
=007 cr\bn||||||cda
=008 840503s1939\\\\gw\\\\\\\\\\\\00010\ger\d
=040 \\$aMBB$cMBB$dCRL
=049 \\$aCRLL
=100 10$aEsser, Hermann,$d1900-
=245 14$aDie j<E8>udischer Weltpest ;$bjudend<E1>ammerung auf dem
Erdball,$cvon Hermann Esser.
=260 0\$aM<E8>unchen,$bZentralverlag der N S D A P., F. Eher ahchf.,$c1939.
=300 \\$a243 [1] p.$c23 cm.
=533 \\$aAlso available as electronic reproduction.$bChicago :$cCenter for
Research Libraries,$d[2009]
=650 \0$aJewish question.
=700 12$aBierbrauer, Johann Jacob,$d1705-1760?
=710 2\$aCenter for Research Libraries (U.S.)
=856 41$uhttp://dds.crl.edu/CRLdelivery.asp?tid=10538$zOnline version
=907 \\$a.b28931622$b08-30-10$c08-30-10
=998 \\$awww$b08-30-10$cm$dz$e-$fger$ggw $h4$i0
Leader position 9 (leader[9]) is set to 'a', so the record should contain
UTF-8-encoded Unicode [1], but the bytes E8 75 in the 245$a appear to be
ANSEL (MARC-8), where 'E8' denotes the umlaut that precedes the lowercase
'u' (0x75). [2]
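To illustrate the mismatch, here is a minimal Python 3 sketch. The function
names are my own (hypothetical, not part of pymarc), and the ANSEL handling
covers only the single byte 0xE8 and assumes everything else is plain ASCII.
It shows that the 245$a bytes are not valid UTF-8 even though the leader
claims Unicode, and that reading them as ANSEL yields the expected umlaut.

```python
import unicodedata

LEADER = "00916nam a2200241I 4500"
# Bytes as they appear at the start of 245$a in the record above.
RAW_TITLE = b"Die j\xe8udischer Weltpest ;"


def claims_utf8(leader):
    """Leader position 9 is 'a' when the record declares UCS/Unicode."""
    return leader[9] == "a"


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def ansel_umlaut_to_unicode(data):
    """Read bytes as ANSEL, handling only the 0xE8 umlaut (an assumption).

    ANSEL places the combining mark BEFORE its base letter; Unicode places
    it AFTER, so the mark is held back, attached to the next character,
    and the result is composed with NFC normalization.
    """
    out = []
    pending = ""
    for byte in data:
        if byte == 0xE8:
            pending = "\u0308"  # COMBINING DIAERESIS
        else:
            out.append(chr(byte) + pending)
            pending = ""
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(claims_utf8(LEADER))                 # leader says Unicode
    print(is_valid_utf8(RAW_TITLE))            # but the bytes are not UTF-8
    print(ansel_umlaut_to_unicode(RAW_TITLE))  # ANSEL reading of the title
```

A real conversion would need the full MARC-8 code table from [2] rather than
this one-byte special case.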
To me, this record looks misencoded. Am I correct? There are
thousands of such records in the data set I'm dealing with, which was
obtained using the 'Data Exchange' feature of III's Millennium system.
My question is how others, especially pymarc users working with III
records, handle this issue, and what other experiences, hints,
practices, or kludges exist in this area.
Thanks.
- Godmar
[1] http://www.loc.gov/marc/bibliographic/bdleader.html
[2] http://lcweb2.loc.gov/diglib/codetables/45.html