LISTSERV 16.5 - CODE4LIB Archives

On Thu, Mar 8, 2012 at 1:46 PM, Terray, James wrote:

> Hi Godmar,
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
> ordinal not in range(128)
>
> Having seen my fair share of these kinds of encoding errors in Python, I
> can speculate (without seeing the pymarc source code, so please don't hold
> me to this) that it's the Python code that's not set up to handle the UTF-8
> strings from your data source. In fact, the error indicates it's using the
> default 'ascii' codec rather than 'utf-8'. If it said "'utf-8' codec can't
> decode...", then I'd suspect a problem with the data.
>
> If you were to send the full traceback (all the gobbledy-gook that Python
> spews when it encounters an error) and the version of pymarc you're using
> to the program's author(s), they may be able to help you out further.
>
>
My question is less about the Python error, which I understand, than about
the MARC record causing the error and about how others deal with this issue
(if it's a common issue, which I do not know.)

But, here's the long story from pymarc's perspective.

The record has leader[9] == 'a' (which declares UCS/Unicode), but it actually
contains ANSEL-encoded data.  When reading the record with a
MARCReader(to_unicode=False) instance, the record reads ok since no decoding
is attempted, but attempts at writing the record fail with the above error
since pymarc attempts to utf8-encode the ANSEL-encoded string, which contains
non-ASCII bytes such as 0xe8 (the ANSEL umlaut prefix). It does so because
leader[9] == 'a' (see [1]).
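
To make the write failure concrete, here is a minimal sketch in Python 3 that
mimics what Python 2 did when asked to utf8-encode a byte string: it first
decoded the bytes with the default 'ascii' codec. The sample bytes and the
helper name py2_style_encode are assumptions for illustration; this is not
pymarc code and not the actual record.

```python
# Illustrative ANSEL bytes (not the actual record): ANSEL writes a diacritic
# before its base letter, so "a-umlaut" is the prefix 0xE8 followed by "a".
ansel_bytes = b"Universit\xe8at"


def py2_style_encode(raw: bytes) -> bytes:
    """Hypothetical helper mimicking Python 2's str.encode('utf-8').

    Python 2 implicitly decoded a byte string with the default 'ascii'
    codec before encoding it, which is where the 'ascii' error comes from.
    """
    return raw.decode("ascii").encode("utf-8")


try:
    py2_style_encode(ansel_bytes)
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # Report which codec failed, on which byte, and at which offset.
    print(error.encoding, hex(error.object[error.start]), error.start)
```

The failing codec is 'ascii' even though the caller asked for utf-8, which is
why the message names 'ascii' rather than 'utf-8'.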

When reading the record with a MARCReader(to_unicode=True) instance, it'll
throw an exception during marc_decode when trying to utf8-decode the
ANSEL-encoded string. Rightly so.
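
The decode failure can be sketched the same way. The helper is_valid_utf8 is
a hypothetical name, not part of pymarc, and the sample bytes are again made
up for illustration.

```python
# Illustrative ANSEL bytes, not the actual record.
ansel_bytes = b"Universit\xe8at"


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical check: does the data decode as strict UTF-8?"""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xE8 opens a three-byte UTF-8 sequence, but the next byte ("a") is not a
# continuation byte, so strict decoding rejects the string.
print(is_valid_utf8(ansel_bytes))

# The same word properly encoded as UTF-8 passes the check.
print(is_valid_utf8("Universit\u00e4t".encode("utf-8")))
```

A check like this only catches data that is not valid UTF-8. It cannot prove
that data which happens to decode was really meant as UTF-8 (pure ASCII
passes either way).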

I don't blame pymarc for this behavior; to me, the record looks wrong.

 - Godmar

(ps: that said, pymarc's approach also fails in other circumstances. From
what I can see, pymarc shouldn't assume that it's ok to utf8-encode the
field data just because leader[9] is 'a'.  For instance, this would
double-encode correctly encoded MARC/Unicode records that were read with a
MARCReader(to_unicode=False) instance. But that's a separate issue that is
not my immediate concern. pymarc should probably remember whether a record
needs encoding when writing it, rather than consulting leader[9].)
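
A rough sketch of the "remember whether it needs encoding" idea, under the
assumption that the reader records the decision at read time. FieldData,
read_field and write_field are hypothetical names for illustration and do not
reflect pymarc's actual data model or API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FieldData:
    """Hypothetical container; not pymarc's actual data model."""
    value: Union[bytes, str]
    needs_encoding: bool  # recorded by the reader, not inferred from leader[9]


def read_field(raw: bytes, to_unicode: bool) -> FieldData:
    if to_unicode:
        # Decoded to text on the way in, so it must be encoded on the way out.
        return FieldData(raw.decode("utf-8"), needs_encoding=True)
    # Left as raw bytes: writing must pass them through untouched.
    return FieldData(raw, needs_encoding=False)


def write_field(field: FieldData) -> bytes:
    if field.needs_encoding:
        return field.value.encode("utf-8")
    return field.value


utf8_bytes = "Universit\u00e4t".encode("utf-8")  # correctly encoded sample
ansel_bytes = b"Universit\xe8at"                 # illustrative ANSEL sample

# Raw reads round-trip byte for byte, whatever the bytes actually are.
print(write_field(read_field(utf8_bytes, to_unicode=False)) == utf8_bytes)
print(write_field(read_field(ansel_bytes, to_unicode=False)) == ansel_bytes)
```

With the flag carried on the data, the writer never has to guess from
leader[9], so undecoded records are neither re-encoded nor rejected.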


[1]
https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6