On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <[log in to unmask]> wrote:

> Hi Godmar,
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
> ordinal not in range(128)
>
> Having seen my fair share of these kinds of encoding errors in Python, I
> can speculate (without seeing the pymarc source code, so please don't hold
> me to this) that it's the Python code that's not set up to handle the UTF-8
> strings from your data source. In fact, the error indicates it's using the
> default 'ascii' codec rather than 'utf-8'. If it said "'utf-8' codec can't
> decode...", then I'd suspect a problem with the data.
>
> If you were to send the full traceback (all the gobbledy-gook that Python
> spews when it encounters an error) and the version of pymarc you're using
> to the program's author(s), they may be able to help you out further.

My question is less about the Python error, which I understand, than about the MARC record causing the error, and about how others deal with this issue (if it's a common issue, which I do not know).

But here's the long story from pymarc's perspective. The record has leader[9] == 'a', but really, truly contains ANSEL-encoded data. When reading the record with a MARCReader(to_unicode=False) instance, the record reads fine, since no decoding is attempted; but attempts to write the record fail with the above error, because pymarc tries to UTF-8-encode the ANSEL-encoded string, which contains non-ASCII bytes such as 0xe8 (the ANSEL umlaut prefix). It does so because leader[9] == 'a' (see [1]).

When reading the record with a MARCReader(to_unicode=True) instance, it throws an exception during marc_decode when trying to UTF-8-decode the ANSEL-encoded string. Rightly so. I don't blame pymarc for this behavior; to me, the record looks wrong.

 - Godmar

(ps: that said, what pymarc does fails in other circumstances as well. From what I can see, pymarc shouldn't assume that it's OK to UTF-8-encode the field data just because leader[9] is 'a'. For instance, this would double-encode correctly encoded MARC/Unicode records that were read with a MARCReader(to_unicode=False) instance. But that's a separate issue and not my immediate concern. pymarc should probably remember whether or not a record needs encoding when writing it, rather than consulting the leader[9] field.)

[1] https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
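The mismatch described above, a leader whose position 9 claims Unicode over data that is actually ANSEL, can be sketched in a few lines of plain Python. This is only an illustration of the check, not pymarc's actual code; the helper names are invented for the example:

```python
# Sketch (assumed helper names): detect records whose leader byte 9
# claims UCS/Unicode ('a') but whose field data is not valid UTF-8,
# e.g. ANSEL-encoded data containing 0xe8, the combining-umlaut prefix.

def leader_says_unicode(leader: bytes) -> bool:
    """MARC leader position 9: b'a' means UCS/Unicode, b' ' means MARC-8."""
    return leader[9:10] == b"a"

def data_is_valid_utf8(raw: bytes) -> bool:
    """True if the raw field data decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def encoding_mismatch(leader: bytes, raw_fields: bytes) -> bool:
    """True when the leader claims Unicode but the data cannot be
    decoded as UTF-8 -- the situation that makes a decode (or a
    re-encode on write) blow up."""
    return leader_says_unicode(leader) and not data_is_valid_utf8(raw_fields)

# Example: leader position 9 is 'a', but the field data holds 0xe8
# followed by a plain ASCII letter -- ANSEL, not a valid UTF-8 sequence.
leader = b"00000nam a2200000 a 4500"   # index 9 == b"a"
ansel_data = b"Schr\xe8oder"           # ANSEL: combining char precedes base
print(encoding_mismatch(leader, ansel_data))  # True
```

A writer could run a check like this before trusting leader[9], and fall back to MARC-8/ANSEL handling (or refuse the record) when it returns True, instead of attempting the UTF-8 encode that produces the UnicodeDecodeError quoted at the top of the thread.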