>> I also used to think it would be cool if we could get MARC8 encoding/decoding into the Python standard library, but then I realized I'd rather work on other stuff while MARC8 withers and dies.
Wouldn't that be nice. In MarcEdit, all data wants to be treated as UTF8; MARC8 support is there as a legacy feature. That is why processing MARC8 data in MarcEdit is slightly slower than UTF8: a kind of emulation translates character sets on the fly when needed.
--TR
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Gabriel Farrell
Sent: Thursday, March 08, 2012 12:19 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
Sounds like what you do, Terry, and what we need in PyMARC, is something like UnicodeDammit [0]. Actually handling all of these esoteric encodings would be quite the chore, though.
I also used to think it would be cool if we could get MARC8 encoding/decoding into the Python standard library, but then I realized I'd rather work on other stuff while MARC8 withers and dies.
[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753
On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <[log in to unmask]> wrote:
> This is one of the reasons you really can't trust the information found in position 9. It is also why, when I wrote MarcEdit, I used a mixed process for working with data and determining the character set: a process that reads this byte and takes the information under advisement, but in the end treats it more as a suggestion and as one part of a larger heuristic analysis of the record data to determine whether the information is in UTF8 or not. Fortunately, determining whether a set of data is in UTF8 or something else is a fairly easy process. Determining the something else is much more difficult, but generally not necessary.
>
> For that reason, if I were advising other people working on MARC processing libraries, I'd advocate having a process for recognizing that certain informational data may not be set correctly, and essentially using a compatibility process to read and correct it. Unfortunately, while the number of vendors and systems that set this encoding byte correctly has increased dramatically (it used to be pretty much no one), it's still so uneven that I generally consider this information unreliable.
>
> --TR
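
A minimal Python 3 sketch of the kind of check Terry describes above: treat leader position 9 as a hint and let the bytes themselves decide. The function names are made up for illustration; this is not MarcEdit's or pymarc's actual code, and a real heuristic would likely look at more than this.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_is_utf8(record_bytes: bytes) -> bool:
    """Hypothetical helper: combine the leader hint with the data."""
    # Leader position 9 is 'a' when the record claims to be Unicode.
    leader_says_utf8 = record_bytes[9:10] == b"a"
    # Pure ASCII is valid in both encodings, so only the hint is left.
    if all(b < 0x80 for b in record_bytes):
        return leader_says_utf8
    # Otherwise the bytes overrule the leader.
    return looks_like_utf8(record_bytes)
```

An ANSEL umlaut prefix (0xe8) followed by a plain letter is not a valid UTF-8 sequence, so a record like Godmar's would be flagged as "not UTF8" even though its leader says 'a'.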
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Godmar Back
> Sent: Thursday, March 08, 2012 11:01 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
>
> On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <[log in to unmask]> wrote:
>
>> Hi Godmar,
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
>> ordinal not in range(128)
>>
>> Having seen my fair share of these kinds of encoding errors in
>> Python, I can speculate (without seeing the pymarc source code, so
>> please don't hold me to this) that it's the Python code that's not
>> set up to handle the UTF-8 strings from your data source. In fact,
>> the error indicates it's using the default 'ascii' codec rather than
>> 'utf-8'. If it said "'utf-8' codec can't decode...", then I'd suspect a problem with the data.
>>
>> If you were to send the full traceback (all the gobbledy-gook that
>> Python spews when it encounters an error) and the version of pymarc
>> you're using to the program's author(s), they may be able to help you out further.
>>
>>
> My question is less about the Python error, which I understand, than
> about the MARC record causing the error and about how others deal with
> this issue (if it's a common issue, which I do not know.)
>
> But, here's the long story from pymarc's perspective.
>
> The record has leader[9] == 'a', but really, truly contains
> ANSEL-encoded data. When reading the record with a
> MARCReader(to_unicode=False) instance, the record reads ok since no
> decoding is attempted, but attempts at writing the record fail with
> the above error since pymarc attempts to utf8-encode the
> ANSEL-encoded string, which contains non-ascii chars such as 0xe8
> (the ANSEL umlaut prefix). It does so because leader[9] == 'a' (see [1]).
>
> When reading the record with a MARCReader(to_unicode=True) instance, it'll throw an exception during marc_decode when trying to utf8-decode the ANSEL-encoded string. Rightly so.
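
The two failures described above can be reproduced in a short Python 3 sketch. The thread's traceback comes from Python 2, where calling .encode('utf-8') on a byte string first decodes it implicitly with the 'ascii' codec; here that step is made explicit. The sample bytes and the try_decode helper are invented for illustration.

```python
# "Schröder" in ANSEL: the umlaut prefix 0xe8 precedes the base letter.
ansel_bytes = b"Schr\xe8oder"


def try_decode(data: bytes, codec: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc


# The write path: Python 2's implicit ascii decode chokes on 0xe8.
_, ascii_err = try_decode(ansel_bytes, "ascii")

# The to_unicode=True path: 0xe8 opens a multi-byte UTF-8 sequence,
# but the following 'o' is not a continuation byte.
_, utf8_err = try_decode(ansel_bytes, "utf-8")

print(ascii_err)
print(utf8_err)
```

Both attempts fail on the same byte; only the codec named in the error message differs, which matches James's point about reading the message closely.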
>
> I don't blame pymarc for this behavior; to me, the record looks wrong.
>
> - Godmar
>
> (ps: that said, what pymarc does fails in different circumstances -
> from what I can see, pymarc shouldn't assume that it's ok to
> utf8-encode the field data if leader[9] is 'a'. For instance, this
> would double-encode correctly encoded Marc/Unicode records that were
> read with a MARCReader(to_unicode=False) instance. But that's a separate issue
> that is not my immediate concern. pymarc should probably remember if a
> record needs or does not need encoding when writing it, rather than
> consulting the leader[9] field.)
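
One way to picture the suggestion in the ps above, as a Python 3 sketch: let the field data itself carry the "needs encoding" information through its type, instead of consulting leader[9] at write time. FieldData is a hypothetical class invented for this example, not part of pymarc.

```python
class FieldData:
    """Hypothetical holder for field content; not a pymarc class."""

    def __init__(self, value):
        # bytes: read without decoding; str: already decoded text.
        self.value = value

    def as_bytes(self, encoding: str = "utf-8") -> bytes:
        if isinstance(self.value, bytes):
            # Already encoded on disk: pass through untouched, so
            # correctly encoded data is never encoded a second time.
            return self.value
        return self.value.encode(encoding)


decoded = FieldData("M\u00fcller")           # as if read with to_unicode=True
raw = FieldData("M\u00fcller".encode("utf-8"))  # as if read with to_unicode=False

assert decoded.as_bytes() == raw.as_bytes()
```

Under this assumption, raw ANSEL bytes would also be written back unchanged rather than triggering an encode error.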
>
>
> [1] https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6