LISTSERV 16.5 - CODE4LIB Archives

Hi Godmar,
Using something similar to Jonathan's suggestion, I pass the 'replace' or 'ignore' error-handling option to Python's string encode method (I don't know the exact heuristics behind these options) when I encounter similar issues while automating the loading of MARC records into III and then exporting and indexing them into our Solr with a Django-based discovery layer.
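
For reference, here is a minimal sketch of what the 'replace' and 'ignore' handlers do. It assumes Python 3 (the snippet later in this message is Python 2), and the sample string is purely illustrative. Note that encoding a str to UTF-8 almost never fails, so the handlers mostly matter when decoding bytes or when encoding to a narrower codec such as ASCII.

```python
# Python 3 sketch: behaviour of the 'replace' and 'ignore' error handlers.
text = "M\u00fcller"  # 'Müller', an illustrative sample value

# Encoding to a codec that cannot represent every character.
strict_failed = False
try:
    text.encode("ascii")  # default errors='strict' raises
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", "replace")  # unencodable char becomes b'?'
ignored = text.encode("ascii", "ignore")    # unencodable char is dropped

# Decoding malformed bytes: 0xfc is Latin-1 'ü', which is not valid UTF-8.
bad_bytes = b"M\xfcller"
decoded_replace = bad_bytes.decode("utf-8", "replace")  # inserts U+FFFD
decoded_ignore = bad_bytes.decode("utf-8", "ignore")    # drops the bad byte
```

In short, 'ignore' silently loses data, while 'replace' leaves a marker ('?' when encoding, U+FFFD when decoding) where the problem was.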

Is your intent to be able to reimport the MARC output records into III? I was able to get around this problem using the following code snippet, which replicates pymarc's MARCWriter.write method:

import cStringIO

def marc_records_response(marc_records):  # hypothetical wrapper name; the buffer goes back to the calling Django web request
    # marc_records: a saved list, or a pymarc.MARCReader instance
    output = cStringIO.StringIO()  # you could replace this with a file object
    for record in marc_records:
        record_str = record.as_marc()
        output.write(record_str.encode('utf8', 'replace'))
    output.seek(0)  # rewind instead of close(): a closed buffer cannot be read by the caller
    return output

This ultimately failed when I tried to re-import these records into Millennium because we haven't set up UTF-8 in our III instance (which I've been ignoring as we shift more of our bibliographic record management to Redis here at Colorado College).

Jeremy

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Jonathan Rochkind
Sent: Thursday, March 08, 2012 1:51 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

Oh, and why do I favor this solution?

Compared to passing input through as is: you're just prolonging the pain. Something downstream is still going to have a problem with it, and outputting known illegal data is not a good idea.

Compared to heuristically guessing encoding: heuristic guessing is okay, but obviously a good deal harder than just replacing bad data with the unicode 'replacement' glyph.  But honestly, I don't _want_ this kind of mis-encoded data to be completely transparent. I want it to do something to make the error visible (without stopping the app or data transformation process in its tracks), so catalogers can't possibly think that the data is just fine.  If you use heuristics to guess, sometimes those heuristics will fail, and when they do, the catalogers will think there's something wrong with your logic. "But it works fine for all the other records that you say have the same problem, why can't it work fine for this one?"  This is partly a result of my general conclusions, from experience, about trying to heuristically 'autocorrect' bad marc data: I try to do it as minimally as possible. It's too easy to get into a long battle with trying to make your heuristics better, instead of focusing on, you know, actually fixing the data.
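
One way to keep the error visible without halting the process could look like the sketch below. It assumes Python 3; the helper name decode_visibly and the sample bytes are hypothetical, not anything from pymarc. It decodes with the 'replace' handler and reports how many replacement characters ended up in the result, so the record can be logged or flagged for a cataloger.

```python
REPLACEMENT = "\ufffd"  # the Unicode replacement character

def decode_visibly(raw, claimed_encoding="utf-8"):
    """Decode raw bytes, substituting U+FFFD for undecodable bytes.

    Returns (text, bad_count) so the caller can log or flag the record
    instead of crashing or silently passing bad data along.
    """
    text = raw.decode(claimed_encoding, errors="replace")
    return text, text.count(REPLACEMENT)

# Illustrative input: ANSEL puts the umlaut byte 0xe8 before its base letter.
ansel_like = b"M\xe8uller"
bad_text, bad_count = decode_visibly(ansel_like)

# Correctly encoded UTF-8 passes through with nothing flagged.
good_text, good_count = decode_visibly("M\u00fcller".encode("utf-8"))
```

One caveat: a record that legitimately contains U+FFFD would be over-counted, so the count is a signal to investigate rather than proof of mis-encoding.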

Now, a place where I'd be willing to use heuristics: a bulk process to try to actually fix the data in your ILS. Something that goes through all your marc and flags records that aren't legal for the encoding they claim to be. If you want to add heuristics there to try to guess what encoding they really are and automatically fix them, that doesn't seem a terrible idea to me.  But working around the problem with heuristics at higher levels does; spend time on actually fixing the bad data instead.  Bad marc data, including illegal char encodings, is a continual inconvenience. If you work around it in your pymarc-based software, eventually you'll have some other software in a different language that you have to duplicate your workarounds in.
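
The flagging pass described above might be sketched as follows. This assumes Python 3 and, to stay self-contained, works on plain (record_id, leader, raw_bytes) tuples rather than pymarc objects; the function names and sample records are hypothetical. It only catches one direction of the problem, namely records whose leader claims Unicode (leader[9] == 'a') but whose bytes are not valid UTF-8.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def flag_misencoded(records):
    """Yield ids of records that claim Unicode but are not valid UTF-8.

    records: iterable of (record_id, leader, raw_bytes) tuples.
    """
    for record_id, leader, raw in records:
        if leader[9] == "a" and not is_valid_utf8(raw):
            yield record_id

# Illustrative sample data, not real catalog records.
LEADER_UNICODE = "00000nam a2200000 a 4500"  # leader[9] == 'a'
LEADER_MARC8 = "00000nam  2200000 a 4500"    # leader[9] == ' '
sample = [
    ("r1", LEADER_UNICODE, "M\u00fcller".encode("utf-8")),  # consistent
    ("r2", LEADER_UNICODE, b"M\xe8uller"),                  # claims Unicode, is not
    ("r3", LEADER_MARC8, b"M\xe8uller"),                    # claims MARC-8, not checked
]
flagged = list(flag_misencoded(sample))
```

A record that claims MARC-8 but actually holds UTF-8 is not detected here, and pure-ASCII data is valid under both encodings, so this is a starting point for a cleanup report rather than a complete check.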

On 3/8/2012 3:45 PM, Jonathan Rochkind wrote:
> a) Mis-characterized MARC char encodings are common amongst many of 
> our corpuses and ILS's. It is a common problem. It can be very 
> inconvenient. Not only Marc8 that says it's UTF8 and vice versa, but 
> something that says it's MARC8 or UTF8 but is actually neither.
>
> b) While one solution would be having the marc tool pass the char 
> stream through as is without complaining, like Godmar suggested, and 
> another solution would be trying to heuristically guess the 'real'
> encoding, like Gabe suggests, personally I favor a different solution:
>
> The thing that's encoding as unicode on the way out?  Instead of 
> raising on an invalid char, it should have the option of silently 
> eating it, replacing it with either empty string or the unicode 
> "replacement character" ( "used to replace an incoming character whose 
> value is unknown or unrepresentable in Unicode"
> [http://www.fileformat.info/info/unicode/char/fffd/index.htm] )
>
> I have worked with character encoding libraries before that have this 
> option: replace messed-up bytes with the unicode replacement char. I 
> don't know what's available in Python though.
>
> Jonathan
>
> On 3/8/2012 3:19 PM, Gabriel Farrell wrote:
>> Sounds like what you do, Terry, and what we need in PyMARC, is 
>> something like UnicodeDammit [0]. Actually handling all of these 
>> esoteric encodings would be quite the chore, though.
>>
>> I also used to think it would be cool if we could get MARC8 
>> encoding/decoding into the Python standard library, but then I 
>> realized I'd rather work on other stuff while MARC8 withers and dies.
>>
>>
>> [0]
>> https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753
>>
>> On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry 
>> <[log in to unmask]>  wrote:
>>> This is one of the reasons you really can't trust the information 
>>> found in leader position 9.  It is also why, when I wrote MarcEdit, 
>>> I used a mixed process when working with data and determining the 
>>> character set -- a process that reads this byte and takes the 
>>> information under advisement, but in the end treats it more as a 
>>> suggestion and one part of a larger heuristic analysis of the record 
>>> data to determine whether the information is in UTF8 or not.
>>> Fortunately, determining whether a set of data is in UTF8 or 
>>> something else is a fairly easy process.  Determining the something 
>>> else is much more difficult, but generally not necessary.
>>>
>>> For that reason, if I were advising other people working on MARC 
>>> processing libraries, I'd advocate having a process for recognizing 
>>> that certain informational data may not be set correctly, and 
>>> essentially using a compatibility process to read and correct 
>>> them.  Unfortunately, while the number of vendors and systems that 
>>> set this encoding byte correctly has increased dramatically (it 
>>> used to be pretty much no one), it's still so uneven that I 
>>> generally consider this information unreliable.
>>>
>>> --TR
>>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf 
>>> Of Godmar Back
>>> Sent: Thursday, March 08, 2012 11:01 AM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and 
>>> misencoded III records
>>>
>>> On Thu, Mar 8, 2012 at 1:46 PM, Terray, James<[log in to unmask]>  
>>> wrote:
>>>
>>>> Hi Godmar,
>>>>
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9: ordinal not in range(128)
>>>>
>>>> Having seen my fair share of these kinds of encoding errors in 
>>>> Python, I can speculate (without seeing the pymarc source code, so 
>>>> please don't hold me to this) that it's the Python code that's not 
>>>> set up to handle the UTF-8 strings from your data source. In fact, 
>>>> the error indicates it's using the default 'ascii' codec rather 
>>>> than 'utf-8'. If it said "'utf-8' codec can't decode...", then I'd 
>>>> suspect a problem with the data.
>>>>
>>>> If you were to send the full traceback (all the gobbledy-gook that 
>>>> Python spews when it encounters an error) and the version of pymarc 
>>>> you're using to the program's author(s), they may be able to help 
>>>> you out further.
>>>>
>>>>
>>> My question is less about the Python error, which I understand, than 
>>> about the MARC record causing the error and about how others deal 
>>> with this issue (if it's a common issue, which I do not know.)
>>>
>>> But, here's the long story from pymarc's perspective.
>>>
>>> The record has leader[9] == 'a', but really, truly contains 
>>> ANSEL-encoded data.  When reading the record with a 
>>> MARCReader(to_unicode=False) instance, the record reads ok since 
>>> no decoding is attempted, but attempts at writing the record fail 
>>> with the above error since pymarc attempts to utf8-encode the 
>>> ANSEL-encoded string, which contains non-ascii chars such as 0xe8 
>>> (the ANSEL umlaut prefix). It does so because leader[9] == 'a' 
>>> (see [1]).
>>>
>>> When reading the record with a MARCReader(to_unicode=True) instance, 
>>> it'll throw an exception during marc_decode when trying to 
>>> utf8-decode the ANSEL-encoded string. Rightly so.
>>>
>>> I don't blame pymarc for this behavior; to me, the record looks wrong.
>>>
>>>   - Godmar
>>>
>>> (ps: that said, what pymarc does fails in different circumstances - 
>>> from what I can see, pymarc shouldn't assume that it's ok to 
>>> utf8-encode the field data if leader[9] is 'a'.  For instance, this 
>>> would double-encode correctly encoded MARC/Unicode records that were 
>>> read with a MARCReader(to_unicode=False) instance. But that's a 
>>> separate issue that is not my immediate concern. pymarc should 
>>> probably remember whether a record needs encoding when writing it, 
>>> rather than consulting the leader[9] field.)
>>>
>>>
>>> [1]
>>> https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
>>>