LISTSERV 16.5 - CODE4LIB Archives

I ended up writing a regular expression find-and-replace function that replaces all illegal XML characters with a dash or similar placeholder. I was more disappointed that, on the XML creation end, minidom was able to create non-compliant XML files. I assumed that anything minidom could produce would be compliant, but that doesn't seem to be the case. Now I have to add a find-and-replace step on the creation side to avoid this issue in the future. Good learning experience, I guess. Thanks for all your suggestions.
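
A minimal sketch of that kind of find-and-replace, assuming Python 3 and the character ranges allowed by the XML 1.0 "Char" production. The names ILLEGAL_XML_CHARS and scrub_xml_text are made up for illustration and are not from the original script:

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (hypothetical name, chosen for this example)
ILLEGAL_XML_CHARS = re.compile(
    "[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement="-"):
    """Replace every character that XML 1.0 forbids with `replacement`."""
    return ILLEGAL_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # U+FFFF is outside the allowed ranges, so it becomes a dash.
    print(scrub_xml_text("Buffalo\uffffLaunch"))
```

Running the same function on each CONTENT value before calling setAttribute would probably cover the creation side as well.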

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
[log in to unmask]
Become a friend of Paul Smith's Library on Facebook today!


-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Chris Beer
Sent: Tuesday, March 05, 2013 1:48 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] XML Parsing and Python

I'll note that U+FFFF is a Unicode noncharacter (not a UTF-8 one), and "these noncharacters should never be included in text interchange between implementations." [1] I assume the OCR engine may be using 0xFFFF when it can't recognize a character? So it's not wrong for a parser to complain (or not complain) about 0xFFFF, and you can just scrub the string like Jon suggests.

Chris


[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
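
To make the "complain or not complain" point concrete, here is a small sketch, assuming Python 3. It uses the standard library's expat-based parser in place of lxml purely to avoid a dependency, so the error wording will differ from the one Mike saw. The helper names build_page and parses_cleanly are hypothetical:

```python
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError


def build_page(content):
    """Serialize a one-word Page document; minidom does not check characters."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    word = doc.createElement("String")
    word.setAttribute("CONTENT", content)
    page.appendChild(word)
    return doc.toxml()


def parses_cleanly(xml_text):
    """Return True if the expat-based parser accepts the text."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False


if __name__ == "__main__":
    xml_text = build_page("Club\uffff")
    print("\uffff" in xml_text)      # minidom wrote the noncharacter as-is
    print(parses_cleanly(xml_text))  # the parser refuses to read it back
```

So the writer and the reader simply apply different levels of checking, which is why scrubbing before serialization is the safer place to fix it.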

On 5 Mar, 2013, at 9:16 , Jon Stroop <[log in to unmask]> wrote:

> Mike,
> I haven't used minidom extensively, but my guess is that doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the encoding because it can't parse the string in your content variable. I'm surprised that you're not getting tossed a UnicodeError, but the docs for Node.toxml() [1] might shed some light:
> 
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as "utf-8".
> 
> So what happens if you're not explicit about the encoding, i.e. just doc.toprettyxml()? This would hopefully at least move your exception to a more appropriate place.
> 
> In any case, one solution would be to scrub the string in your content variable to get rid of the invalid characters (hopefully they're insignificant). Maybe something like this:
> 
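> def unicode_filter(char):
>    # Keep a byte only if it decodes as UTF-8 on its own (Python 2 str).
>    # Note that this also drops every byte of a multi-byte character.
>    try:
>        unicode(char, encoding='utf-8', errors='strict')
>        return char
>    except UnicodeDecodeError:
>        return ''
> 
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content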
> 
> Not really my area of expertise, but maybe worth a shot....
> -Jon
> 
> 1. http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml
> 
> --
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
> [log in to unmask]
> 
> 
> 
> 
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the ocr data found in a pdf and places it in a custom xml file.
>> 
>> I use Python scripts to create the xml file. Something like this (trimmed down a bit):
>> 
>> from StringIO import StringIO  # Python 2; assumed, not shown in the original
>> from xml.dom.minidom import Document
>> 
>> # Trimmed down: this is the body of a function, hence the return.
>> doc = Document()
>> Page = doc.createElement("Page")
>> doc.appendChild(Page)
>> f = StringIO(txt)
>> lines = f.readlines()
>> for line in lines:
>>     word = doc.createElement("String")
>>     ...
>>     word.setAttribute("CONTENT", content)
>>     Page.appendChild(word)
>> return doc.toprettyxml(indent="  ", encoding="utf-8")
>> 
>> 
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296" WIDTH="2609">
>>   <String CONTENT="BuffaloLaunch" />
>>   <String CONTENT="Club" />
>>   <String CONTENT="Offices" />
>>   <String CONTENT="Installed" />
>>   ...
>> </Page>
>> 
>> I am able to create this document and save it to an XML file without trouble. The problem occurs when I try to read it with the lxml library:
>> 
>> from lxml import etree
>> doc = etree.parse(filename)
>> 
>> 
>> I am running into errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19", which, when I look at the file, is true: there is a 0xFFFF character in the content field.
>> 
>> How can a file be created with minidom (which I assumed would produce a valid XML file) and then fail when parsed with lxml? What should I do to fix this on the encoding side so that errors don't show up on the parsing side?
>> Thanks,
>> Mike
>> 
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> 518.327.6376
>> [log in to unmask]
>> Become a friend of Paul Smith's Library on Facebook today!