I ended up writing a regular-expression find-and-replace function that replaces every illegal XML character with a placeholder such as a dash. I was more disappointed to find that, on the XML creation end, minidom was able to create non-compliant XML files. I had assumed that anything minidom could produce would be compliant, but that doesn't seem to be the case. Now I have to add a find-and-replace step on the creation side to avoid this issue in the future. Good learning experience, I guess. Thanks for all your suggestions.
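For anyone who finds this thread later, a minimal sketch of that kind of scrub, written for Python 3 rather than the Python 2 used elsewhere in this thread. The function name `scrub_xml_text` and the dash default are illustrative assumptions, not the exact code I used; the character ranges follow the "Char" production of the XML 1.0 specification.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement="-"):
    """Replace every character that XML 1.0 does not allow with `replacement`."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# Hypothetical OCR output containing the U+FFFF noncharacter
cleaned = scrub_xml_text("Club\uffff")
```

Calling this on each string before it is handed to minidom keeps the illegal characters out of the file in the first place, so the parsing side never sees them.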
Head of Digital Initiative
Paul Smith's College
[log in to unmask]
Become a friend of Paul Smith's Library on Facebook today!
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Chris Beer
Sent: Tuesday, March 05, 2013 1:48 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] XML Parsing and Python
I'll note that U+FFFF (0xFFFF) is a Unicode noncharacter, and "these noncharacters should never be included in text interchange between implementations." I assume the OCR engine may be using 0xFFFF when it can't recognize a character? So it's not wrong for a parser to complain (or not complain) about 0xFFFF, and you can just scrub the string like Jon suggests.
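A small Python 3 sketch (the thread itself uses Python 2) of why no UnicodeError shows up on the creation side: U+FFFF has a perfectly ordinary UTF-8 byte sequence, so encoding succeeds, and minidom serializes the attribute without checking it against XML's allowed character range. `build_sample` is a hypothetical helper, and the element and attribute names are only borrowed from Mike's sample output.

```python
from xml.dom.minidom import Document


def build_sample(content):
    """Serialize a one-element document shaped like the sample in this thread."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    word = doc.createElement("String")
    word.setAttribute("CONTENT", content)
    page.appendChild(word)
    # With an explicit encoding, toxml() returns bytes
    return doc.toxml(encoding="utf-8")


NONCHAR = "\uffff"

# Encoding the noncharacter raises no exception
encoded = NONCHAR.encode("utf-8")

# The serialized document carries those bytes through unchanged
xml_bytes = build_sample("Club" + NONCHAR)
has_nonchar = encoded in xml_bytes
```

So the character has to be caught by an explicit check or scrub before serialization; the encoding step will not flag it.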
On 5 Mar, 2013, at 9:16 , Jon Stroop <[log in to unmask]> wrote:
> I haven't used minidom extensively, but my guess is that doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the encoding because it can't parse the string in your content variable. I'm surprised that you're not getting tossed a UnicodeError, but the docs for Node.toxml() might shed some light:
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as "utf-8".
> So what happens if you're not explicit about the encoding, i.e. just doc.toprettyxml()? This would hopefully at least move your exception to a more appropriate place.
> In any case, one solution would be to scrub the string in your content variable to get rid of the invalid characters (hopefully they're insignificant). Maybe something like this:
> def unicode_filter(char):
>     try:
>         # Keep the byte only if it decodes as UTF-8 by itself
>         unicode(char, encoding='utf-8', errors='strict')
>         return char
>     except UnicodeDecodeError:
>         return ''
>
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content
> Not really my area of expertise, but maybe worth a shot....
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
> [log in to unmask]
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the ocr data found in a pdf and places it in a custom xml file.
>> I use Python scripts to create the xml file. Something like this (trimmed down a bit):
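>> from StringIO import StringIO
>> from xml.dom.minidom import Document
>>
>> def make_page_xml(txt):  # placeholder name; the enclosing function was trimmed
>>     doc = Document()
>>     Page = doc.createElement("Page")
>>     doc.appendChild(Page)
>>     f = StringIO(txt)
>>     lines = f.readlines()
>>     for line in lines:
>>         word = doc.createElement("String")
>>         # CONTENT attribute inferred from the sample output below
>>         word.setAttribute("CONTENT", line.strip())
>>         Page.appendChild(word)
>>     return doc.toprettyxml(indent=" ",encoding="utf-8")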
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296"
>> <String CONTENT="BuffaloLaunch" />
>> <String CONTENT="Club" />
>> <String CONTENT="Offices" />
>> <String CONTENT="Installed" />
>> I am able to create this document and save it to an xml file without trouble. The problem occurs when I try to read it using the lxml library:
>> from lxml import etree
>> doc = etree.parse(filename)
>> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19". When I look at the file, that is true: there is a 0xFFFF character in the content field.
>> How can a file be created using minidom (which I assumed would create a valid xml file) and then fail when parsed with lxml? What should I do to fix this on the encoding side so that errors don't show up on the parsing side?
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> [log in to unmask]