I'll note that 0xFFFF (U+FFFF) is a Unicode noncharacter, and "these noncharacters should never be included in text interchange between implementations." [1] I assume the OCR engine may be using 0xFFFF when it can't recognize a character? XML 1.0 also excludes U+FFFE and U+FFFF from its allowed character range, which is the check lxml is reporting. So it's not wrong for a parser to complain about 0xFFFF (and a library that doesn't validate its text won't complain), and you can just scrub the string like Jon suggests.
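
For what it's worth, here is a minimal sketch of that kind of scrub, written against the character ranges the XML 1.0 spec allows rather than against decode errors. It assumes Python 3 and that the content has already been decoded to a text string; the names `scrub_xml_text` and `_XML_ILLEGAL` are made up for illustration and aren't part of minidom or lxml.

```python
import re

# Characters permitted by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything else, which includes
# U+FFFE, U+FFFF, most control characters, and lone surrogates.
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement=""):
    """Return text with characters not allowed in XML 1.0 replaced.

    Hypothetical helper: by default the offending characters are dropped.
    """
    return _XML_ILLEGAL.sub(replacement, text)


if __name__ == "__main__":
    content = "Club\uffff"
    print(repr(scrub_xml_text(content)))
```

Something like this would be applied to each value just before the word.setAttribute("CONTENT", content) call. Passing a visible replacement such as "?" instead of the empty string keeps a marker where the OCR engine gave up, if that matters for your data.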
Chris
[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
On 5 Mar, 2013, at 9:16, Jon Stroop wrote:
> Mike,
> I haven't used minidom extensively, but my guess is that doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the encoding because it can't parse the string in your content variable. I'm surprised that you're not getting tossed a UnicodeError, but the docs for Node.toxml() [1] might shed some light:
>
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
>
> So what happens if you're not explicit about the encoding, i.e. just doc.toprettyxml()? This would hopefully at least move your exception to a more appropriate place.
>
> In any case, one solution would be to scrub the string in your content variable to get rid of the invalid characters (hopefully they're insignificant). Maybe something like this:
>
> def unicode_filter(char):
>     try:
>         unicode(char, encoding='utf-8', errors='strict')
>         return char
>     except UnicodeDecodeError:
>         return ''
>
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content
>
> Not really my area of expertise, but maybe worth a shot....
> -Jon
>
> 1. http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml
>
> --
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
>
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the ocr data found in a pdf and places it in a custom xml file.
>>
>> I use Python scripts to create the xml file. Something like this (trimmed down a bit):
>>
>> from xml.dom.minidom import Document
>> doc = Document()
>> Page = doc.createElement("Page")
>> doc.appendChild(Page)
>> f = StringIO(txt)
>> lines = f.readlines()
>> for line in lines:
>>     word = doc.createElement("String")
>>     ...
>>     word.setAttribute("CONTENT",content)
>>     Page.appendChild(word)
>> return doc.toprettyxml(indent=" ",encoding="utf-8")
>>
>>
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296" WIDTH="2609">
>> <String CONTENT="BuffaloLaunch" />
>> <String CONTENT="Club" />
>> <String CONTENT="Offices" />
>> <String CONTENT="Installed" />
>> ...
>> </Page>
>>
>> I am able to create this document and save it to an xml file. The problem occurs when I try to read it using the lxml library:
>>
>> from lxml import etree
>> doc = etree.parse(filename)
>>
>>
>> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19", which, when I look at the file, is true: there is a 0xFFFF character in the content field.
>>
>> How is a file able to be created using minidom (which I assume would create a valid xml file) and then failing when parsing with lxml? What should I do to fix this on the encoding side so that errors don't show up on the parsing side?
>> Thanks,
>> Mike
>>
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> 518.327.6376
>> Become a friend of Paul Smith's Library on Facebook today!