LISTSERV 16.5 - CODE4LIB Archives

I'll note that U+FFFF (0xFFFF) is a Unicode noncharacter, and "these noncharacters should never be included in text interchange between implementations." [1] I assume the OCR engine may be using 0xFFFF when it can't recognize a character? XML 1.0 also leaves 0xFFFF out of its allowed character range, so it's not wrong for a parser to complain (or not complain) about it, and you can just scrub the string like Jon suggests.
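
For what it's worth, here is a rough sketch of that scrubbing step. It assumes Python 3 (Mike's script looks like Python 2, so it would need adapting), and the names scrub_xml_text, build_page and _XML10_ILLEGAL are made up for illustration, not anything from minidom or lxml. The character class is my reading of the XML 1.0 "Char" production: it drops the C0 controls other than tab, LF and CR, plus surrogates and U+FFFE/U+FFFF. It also shows that minidom writes U+FFFF out without checking it, which would explain why the file gets created but won't parse.

```python
import re
from xml.dom.minidom import Document, parseString

# Assumption: code points outside the XML 1.0 "Char" production, i.e.
# C0 controls except tab/LF/CR, surrogates, and U+FFFE / U+FFFF.
_XML10_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def scrub_xml_text(text, replacement=""):
    """Return text with characters XML 1.0 does not allow removed (or replaced)."""
    return _XML10_ILLEGAL.sub(replacement, text)


def build_page(words):
    """Build a <Page> with one <String CONTENT="..."/> per word, as UTF-8 bytes."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    for content in words:
        word = doc.createElement("String")
        word.setAttribute("CONTENT", content)
        page.appendChild(word)
    return doc.toprettyxml(indent="  ", encoding="utf-8")


raw_words = ["Buffalo", "Club\uffff"]

# minidom serializes U+FFFF as-is (bytes EF BF BF in UTF-8); it does not validate.
unscrubbed = build_page(raw_words)
has_ffff = b"\xef\xbf\xbf" in unscrubbed

# Scrubbing before setAttribute keeps the noncharacter out of the file.
scrubbed = build_page([scrub_xml_text(w) for w in raw_words])
reparsed = parseString(scrubbed)
contents = [el.getAttribute("CONTENT")
            for el in reparsed.getElementsByTagName("String")]
print(has_ffff, contents)
```

Passing replacement="\ufffd" instead of the default would leave a visible marker where the OCR engine gave up, which may be preferable to silently dropping the character.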

Chris


[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

On 5 Mar, 2013, at 9:16, Jon Stroop wrote:

> Mike,
> I haven't used minidom extensively but my guess is that doc.toprettyxml(indent=" ",encoding="utf-8") isn't actually changing the encoding because it can't parse the string in your content variable. I'm surprised that you're not getting tossed a UnicodeError, but the docs for Node.toxml() [1] might shed some light:
> 
>> To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
> 
> So what happens if you're not explicit about the encoding, i.e. just doc.toprettyxml()? This would hopefully at least move your exception to a more appropriate place.
> 
> In any case, one solution would be to scrub the string in your content variable to get rid of the invalid characters (hopefully they're insignificant). Maybe something like this:
> 
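> # Python 2: content is a byte string, so this checks one byte at a time.
> # Any byte that is not valid UTF-8 on its own (i.e. every non-ASCII byte)
> # is dropped.
> def unicode_filter(char):
>    try:
>        unicode(char, encoding='utf-8', errors='strict')
>        return char
>    except UnicodeDecodeError:
>        return ''
> 
> content = 'abc\xFF'
> content = ''.join(map(unicode_filter, content))
> print content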
> 
> Not really my area of expertise, but maybe worth a shot....
> -Jon
> 
> 1. http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml
> 
> -- 
> Jon Stroop
> Digital Initiatives Programmer/Analyst
> Princeton University Library
> 
> 
> 
> 
> On 03/04/2013 03:00 PM, Michael Beccaria wrote:
>> I'm working on a project that takes the OCR data found in a PDF and places it in a custom XML file.
>> 
>> I use Python scripts to create the XML file. Something like this (trimmed down a bit):
>> 
>> from StringIO import StringIO  # Python 2; needed for StringIO(txt) below
>> from xml.dom.minidom import Document
>> 
>> def build_page_xml(txt):  # placeholder name; the original function header was trimmed
>>     doc = Document()
>>     Page = doc.createElement("Page")
>>     doc.appendChild(Page)
>>     f = StringIO(txt)
>>     lines = f.readlines()
>>     for line in lines:
>>         word = doc.createElement("String")
>>         ...
>>         word.setAttribute("CONTENT",content)
>>         Page.appendChild(word)
>>     return doc.toprettyxml(indent="  ",encoding="utf-8")
>> 
>> 
>> This creates a file, simply, that looks like this:
>> <?xml version="1.0" encoding="utf-8"?>
>> <Page HEIGHT="3296" WIDTH="2609">
>>   <String CONTENT="BuffaloLaunch" />
>>   <String CONTENT="Club" />
>>   <String CONTENT="Offices" />
>>   <String CONTENT="Installed" />
>>   ...
>> </Page>
>> 
>> I can create this document and save it to an XML file without trouble. The problem occurs when I try to read it using the lxml library:
>> 
>> from lxml import etree
>> doc = etree.parse(filename)
>> 
>> 
>> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19", which is accurate when I look at the file: there is a 0xFFFF character in the content field.
>> 
>> How can a file be created using minidom (which I assume would create a valid XML file) and then fail when parsed with lxml? What should I do on the encoding side so that errors don't show up on the parsing side?
>> Thanks,
>> Mike
>> 
>> Mike Beccaria
>> Systems Librarian
>> Head of Digital Initiative
>> Paul Smith's College
>> 518.327.6376