It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example).
You could try telling the parser explicitly to use UTF-8, like this:

parser = XMLParser(encoding="utf-8")

This is discussed in
http://www.daniweb.com/software-development/python/threads/435360/using-xml.etree-with-xml-files-containing-a-symbol
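
For what it's worth, here is a minimal sketch of that suggestion using the standard library's xml.etree.ElementTree, assuming that is the XMLParser meant above. The sample document and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small made-up document, encoded as UTF-8 bytes (not taken from your file).
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<Page><String CONTENT="caf\u00e9" /></Page>'
).encode("utf-8")

# Tell the parser explicitly which encoding to use instead of
# relying on the declaration in the file.
parser = ET.XMLParser(encoding="utf-8")
parser.feed(xml_bytes)
root = parser.close()

# Read back the attribute of the first <String> element.
content = root[0].get("CONTENT")
```

To parse a file on disk rather than bytes in memory, the same parser object can be passed as ET.parse(filename, parser=parser).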
There's also a bit of discussion about using lxml to parse UTF-8 in
http://stackoverflow.com/questions/3402520/is-there-a-way-to-force-lxml-to-parse-unicode-strings-that-specify-an-encoding-i
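
If the explicit encoding doesn't make the error go away, it may be worth checking whether the OCR text itself contains characters that XML 1.0 doesn't allow. U+FFFF is outside the range the spec permits even when the file is valid UTF-8, and minidom doesn't appear to check for that when writing. A rough sketch of filtering such characters before calling setAttribute follows; strip_invalid_xml_chars is a hypothetical helper name, not part of minidom or lxml, and the sample text is made up.

```python
import re
from xml.dom.minidom import Document

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    u"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters XML 1.0 does not allow."""
    return _INVALID_XML_CHARS.sub(u"", text)

# Build a tiny document the same way as in your script, but clean
# the OCR text before it goes into the attribute.
doc = Document()
page = doc.createElement("Page")
doc.appendChild(page)
word = doc.createElement("String")
word.setAttribute("CONTENT", strip_invalid_xml_chars(u"Club\uffff"))
page.appendChild(word)
xml_bytes = doc.toprettyxml(indent=" ", encoding="utf-8")
```

Whether to drop the bad characters or replace them with a placeholder such as U+FFFD is a judgment call; dropping them is just the simplest thing to show here.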
Hope this helps!
Regards,
Stuart
On Mon, Mar 4, 2013 at 3:00 PM, Michael Beccaria wrote:
> I'm working on a project that takes the ocr data found in a pdf and places
> it in a custom xml file.
>
> I use Python scripts to create the xml file. Something like this (trimmed
> down a bit):
>
> from io import StringIO  # import assumed; not shown in the original
> from xml.dom.minidom import Document
>
> def build_page_xml(txt):  # wrapper name assumed; only the body was posted
>     doc = Document()
>     Page = doc.createElement("Page")
>     doc.appendChild(Page)
>     f = StringIO(txt)
>     lines = f.readlines()
>     for line in lines:
>         word = doc.createElement("String")
>         ...
>         word.setAttribute("CONTENT",content)
>         Page.appendChild(word)
>     return doc.toprettyxml(indent=" ",encoding="utf-8")
>
>
> This creates a file, simply, that looks like this:
> <?xml version="1.0" encoding="utf-8"?>
> <Page HEIGHT="3296" WIDTH="2609">
>  <String CONTENT="BuffaloLaunch" />
>  <String CONTENT="Club" />
>  <String CONTENT="Offices" />
>  <String CONTENT="Installed" />
>  ...
> </Page>
>
> I am able to create this document and save it to an xml file without
> trouble. The problem occurs when I try to read it using the lxml library:
>
> from lxml import etree
> doc = etree.parse(filename)
>
>
> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of
> allowed range, line 94, column 19", which is accurate when I look at the
> file: there is a 0xFFFF character in the content field.
>
> How can a file be created using minidom (which I assume would
> create a valid xml file) and then fail when parsed with lxml? What
> should I do to fix this on the encoding side so that errors don't show up
> on the parsing side?
> Thanks,
> Mike
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> Become a friend of Paul Smith's Library on Facebook today!
>