It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example).
You could try telling the parser explicitly to use UTF-8, like this:
parser = etree.XMLParser(encoding="utf-8")
doc = etree.parse(filename, parser)
As discussed in
There's also a bit of discussion about using lxml to parse UTF-8 in
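Separately, the 0xFFFF in the error message may be worth a look on its own: U+FFFF is outside the character range that XML 1.0 allows, and as far as I know minidom does not check for that when it serializes, so it might help to strip such characters before building the document. Here is a rough Python 3 sketch of that idea. The names strip_invalid_xml_chars and build_page are made up for illustration (they are not part of minidom or lxml), and the Page attributes from your example are left out.

```python
import re
from xml.dom.minidom import Document

# Matches anything outside the XML 1.0 "Char" production: most control
# characters, lone surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters not allowed in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


def build_page(lines):
    """Hypothetical helper: one <String> element per input line."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    for line in lines:
        word = doc.createElement("String")
        # Clean the OCR text before it goes into the attribute value.
        word.setAttribute("CONTENT", strip_invalid_xml_chars(line.strip()))
        page.appendChild(word)
    # With an encoding given, toprettyxml returns bytes.
    return doc.toprettyxml(indent=" ", encoding="utf-8")
```

Dropping the characters is the simplest choice; if you would rather keep a visible marker where the OCR produced junk, you could substitute "\ufffd" (the replacement character, which XML does allow) instead of the empty string.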
Hope this helps!
On Mon, Mar 4, 2013 at 3:00 PM, Michael Beccaria wrote:
> I'm working on a project that takes the OCR data found in a PDF and places
> it in a custom XML file.
> I use Python scripts to create the XML file. Something like this (trimmed
> down a bit):
> from xml.dom.minidom import Document
> doc = Document()
> Page = doc.createElement("Page")
> f = StringIO(txt)
> lines = f.readlines()
> for line in lines:
>     word = doc.createElement("String")
> return doc.toprettyxml(indent=" ",encoding="utf-8")
> This creates a file, simply, that looks like this:
> <?xml version="1.0" encoding="utf-8"?>
> <Page HEIGHT="3296" WIDTH="2609">
> <String CONTENT="BuffaloLaunch" />
> <String CONTENT="Club" />
> <String CONTENT="Offices" />
> <String CONTENT="Installed" />
> I am able to get this document to be created ok and saved to an xml file.
> The problem occurs when I try and have it read using the lxml library:
> from lxml import etree
> doc = etree.parse(filename)
> I am running into errors like "XMLSyntaxError: Char 0xFFFF out of
> allowed range, line 94, column 19", which turns out to be true when I look
> at the file: there is a 0xFFFF character in the content field.
> How can a file be created using minidom (which I assume would
> create a valid XML file) and then fail when parsed with lxml? What
> should I do to fix this on the encoding side so that errors don't show up
> on the parsing side?
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College