It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example). You could try
telling the parser explicitly to use UTF-8, like this:

    parser = XMLParser(encoding="utf-8")

As discussed in
http://www.daniweb.com/software-development/python/threads/435360/using-xml.etree-with-xml-files-containing-a-symbol

There's also a bit of discussion about using lxml to parse UTF-8 in
http://stackoverflow.com/questions/3402520/is-there-a-way-to-force-lxml-to-parse-unicode-strings-that-specify-an-encoding-i

Hope this helps!

Regards,
Stuart

On Mon, Mar 4, 2013 at 3:00 PM, Michael Beccaria <[log in to unmask]> wrote:
> I'm working on a project that takes the ocr data found in a pdf and places
> it in a custom xml file.
>
> I use Python scripts to create the xml file. Something like this (trimmed
> down a bit):
>
> from xml.dom.minidom import Document
> doc = Document()
> Page = doc.createElement("Page")
> doc.appendChild(Page)
> f = StringIO(txt)
> lines = f.readlines()
> for line in lines:
>     word = doc.createElement("String")
>     ...
>     word.setAttribute("CONTENT",content)
>     Page.appendChild(word)
> return doc.toprettyxml(indent=" ",encoding="utf-8")
>
>
> This creates a file, simply, that looks like this:
> <?xml version="1.0" encoding="utf-8"?>
> <Page HEIGHT="3296" WIDTH="2609">
>  <String CONTENT="BuffaloLaunch" />
>  <String CONTENT="Club" />
>  <String CONTENT="Offices" />
>  <String CONTENT="Installed" />
>  ...
> </Page>
>
> I am able to get this document to be created ok and saved to an xml file.
> The problem occurs when I try and have it read using the lxml library:
>
> from lxml import etree
> doc = etree.parse(filename)
>
>
> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of
> allowed range, line 94, column 19". Which when I look at the file, is true.
> There is a 0XFFFF character in the content field.
>
> How is a file able to be created using minidom (which I assume would
> create a valid xml file) and then failing when parsing with lxml? What
> should I do to fix this on the encoding side so that errors don't show up
> on the parsing side?
> Thanks,
> Mike
>
> How is the
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> [log in to unmask]
> Become a friend of Paul Smith's Library on Facebook today!
>
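
To make the explicit-encoding suggestion at the top of this message a bit more concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, which is an assumption on my part, as is everything named in it: the SAMPLE document and the parse_as_utf8 helper are made up for illustration and are not from your script.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the thread): UTF-8 bytes with a
# non-ASCII character in an attribute value.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<Page HEIGHT="3296" WIDTH="2609">\n'
    ' <String CONTENT="Caf\u00e9" />\n'
    '</Page>\n'
).encode("utf-8")


def parse_as_utf8(source):
    """Parse a file name or binary file object, forcing UTF-8 decoding.

    Illustrative helper only; the name is not part of any library.
    """
    parser = ET.XMLParser(encoding="utf-8")
    return ET.parse(source, parser=parser)


tree = parse_as_utf8(io.BytesIO(SAMPLE))
# Collect the CONTENT attribute of every <String> element.
contents = [el.get("CONTENT") for el in tree.getroot().iter("String")]
```

With lxml the same shape should apply: build the parser with etree.XMLParser(encoding="utf-8") and pass it as the second argument to etree.parse(filename, parser). One caveat: an explicit encoding only changes how the bytes are decoded, so it may not by itself help with a character such as U+FFFF, which falls outside the range XML 1.0 allows.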