LISTSERV 16.5 - CODE4LIB Archives

It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example).

You could try telling the parser explicitly to use UTF-8, like this:

parser = XMLParser(encoding="utf-8")

This approach is discussed in
http://www.daniweb.com/software-development/python/threads/435360/using-xml.etree-with-xml-files-containing-a-symbol
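
For a fuller picture, here is a minimal sketch of passing an explicit parser to the parse call. It uses the standard library's xml.etree.ElementTree so it runs without extra installs; lxml's etree.XMLParser accepts an encoding argument in the same way. The sample document and the variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document: UTF-8 bytes with one non-ASCII character.
xml_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<Page><String CONTENT="Caf\xc3\xa9" /></Page>'
)

# Tell the parser which encoding to use instead of relying on the declaration.
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(io.BytesIO(xml_bytes), parser=parser)

# Read back the attribute of the first <String> element.
content = tree.getroot()[0].get("CONTENT")
print(content)
```

With a file on disk you would pass the filename in place of the BytesIO object.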

There's also a bit of discussion about using lxml to parse UTF-8 in
http://stackoverflow.com/questions/3402520/is-there-a-way-to-force-lxml-to-parse-unicode-strings-that-specify-an-encoding-i
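
Separately, the error you quote names the character 0xFFFF. XML 1.0 does not allow that code point in a document whatever the encoding, and minidom does not check for it when writing. If forcing UTF-8 doesn't help, it may be worth stripping such characters before they go into the CONTENT attribute. Here is a rough sketch of that idea; strip_invalid_xml_chars and build_page are hypothetical names, and the word list stands in for your OCR text.

```python
import re
from xml.dom.minidom import Document, parseString

# Matches anything outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Drop characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)

def build_page(words):
    """Build a <Page> of <String CONTENT="..."/> elements (hypothetical helper)."""
    doc = Document()
    page = doc.createElement("Page")
    doc.appendChild(page)
    for content in words:
        word = doc.createElement("String")
        # Clean the OCR text before it reaches the document.
        word.setAttribute("CONTENT", strip_invalid_xml_chars(content))
        page.appendChild(word)
    return doc.toprettyxml(indent="  ", encoding="utf-8")

# Made-up OCR output containing a stray U+FFFF.
xml_bytes = build_page(["Club", "Off\uffffices"])
reparsed = parseString(xml_bytes)
```

Whether to drop the character or replace it with something visible (such as U+FFFD) depends on how you want OCR noise to appear downstream.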

Hope this helps!

Regards,

Stuart

On Mon, Mar 4, 2013 at 3:00 PM, Michael Beccaria wrote:

> I'm working on a project that takes the ocr data found in a pdf and places
> it in a custom xml file.
>
> I use Python scripts to create the xml file. Something like this (trimmed
> down a bit):
>
> from xml.dom.minidom import Document
> doc = Document()
>         Page = doc.createElement("Page")
>         doc.appendChild(Page)
>         f = StringIO(txt)
>         lines = f.readlines()
>         for line in lines:
>                 word = doc.createElement("String")
>                 ...
>                 word.setAttribute("CONTENT",content)
>                 Page.appendChild(word)
>         return doc.toprettyxml(indent="  ",encoding="utf-8")
>
>
> This creates a file, simply, that looks like this:
> <?xml version="1.0" encoding="utf-8"?>
> <Page HEIGHT="3296" WIDTH="2609">
>   <String CONTENT="BuffaloLaunch" />
>   <String CONTENT="Club" />
>   <String CONTENT="Offices" />
>   <String CONTENT="Installed" />
>   ...
> </Page>
>
> I am able to create this document and save it to an xml file without trouble.
> The problem occurs when I try to read it with the lxml library:
>
> from lxml import etree
> doc = etree.parse(filename)
>
>
> I am running across errors like "XMLSyntaxError: Char 0xFFFF out of
> allowed range, line 94, column 19", which, when I look at the file, is true:
> there is a 0xFFFF character in the content field.
>
> How can a file be created using minidom (which I assume would
> create a valid xml file) and then fail when parsed with lxml? What
> should I do to fix this on the encoding side so that errors don't show up
> on the parsing side?
> Thanks,
> Mike
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> Become a friend of Paul Smith's Library on Facebook today!
>