I'm working on a project that takes the OCR data found in a PDF and places it in a custom XML file.
I use Python scripts to create the XML file. Something like this (trimmed down a bit):
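from io import StringIO
from xml.dom.minidom import Document

# build_xml is a placeholder name; the real function header was trimmed
def build_xml(txt):
    doc = Document()
    Page = doc.createElement("Page")
    doc.appendChild(Page)
    f = StringIO(txt)
    lines = f.readlines()
    for line in lines:
        # one <String> element per line of OCR text
        word = doc.createElement("String")
        word.setAttribute("CONTENT", line.strip())
        Page.appendChild(word)
    return doc.toprettyxml(indent=" ", encoding="utf-8")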
This creates a file that, in simplified form, looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Page HEIGHT="3296" WIDTH="2609">
<String CONTENT="BuffaloLaunch" />
<String CONTENT="Club" />
<String CONTENT="Offices" />
<String CONTENT="Installed" />
I can create this document and save it to an XML file without trouble. The problem occurs when I try to read it with the lxml library:
from lxml import etree
doc = etree.parse(filename)
I am running into errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19". When I look at the file, the error is accurate: there is a 0xFFFF character in the CONTENT attribute.
How can a file be created with minidom (which I assumed would produce a valid XML file) and then fail when parsed with lxml? What should I do on the encoding side so that errors don't show up on the parsing side?
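A minimal way to reproduce the writing half of the problem, assuming Python 3 and using only minidom; the element and attribute names are copied from the sample above, and the "Club" value with a trailing U+FFFF is made up for illustration. It suggests that minidom writes the character straight through without checking it:

```python
from xml.dom.minidom import Document

# Build a tiny document whose attribute holds U+FFFF, a character
# that falls outside the XML 1.0 Char range.
doc = Document()
page = doc.createElement("Page")
doc.appendChild(page)
word = doc.createElement("String")
word.setAttribute("CONTENT", "Club\uffff")  # made-up test value
page.appendChild(word)

# minidom serializes the character as-is instead of raising an error.
xml_bytes = doc.toprettyxml(indent=" ", encoding="utf-8")
has_bad_char = "\uffff".encode("utf-8") in xml_bytes
print(has_bad_char)
```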
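One possible approach on the writing side, sketched here as an assumption rather than a confirmed fix, is to strip every character outside the XML 1.0 Char production before it reaches setAttribute. The helper name strip_invalid_xml_chars is hypothetical, and the sketch assumes Python 3 strings:

```python
import re

# Matches any character NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)

# Example: clean an OCR string before storing it in the document.
cleaned = strip_invalid_xml_chars("Club\uffff")
print(cleaned)
```

Dropping the characters silently loses whatever the OCR engine meant by them; replacing them with a visible placeholder such as "?" would be an alternative if that matters.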
Head of Digital Initiative
Paul Smith's College