

I'm working on a project that takes the OCR data found in a PDF and places it in a custom XML file.

I use Python scripts to create the XML file. Something like this (trimmed down a bit):

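from io import StringIO
from xml.dom.minidom import Document

# Function name is illustrative; the real script is longer.
def build_page_xml(txt):
    doc = Document()
    Page = doc.createElement("Page")
    doc.appendChild(Page)
    # One <String> element per line of OCR text.
    for line in StringIO(txt).readlines():
        word = doc.createElement("String")
        word.setAttribute("CONTENT", line.strip())
        Page.appendChild(word)
    return doc.toprettyxml(indent="  ", encoding="utf-8")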

This creates a simple file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Page HEIGHT="3296" WIDTH="2609">
  <String CONTENT="BuffaloLaunch" />
  <String CONTENT="Club" />
  <String CONTENT="Offices" />
  <String CONTENT="Installed" />

I am able to create this document and save it to an XML file without trouble. The problem occurs when I try to read it back using the lxml library:

from lxml import etree
doc = etree.parse(filename)

I am running into errors like "XMLSyntaxError: Char 0xFFFF out of allowed range, line 94, column 19". When I look at the file, the error is accurate: there is a 0xFFFF character in the CONTENT attribute.
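
For illustration, here is a minimal sketch of how such a character can end up in the file. It assumes (as far as I can tell) that minidom does not check text against the XML 1.0 character rules when it serializes; the sample value is made up and the element names simply mirror the file above.

```python
from xml.dom.minidom import Document

# U+FFFF is a noncharacter that XML 1.0 does not allow in documents.
BAD_CHAR = "\uffff"

doc = Document()
page = doc.createElement("Page")
doc.appendChild(page)
word = doc.createElement("String")
word.setAttribute("CONTENT", "Club" + BAD_CHAR)  # made-up sample value
page.appendChild(word)

xml_bytes = doc.toprettyxml(indent="  ", encoding="utf-8")

# minidom writes the character out as-is, with no error raised.
print(BAD_CHAR.encode("utf-8") in xml_bytes)
```

A stricter parser reading these bytes would then be the first place the problem surfaces.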

How can a file be created with minidom (which I assumed would produce valid XML) and then fail to parse with lxml? What should I do on the writing side so that these errors don't show up on the parsing side?
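
One possible approach on the writing side, sketched under the assumption that the stray characters carry no meaning and can simply be dropped: filter the OCR text against the XML 1.0 "Char" production before handing it to minidom. The function name strip_invalid_xml_chars is hypothetical, not part of any library.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed.

    Hypothetical helper; dropping the characters (rather than
    replacing them) is an assumption about what the data needs.
    """
    return _INVALID_XML_CHARS.sub("", text)
```

This could be applied to each line before it is set as the CONTENT attribute, so that the saved file never contains a character the parser will reject.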

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College