On Thu, Mar 7, 2013 at 10:49 AM, Michael Beccaria
<[log in to unmask]>wrote:
> I ended up doing a regular expression find and replace function to replace
> all illegal xml characters with a dash or something.
:(
A string translation map might be a better approach. Here's what I do as
one part of a general purpose text "cleanup" method.
{{{
illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
(0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
(0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
(0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
(0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
(0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
(0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
(0x10FFFE, 0x10FFFF) ]
tmap = dict.fromkeys(r for start, end in illegal_unichrs for r in
range(start, end+1))
...
text = text.translate(tmap)
}}}
See the str.translate() method at
http://docs.python.org/2/library/stdtypes.html#string-methods
--jay
|