On Thu, Mar 7, 2013 at 10:49 AM, Michael Beccaria wrote:
> I ended up doing a regular expression find and replace function to replace
> all illegal xml characters with a dash or something. :(

A string translation map might be a better approach. Here's what I do as
one part of a general-purpose text "cleanup" method.

{{{
# Inclusive (start, end) ranges of code points to strip from the text.
illegal_unichrs = [
    (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
    (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
    (0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
    (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
    (0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
    (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
    (0x10FFFE, 0x10FFFF)
]

# Map every code point in those ranges to None; translate() deletes
# any character whose code point maps to None.
tmap = dict.fromkeys(r for start, end in illegal_unichrs
                     for r in range(start, end+1))
...
# In Python 2, text must be a unicode object (not a byte str) for a
# dict-based table to work.
text = text.translate(tmap)
}}}

See the str.translate() method at
http://docs.python.org/2/library/stdtypes.html#string-methods

--jay
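
The map above deletes the offending characters. If a visible placeholder is wanted instead (the "dash or something" from the original question), the same table can carry a replacement value. What follows is only a minimal sketch, assuming Python 3, where str.translate() accepts a dict keyed by code point. The names ILLEGAL_RANGES and build_translation_map are made up for illustration, and the per-plane noncharacter ranges are generated in a loop rather than listed by hand.

```python
# Sketch only: ILLEGAL_RANGES and build_translation_map are hypothetical names.
# The ranges mirror the list in the message above.
ILLEGAL_RANGES = [
    (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
]
# Last two code points of planes 1 through 16 (0x1FFFE ... 0x10FFFF).
ILLEGAL_RANGES += [
    (plane << 16 | 0xFFFE, plane << 16 | 0xFFFF) for plane in range(1, 0x11)
]


def build_translation_map(replacement=None):
    """Map each listed code point to `replacement`; None means delete."""
    return dict.fromkeys(
        (cp for start, end in ILLEGAL_RANGES for cp in range(start, end + 1)),
        replacement,
    )


delete_map = build_translation_map()      # strip the characters
dash_map = build_translation_map("-")     # swap each one for a dash

sample = "bad\x00data\x1fhere\ufffe"
cleaned = sample.translate(delete_map)    # "baddatahere"
dashed = sample.translate(dash_map)       # "bad-data-here-"
```

Tab (0x09) and newline (0x0A) are outside the ranges, so they pass through untouched. The (0x0B, 0x1F) range does cover carriage return (0x0D), which XML 1.0 itself permits, so split that range if CRs need to survive.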