xml input sanitizing method in standard lib?

Stefan Behnel stefan_ml at behnel.de
Tue Mar 10 03:40:10 EDT 2009


Gabriel Genellina wrote:
> En Mon, 09 Mar 2009 15:30:31 -0200, Petr Muller <afri at afri.cz> escribió:
> 
>> Thanks for response and sorry for I wasn't clear first time. I have a
>> heap of data (logs), from which I build a XML document using
>> xml.dom.minidom. In this data, some xml invalid characters may occur -
>> form feed (\x0c) character is one example.
>>
>> I don't know what else is illegal in xml, so I've searched if there's
>> some method how to prepare strings for insertion to a xml doc before I
>> start research on a xml spec and write such function on my own.
> 
> You don't have to; Python already comes with xml support. Using
> ElementTree to build the document is usually easier and faster:
> http://effbot.org/zone/element-index.htm

While I usually second that, this isn't the problem here. This thread is
about unallowed characters in XML. The set of allowed characters is defined
here:

http://www.w3.org/TR/xml/#charsets

And, as Terry Reedy pointed out, the "unicode.translate" method should get
you where you want. Just define a dict that maps the characters that you
want to remove to whatever character you want to use instead (or None) and
pass that into .translate().

Stefan



More information about the Python-list mailing list