Preventing control characters from entering an XML file

Sun Jan 1 15:07:56 EST 2006

Hi list,

First of all, I wish you all a happy 2006. I have a small question that 
googling didn't turn up an answer for. So hopefully you'll be kind 
enough to send me in the right direction.

I'm developing a desktop application, called Task Coach, that saves its 
domain objects (tasks, mostly :-) in an XML file. Users have reported 
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a 
task description. Looking at the 'corrupted' file showed that control 
characters ended up in the XML file (Control-K for example). Task Coach 
uses xml.dom to create an XML document and save it, like this:

class XMLWriter:
   ...

   def write(self, taskList):
     domImplementation = xml.dom.getDOMImplementation()
     self.document = domImplementation.createDocument(None, 'tasks',
                                                      None)
     ...
     for task in taskList.rootTasks():
       self.document.documentElement.appendChild(self.taskNode(task))
     self.document.writexml(self.__fd) # __fd is a file open for writing

   ...

Apparently, the writexml method of xml.dom (which comes from 
xml.dom.minidom if pyxml is not installed I think) does not feel that 
writing control characters in an XML file is wrong, but the parser does:

Traceback (most recent call last):
...
   File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line 
207, in parseFile
     parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77, 
column 147

Rightfully so, because ^K is not valid XML 1.0, according to 
http://www.w3.org/TR/REC-xml/:

"Legal characters are tab, carriage return, line feed, and the legal 
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML 
processors MUST accept any character in the range specified for Char.

Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]"

So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object 
returned by domImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably build-in) way of 
checking a unicode string for control characters and weeding those 
characters out?

Thanks, Frank