Preventing control characters from entering an XML file
Frank Niessink
frank at niessink.com
Sun Jan 1 15:07:56 EST 2006
Hi list,
First of all, I wish you all a happy 2006. I have a small question that
googling didn't answer, so hopefully you'll be kind enough to point me
in the right direction.
I'm developing a desktop application, called Task Coach, that saves its
domain objects (tasks, mostly :-) in an XML file. Users have reported
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a
task description. Looking at the 'corrupted' file showed that control
characters ended up in the XML file (Control-K for example). Task Coach
uses xml.dom to create an XML document and save it, like this:
class XMLWriter:
    ...
    def write(self, taskList):
        domImplementation = xml.dom.getDOMImplementation()
        self.document = domImplementation.createDocument(None, 'tasks', None)
        ...
        for task in taskList.rootTasks():
            self.document.documentElement.appendChild(self.taskNode(task))
        self.document.writexml(self.__fd)  # __fd is a file open for writing
        ...
Apparently, the writexml method of xml.dom (which I think comes from
xml.dom.minidom if PyXML is not installed) does not consider writing
control characters to an XML file an error, but the parser does:
Traceback (most recent call last):
  ...
  File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77, column 147
Rightfully so, because ^K (U+000B) is not a legal character in XML 1.0,
according to http://www.w3.org/TR/REC-xml/:
"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
processors MUST accept any character in the range specified for Char.
Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"
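The mismatch between writer and parser described above can be reproduced
with a short script. This is only a minimal sketch, assuming Python 3 and
the standard-library minidom implementation rather than the Python 2.4
setup from the traceback; the helper name roundtrip_fails and the 'task'
element are hypothetical and not taken from Task Coach.

```python
import io
import xml.dom
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def roundtrip_fails(text):
    """Return True if minidom writes `text` but cannot parse its own output.

    Hypothetical helper for illustration only.
    """
    implementation = xml.dom.getDOMImplementation()
    document = implementation.createDocument(None, 'tasks', None)
    node = document.createElement('task')  # illustrative element name
    node.appendChild(document.createTextNode(text))
    document.documentElement.appendChild(node)

    # Serialize into memory instead of a real file.
    buffer = io.StringIO()
    document.writexml(buffer)

    # Try to read back what was just written.
    try:
        xml.dom.minidom.parseString(buffer.getvalue())
    except ExpatError:
        return True
    return False
```

Calling roundtrip_fails('copied\x0btext') exercises the Control-K case,
while a plain string such as 'plain text' should round-trip cleanly.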
So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object
returned by xml.dom.getDOMImplementation() allow control characters?
Isn't that a bug?
- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?
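For the second question, one possible approach is a regular expression
that matches everything outside the Char production quoted above and
removes it. The following is a minimal sketch, assuming Python 3 (where
str is Unicode); the names ILLEGAL_XML_CHARS and strip_illegal_xml_chars
are illustrative and do not come from any library.

```python
import re

# Negated character class built from the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def strip_illegal_xml_chars(text):
    """Return `text` with every character outside Char removed."""
    return ILLEGAL_XML_CHARS.sub('', text)


def has_illegal_xml_chars(text):
    """Return True if `text` contains a character outside Char."""
    return ILLEGAL_XML_CHARS.search(text) is not None
```

Applying such a filter to text before it is put into a DOM node would
keep tab, line feed and carriage return while dropping characters like
Control-K. Whether to drop them silently or replace them with a
placeholder is a design choice left open here.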
Thanks, Frank