[XML-SIG] Problems with "ignorable whitespace" in python's minidom and pulldom !

Thu Mar 11 05:24:07 EST 2004

Arno Wilhelm <quirxi at aon.at> wrote:

 > The problem is that minidom seems to interpret these white spaces
 > as text nodes

That's correct. That's what the DOM specification says should be done by 
default.
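
To see this for yourself, here's a minimal sketch (the sample document and the names SAMPLE and kinds are made up for illustration) that lists what minidom creates for a small indented document:

```python
from xml.dom import minidom, Node

# Hypothetical sample: two elements, indented the way a human would write them.
SAMPLE = "<root>\n  <item>a</item>\n  <item>b</item>\n</root>"

doc = minidom.parseString(SAMPLE)
children = doc.documentElement.childNodes

# Label each child of <root>: "text" for text nodes, tag name otherwise.
kinds = ["text" if n.nodeType == Node.TEXT_NODE else n.nodeName
         for n in children]
print(kinds)  # ['text', 'item', 'text', 'item', 'text']
```

The line breaks and indentation between the <item> elements each show up as a text node of their own.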

In DOM Level 3 Load/Save there is a DOMConfiguration parameter, 
'element-content-whitespace', which can be used to filter out ignorable 
whitespace at parse-time. However, minidom does not yet support DOM 3 LS.

(Plug detour:) pxdom does support this, but like minidom it does not 
(yet) read external entities such as the DTD external subset. So unless 
you're putting <!ELEMENT> declarations in the internal subset of the 
<!DOCTYPE>, neither parser can tell which elements contain 'element 
content', and in that case whitespace is, by design, not treated as 
'ignorable'.

A workaround - and the only way to do it if you're not using DTDs anyway 
- is to tell pxdom to assume all undefined elements contain 'element 
content'. Hence the following example would give you a document free of 
whitespace nodes:

   import pxdom
   doc= pxdom.parse('filename.xml', {
     'element-content-whitespace': False,
     'pxdom-assume-element-content': True
   })

(End plug detour.)

 > I cannot know in before how many of these "text nodes" are in
 > between the real data nodes.

DOM specifies that the document text nodes will be in 'normal' form 
after parsing, so you can be sure it'll be 0 or 1, no more. (Unless 
you're using a *really* old minidom where this may not hold true.)
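
If you want to check that on your own minidom, something like this should do. It's only a sketch: max_text_run is a helper name I've made up, and the sample document is arbitrary.

```python
from xml.dom import minidom, Node

def max_text_run(parent):
    """Longest run of consecutive text-node siblings anywhere under parent."""
    longest = run = 0
    for child in parent.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
            longest = max(longest, max_text_run(child))
    return longest

doc = minidom.parseString(
    "<root>\n  <item>a &amp; b</item>\n  <item/>\n</root>")
after_parse = max_text_run(doc)        # 1: never two text nodes in a row

# Adjacent text nodes only appear once you start editing the tree by hand...
root = doc.documentElement
root.appendChild(doc.createTextNode("x"))
root.appendChild(doc.createTextNode("y"))
after_edit = max_text_run(doc)         # 3: trailing newline, "x", "y"

# ...and normalize() merges them back into one.
root.normalize()
after_normalize = max_text_run(doc)    # 1

print(after_parse, after_edit, after_normalize)
```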

 > So I tried to find a way of getting rid of these unwanted text nodes
 > with this piece of code but that did not help either:

 > def cleanUpNodes( nodes ):
 >     for node in nodes.childNodes:
 >         if node.nodeType == Node.TEXT_NODE:
 >             node.data = string.strip(node.data)
 >     nodes.normalize()

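That should work; you'd just need to make it recursive so it handles the 
whole subtree, not just the immediate children. Bear in mind it also 
strips leading and trailing whitespace from text you want to keep. 
Here's another version that only removes whitespace-only nodes: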

   from xml.dom import Node

   def removeWhitespaceNodes(parent):
     # iterate over a copy, since we remove children as we go
     for child in list(parent.childNodes):
       if child.nodeType==Node.TEXT_NODE and child.data.strip()=='':
         parent.removeChild(child)
       else:
         removeWhitespaceNodes(child)
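
For a quick try-out against minidom, here's a self-contained sketch. The function is restated (under the made-up name strip_whitespace_nodes) only so the snippet runs on its own, and the sample document is invented for illustration.

```python
from xml.dom import minidom, Node

def strip_whitespace_nodes(parent):
    # Iterate over a copy: children are removed from the live list as we go.
    for child in list(parent.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == '':
            parent.removeChild(child)
        else:
            strip_whitespace_nodes(child)

doc = minidom.parseString(
    "<root>\n  <item>a</item>\n  <item> b </item>\n</root>")
strip_whitespace_nodes(doc)

result = doc.documentElement.toxml()
print(result)  # <root><item>a</item><item> b </item></root>
```

Note that ' b ' keeps its surrounding spaces: only text nodes made up entirely of whitespace are dropped.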

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/