[XML-SIG] Problems with "ignorable whitespace" in python's minidom and pulldom !

Thu Mar 11 17:21:40 EST 2004

Hello Andrew,

Thanks for your answer. After doing some research on the internet I found 
out that you are the author of the Python pxdom module. How does pxdom compare to 
the standard DOM and minidom implementations shipped with Python itself? Can it 
already be used in production environments? How fast is it when parsing 
larger documents? I have read that the next version, 1.1, will also support 
external resource resolution and loading. Does that mean it can also load 
external XML files that the actual XML document links to by URL?

regards,

Arno Wilhelm

> Arno Wilhelm <quirxi at aon.at> wrote:
> 
>  > The problem is that minidom seems to interpret these white spaces
>  > as text nodes
> 
> That's correct. That's what the DOM specification says should be done by 
> default.
> 
> In DOM Level 3 Load/Save there is a DOMConfiguration parameter 
> 'element-content-whitespace' which can be used to filter out ignorable 
> whitespace at parse-time. However minidom does not yet support DOM 3 LS.
> 
> (Plug detour:) pxdom does support this, but like minidom it does not 
> (yet) read external entities such as the DTD external subset, so unless 
> you're putting <!ELEMENT> declarations in the internal subset of the 
> <!DOCTYPE> they won't be able to tell which elements contain 'element 
> content'; in this case whitespace is not 'ignorable' by design.
> 
> A workaround - and the only way to do it if you're not using DTDs anyway 
> - is to tell pxdom to assume all undefined elements contain 'element 
> content'. Hence the following example would give you a document free of 
> whitespace nodes:
> 
>   import pxdom
>   doc= pxdom.parse('filename.xml', {
>     'element-content-whitespace': False,
>     'pxdom-assume-element-content': True
>   })
> 
> (End plug detour.)
> 
>  > I cannot know beforehand how many of these "text nodes" are in
>  > between the real data nodes.
> 
> DOM specifies that the document text nodes will be in 'normal' form 
> after parsing, so you can be sure it'll be 0 or 1, no more. (Unless 
> you're using a *really* old minidom where this may not hold true.)
> 
>  > So I tried to find a way of getting rid of these unwanted text nodes
>  > with this piece of code but that did not help either:
> 
>  > def cleanUpNodes( nodes ):
>  >     for node in nodes.childNodes:
>  >         if node.nodeType == Node.TEXT_NODE:
>  >             node.data = string.strip(node.data)
>  >     nodes.normalize()
> 
> That should work, you'd just need to make it recursive so it does the 
> whole subtree not just the immediate children. Here's another version:
> 
>   def removeWhitespaceNodes(parent):
>     # iterate over a copy, since removeChild() changes parent.childNodes
>     for child in list(parent.childNodes):
>       if child.nodeType==child.TEXT_NODE and child.data.strip()=='':
>         parent.removeChild(child)
>       else:
>         removeWhitespaceNodes(child)
>
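
For reference, here is a minimal, self-contained sketch of the recursive
approach quoted above, run against the standard library's minidom. The
sample document and the name remove_whitespace_nodes are made up for
illustration and do not come from the original messages. Note that it only
removes text nodes consisting entirely of whitespace; leading and trailing
whitespace inside real data is left alone.

```python
from xml.dom import Node, minidom


def remove_whitespace_nodes(parent):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because removeChild() mutates parent.childNodes.
    for child in list(parent.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == '':
            parent.removeChild(child)
        else:
            remove_whitespace_nodes(child)


# Hypothetical sample input, indented the way hand-written XML usually is.
SAMPLE = """<root>
  <item>one</item>
  <item> two </item>
</root>"""

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

before = len(root.childNodes)   # whitespace text nodes plus the two <item>s
remove_whitespace_nodes(doc)
after = len(root.childNodes)    # only the two <item> elements remain

print(before, after)
```

After the call, root.childNodes holds just the two item elements, so they
can be indexed directly without skipping over whitespace nodes first.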