[XML-SIG] Problems with "ignorable whitespace" in Python's minidom and pulldom!
Arno Wilhelm
quirxi at aon.at
Thu Mar 11 17:21:40 EST 2004
Hello Andrew,
thanks for your answer. After some research on the internet I found out that
you are the author of the Python pxdom module. How does pxdom compare to the
standard DOM and minidom implementations shipped with Python itself? Can it
already be used in production environments? How fast is it when parsing
larger documents? I have read that the next version, 1.1, will also support
external resource resolution and loading. Does that mean it can also load
external XML files that the main document references by URL?
regards,
Arno Wilhelm
> Arno Wilhelm <quirxi at aon.at> wrote:
>
> > The problem is that minidom seems to interpret these white spaces
> > as text nodes
>
> That's correct. That's what the DOM specification says should be done by
> default.
>
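For illustration, a minimal sketch of that default behaviour with the standard
library's minidom. The sample document is made up for this example; it only
shows that the indentation between elements comes back as text nodes.

```python
from xml.dom import minidom, Node

# Hypothetical sample document; the indentation between elements is plain whitespace.
doc = minidom.parseString("<root>\n  <item>a</item>\n  <item>b</item>\n</root>")
children = list(doc.documentElement.childNodes)

# Whitespace-only text nodes sit before, between and after the two <item> elements.
whitespace_nodes = [n for n in children
                    if n.nodeType == Node.TEXT_NODE and n.data.strip() == ""]
element_nodes = [n for n in children if n.nodeType == Node.ELEMENT_NODE]

print(len(children), len(whitespace_nodes), len(element_nodes))
```
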
> In DOM Level 3 Load/Save there is a DOMConfiguration parameter
> 'element-content-whitespace' which can be used to filter out ignorable
> whitespace at parse-time. However minidom does not yet support DOM 3 LS.
>
> (Plug detour:) pxdom does support this, but like minidom it does not
> (yet) read external entities such as the DTD external subset, so unless
> you're putting <!ELEMENT> declarations in the internal subset of the
> <!DOCTYPE> they won't be able to tell which elements contain 'element
> content'; in this case whitespace is not 'ignorable' by design.
>
> A workaround - and the only way to do it if you're not using DTDs anyway
> - is to tell pxdom to assume all undefined elements contain 'element
> content'. Hence the following example would give you a document free of
> whitespace nodes:
>
>     import pxdom
>     doc = pxdom.parse('filename.xml', {
>         'element-content-whitespace': False,
>         'pxdom-assume-element-content': True
>     })
>
> (End plug detour.)
>
> > I cannot know in advance how many of these "text nodes" are in
> > between the real data nodes.
>
> DOM specifies that the document text nodes will be in 'normal' form
> after parsing, so you can be sure it'll be 0 or 1, no more. (Unless
> you're using a *really* old minidom where this may not hold true.)
>
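A small check of the "0 or 1" claim against a current minidom, as a sketch
only. The sample document and the helper name is_text are assumptions made up
for this example; the check looks for two text nodes sitting directly next to
each other.

```python
from xml.dom import minidom, Node

# Hypothetical sample with several blank lines between the two elements.
doc = minidom.parseString("<root>\n  <a>1</a>\n\n  \n  <b>2</b>\n</root>")
children = list(doc.documentElement.childNodes)

def is_text(node):
    # Helper for this sketch: true for plain text nodes only.
    return node.nodeType == Node.TEXT_NODE

# Collect every pair of neighbouring siblings that are both text nodes.
adjacent_text_pairs = [(left, right)
                       for left, right in zip(children, children[1:])
                       if is_text(left) and is_text(right)]

print(len(children), len(adjacent_text_pairs))
```
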
> > So I tried to find a way of getting rid of these unwanted text nodes
> > with this piece of code but that did not help either:
>
> >     def cleanUpNodes( nodes ):
> >         for node in nodes.childNodes:
> >             if node.nodeType == Node.TEXT_NODE:
> >                 node.data = string.strip(node.data)
> >         nodes.normalize()
>
> That should work, you'd just need to make it recursive so it does the
> whole subtree not just the immediate children. Here's another version:
>
>     def removeWhitespaceNodes(parent):
>         # Iterate over a copy, since removeChild() changes parent.childNodes.
>         for child in list(parent.childNodes):
>             if child.nodeType == child.TEXT_NODE and child.data.strip() == '':
>                 parent.removeChild(child)
>             else:
>                 removeWhitespaceNodes(child)
>
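A runnable sketch of the recursive approach using only the standard library's
minidom. The function name remove_whitespace_nodes and the sample document are
assumptions for this example. Text nodes that contain real data keep their
surrounding spaces; only whitespace-only nodes are dropped.

```python
from xml.dom import minidom, Node

def remove_whitespace_nodes(parent):
    """Recursively remove whitespace-only text nodes below parent."""
    # Iterate over a copy, since removeChild() changes parent.childNodes.
    for child in list(parent.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == "":
            parent.removeChild(child)
        else:
            remove_whitespace_nodes(child)

# Hypothetical sample document with indentation between the elements.
doc = minidom.parseString("<root>\n  <item>a</item>\n  <item> b </item>\n</root>")
remove_whitespace_nodes(doc)
result = doc.documentElement.toxml()
print(result)
```
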