[XML-SIG] Problems with "ignorable whitespace" in Python's minidom and pulldom!
Andrew Clover
and-xml at doxdesk.com
Thu Mar 11 05:24:07 EST 2004
Arno Wilhelm <quirxi at aon.at> wrote:
> The problem is that minidom seems to interpret these white spaces
> as text nodes
That's correct. That's what the DOM specification says should be done by
default.
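As a quick illustration of that default (a minimal sketch; the sample document is made up for the purpose), the line breaks and indentation between elements show up as '#text' children:

```python
from xml.dom import minidom

# Hypothetical sample document: two empty elements, indented on their own lines.
doc = minidom.parseString("<root>\n  <a/>\n  <b/>\n</root>")

# Every run of whitespace between tags becomes its own text node.
kinds = [child.nodeName for child in doc.documentElement.childNodes]
print(kinds)  # ['#text', 'a', '#text', 'b', '#text']
```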
In DOM Level 3 Load/Save there is a DOMConfiguration parameter
'element-content-whitespace' which can be used to filter out ignorable
whitespace at parse-time. However minidom does not yet support DOM 3 LS.
(Plug detour:) pxdom does support this, but like minidom it does not
(yet) read external entities such as the DTD external subset. So unless
you put <!ELEMENT> declarations in the internal subset of the
<!DOCTYPE>, neither parser can tell which elements contain 'element
content', and in that case the whitespace is, by design, not 'ignorable'.
A workaround - and the only way to do it if you're not using DTDs anyway
- is to tell pxdom to assume all undefined elements contain 'element
content'. Hence the following example would give you a document free of
whitespace nodes:
    import pxdom
    doc = pxdom.parse('filename.xml', {
        'element-content-whitespace': False,
        'pxdom-assume-element-content': True
    })
(End plug detour.)
> I cannot know in before how many of these "text nodes" are in
> between the real data nodes.
DOM specifies that the document's text nodes will be in 'normal' form
after parsing, so between any two adjacent elements there will be 0 or 1
text nodes, never more. (Unless you're using a *really* old minidom,
where this may not hold true.)
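If you want to check this on your own data, something like the following should do. It is only a sketch: the helper name max_text_run and the sample document are invented here, and it assumes a reasonably current minidom.

```python
from xml.dom import minidom, Node

def max_text_run(parent):
    """Return the longest run of consecutive text-node siblings under parent."""
    longest = run = 0
    for child in parent.childNodes:
        # Extend the current run on a text node, reset it on anything else.
        run = run + 1 if child.nodeType == Node.TEXT_NODE else 0
        longest = max(longest, run)
    return longest

# Hypothetical sample mixing whitespace, real text and an entity reference.
doc = minidom.parseString("<root>\n\n  <a/> text &amp; more \n <b/><c/>\n</root>")
longest_run = max_text_run(doc.documentElement)
```

A result greater than 1 would mean the parser left adjacent text nodes unmerged.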
> So I tried to find a way of getting rid of these unwanted text nodes
> with this piece of code but that did not help either:
> def cleanUpNodes( nodes ):
>     for node in nodes.childNodes:
>         if node.nodeType == Node.TEXT_NODE:
>             node.data = string.strip(node.data)
>     nodes.normalize()
That should work; you'd just need to make it recursive so it handles the
whole subtree, not just the immediate children. Here's another version:
    def removeWhitespaceNodes(parent):
        # Iterate over a copy, since removeChild() mutates parent.childNodes.
        for child in list(parent.childNodes):
            if child.nodeType==child.TEXT_NODE and child.data.strip()=='':
                parent.removeChild(child)
            else:
                removeWhitespaceNodes(child)
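For comparison, here is roughly what the strip-and-normalize approach from the question might look like once made recursive. Treat it as a sketch rather than a drop-in: the name strip_text_nodes and the sample document are made up, and it relies on current minidom behaviour where normalize() descends into child elements and discards empty text nodes. Note that, unlike the version above, it also trims whitespace around real text content.

```python
from xml.dom import minidom, Node

def strip_text_nodes(parent):
    """Recursively strip leading/trailing whitespace from every text node."""
    for child in parent.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            child.data = child.data.strip()
        else:
            strip_text_nodes(child)

# Hypothetical sample document with indentation and padded text content.
doc = minidom.parseString("<root>\n  <item> one </item>\n  <item>two</item>\n</root>")
root = doc.documentElement

strip_text_nodes(root)
root.normalize()  # drops the text nodes that are now empty

names = [child.nodeName for child in root.childNodes]
first_text = root.firstChild.firstChild.data
```

After this, root has only the two item elements as children, and the first item's text has lost its surrounding spaces.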
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/