Problem round-tripping with xml.dom.minidom pretty-printer

Robert Bossy Robert.Bossy at jouy.inra.fr
Fri Feb 29 11:55:06 EST 2008


Ben Butler-Cole wrote:
> Hello
>
> I have run into a problem using minidom. I have an HTML file that I
> want to make occasional, automated changes to (adding new links). My
> strategy is to parse it with minidom, add a node, pretty print it and
> write it back to disk.
>
> However I find that every time I do a round trip minidom's pretty
> printer puts extra blank lines around every element, so my file grows
> without limit. I have found that normalizing the document doesn't make
> any difference. Obviously I can fix the problem by doing without the
> pretty-printing, but I don't really like producing non-human readable
> HTML.
>
> Here is some code that shows the behaviour:
>
>     import xml.dom.minidom as dom
>     def p(t):
>         d = dom.parseString(t)
>         d.normalize()
>         t2 = d.toprettyxml()
>         print t2
>         p(t2)
>     p('<a><b><c/></b></a>')
>
> Does anyone know how to fix this behaviour? If not, can anyone
> recommend an alternative XML tool for simple tasks like this?

Hi,

The last line of p() calls p() again: it is an unconditional recursive
call, so it never stops on its own (in practice it runs until Python
hits its recursion limit). And since p() also prints its result on every
call, the output just keeps coming. If you remove that line, you get
something like:

<?xml version="1.0" ?>
<a>
        <b>
                <c/>
        </b>
</a>

That seems sensible, imo. Was that what you wanted?
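
For reference, here is a minimal sketch of p() with the recursive call
removed. It is written for Python 3 (print as a function, where the
original uses the Python 2 print statement), and the name pretty() is
only an illustrative choice, not something from the original code.

```python
import xml.dom.minidom as dom

def pretty(text):
    """Parse text once and return its pretty-printed form (no recursion)."""
    d = dom.parseString(text)
    d.normalize()            # kept from the original; only merges adjacent text nodes
    return d.toprettyxml()   # defaults: tab indentation, "\n" line separator

print(pretty('<a><b><c/></b></a>'))
```

Returning the string instead of printing inside the function makes it
easier to write the result back to disk afterwards.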

Another thing to keep in mind is that toprettyxml does not produce XML
identical to the original DOM tree: it adds newlines and tabs. When the
output is parsed again, these whitespace characters end up in the DOM
tree as text nodes. If you run toprettyxml on a document twice in a row
(re-parsing in between), the second pass adds newlines and tabs around
the newlines and tabs added by the first. Since you call toprettyxml an
unbounded number of times, it is expected that lots of blank characters
appear.
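
The effect can be observed directly with a small sketch like the one
below. The helper count_blank_text_nodes() is hypothetical (it is not
part of minidom); it just walks the tree and counts text nodes that
contain only whitespace.

```python
import xml.dom.minidom as dom

def count_blank_text_nodes(node):
    """Count whitespace-only text nodes below node (hypothetical helper)."""
    total = 0
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            total += 1
        total += count_blank_text_nodes(child)
    return total

original = dom.parseString('<a><b><c/></b></a>')
first = original.toprettyxml()

# Parse the pretty-printed text again: the indentation is now in the tree.
reparsed = dom.parseString(first)
second = reparsed.toprettyxml()

print(count_blank_text_nodes(original))   # compact input: no text nodes at all
print(count_blank_text_nodes(reparsed))   # indentation shows up as text nodes
print(len(first), len(second))            # the second pass is longer than the first
```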

Finally, normalize() is supposed to merge consecutive sibling text
nodes; it never removes character content, even if that content is
blank. That means that several adjacent text nodes are replaced by a
single one whose content is the concatenation of the contents of the
original nodes. Clear enough?
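
A short sketch of both points, under one loud assumption: that
whitespace-only text in your document carries no meaning. That is not
always true for HTML (think of <pre>, or spaces between inline
elements), so treat this as an illustration, not a general fix. The
helpers strip_blank_text() and stable_pretty() are hypothetical names,
not part of minidom.

```python
import xml.dom.minidom as dom

# normalize() merges adjacent text nodes but keeps their (blank) content.
doc = dom.parseString('<a/>')
root = doc.documentElement
root.appendChild(doc.createTextNode('\n'))
root.appendChild(doc.createTextNode('\t'))
before = len(root.childNodes)    # two adjacent text nodes
doc.normalize()
after = len(root.childNodes)     # merged into a single text node
merged = root.firstChild.data    # concatenation of both contents

def strip_blank_text(node):
    """Remove whitespace-only text nodes below node (hypothetical helper).

    Assumes such nodes are insignificant, which may be wrong for HTML.
    """
    for child in list(node.childNodes):   # copy: the list is mutated below
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_blank_text(child)

def stable_pretty(text):
    """Pretty-print text after dropping whitespace-only text nodes."""
    d = dom.parseString(text)
    strip_blank_text(d)
    return d.toprettyxml()

once = stable_pretty('<a><b><c/></b></a>')
twice = stable_pretty(once)
print(before, after, repr(merged))
print(once == twice)   # the round trip no longer grows
```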

Cheers,
RB


