XML parser that sorts elements?

Fri Sep 22 12:03:06 EDT 2006

<jmike at alum.mit.edu> wrote in message 
news:1158939240.213404.239690 at m73g2000cwd.googlegroups.com...
> Hi everyone,
>
> I am a total newbie to XML parsing.  I've written a couple of toy
> examples under the instruction of tutorials available on the web.
>
> The problem I want to solve is this.  I have an XML snippet (in a
> string) that looks like this:
>
> <booga foo="1" bar="2">
>  <well>hello</well>
>  <blah>goodbye</blah>
> </booga>
>
> and I want to alphabetize not only the attributes of an element, but I
> also want to alphabetize the elements in the same scope:
>
> <booga bar="2" foo="1">
>  <blah>goodbye</blah>
>  <well>hello</well>
> </booga>
>
> I've found a "Canonizer" class, that subclasses saxlib.HandlerBase, and
> played around with it and vaguely understand what it's doing.  But what
> I get out of it is
>
> <booga bar="2" foo="1">
>  <well>hello</well>
>  <blah>goodbye</blah>
> </booga>
>
> in other words it sorts the attributes of each element, but doesn't
> touch the order of the elements.
>
> How can I sort the elements?  I think I want to subclass the parser, to
> present the elements to the content handler in different order, but I
> couldn't immediately find any examples of the parser being subclassed.
>

I suspect that Canonizer doesn't sort nested elements because some schemas 
require elements to be in a particular order, and not necessarily an 
alphabetical one.

Here is a snippet from an interactive Python session, working with the 
"batteries included" xml.dom.minidom.  The solution is not necessarily in 
the parser; it may instead lie in what you do with the parsed document 
object.

This is not a solution to your actual problem, but I hope it gives you 
enough to work with to find your own solution.

HTH,
-- Paul

>>> xmlsrc = """<booga foo="1" bar="2">
...   <well>hello</well>
...   <blah>goodbye</blah>
... </booga>
... """
>>> import xml.dom.minidom
>>> doc = xml.dom.minidom.parseString(xmlsrc)
>>> doc.childNodes
[<DOM Element: booga at 0x9e8508>]
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">

        <well>
                hello
        </well>

        <blah>
                goodbye
        </blah>

</booga>

>>> [n.nodeName for n in doc.childNodes]
[u'booga']
>>> [n.nodeName for n in doc.childNodes[0].childNodes]
['#text', u'well', '#text', u'blah', '#text']
>>> [n.nodeName for n in doc.childNodes[0].childNodes if n.nodeType == doc.ELEMENT_NODE]
[u'well', u'blah']
>>> doc.childNodes[0].childNodes = sorted(doc.childNodes[0].childNodes,key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">

        <blah>
                goodbye
        </blah>
        <well>
                hello
        </well>
</booga>

>>> doc.childNodes[0].childNodes = sorted([n for n in doc.childNodes[0].childNodes if n.nodeType == doc.ELEMENT_NODE],key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">
        <blah>
                goodbye
        </blah>
        <well>
                hello
        </well>
</booga>

>>>
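
The session above only reorders the children of the top-level element, and it 
assigns a plain list to childNodes, which skips minidom's own bookkeeping for 
parent and sibling links.  As a rough sketch of how the same idea might be 
applied at every nesting level, here is a Python 3 version that goes through 
removeChild/appendChild instead.  The function name sort_children is made up 
for illustration, and the sketch assumes data-style XML: whitespace-only text 
between child elements is discarded, and any element with mixed content or 
comments is left in its original order.  Python 3.8 and later write attributes 
in insertion order rather than sorted, so the attributes are re-inserted in 
sorted order as well.

```python
import xml.dom.minidom


def sort_children(node):
    """Recursively sort attributes and child elements of `node` by name.

    Hypothetical helper, not part of minidom. Assumes `node` is an Element.
    """
    # Re-insert attributes in sorted order (the serializer in Python 3.8+
    # keeps insertion order instead of sorting).
    attrs = sorted(node.attributes.items())
    for name, _ in attrs:
        node.removeAttribute(name)
    for name, value in attrs:
        node.setAttribute(name, value)

    children = list(node.childNodes)
    elements = [c for c in children if c.nodeType == c.ELEMENT_NODE]
    for element in elements:
        sort_children(element)

    # Assumption: only whitespace-only text may be thrown away. Anything
    # else (real text, comments, ...) means we leave this node's order alone.
    others = [c for c in children if c.nodeType != c.ELEMENT_NODE]
    only_whitespace = all(
        c.nodeType == c.TEXT_NODE and not c.data.strip() for c in others
    )
    if not elements or not only_whitespace:
        return

    # Detach every child, then re-append the elements in sorted order.
    for child in children:
        node.removeChild(child)
    for element in sorted(elements, key=lambda n: n.nodeName):
        node.appendChild(element)


xmlsrc = """<booga foo="1" bar="2">
  <well>hello</well>
  <blah>goodbye</blah>
</booga>
"""
doc = xml.dom.minidom.parseString(xmlsrc)
sort_children(doc.documentElement)
print(doc.documentElement.toxml())
```

Sorting by nodeName alone leaves same-named siblings in their original 
relative order, since sorted() is stable.  Whether reordering is safe at all 
still depends on the schema, as noted earlier.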