cElementTree clear semantics

Fredrik Lundh fredrik at pythonware.com
Sun Sep 25 15:18:38 EDT 2005


Igor V. Rafienko wrote:

> Finally, I thought about keeping track of when to clear and when not
> to by subscribing to start and end elements (so that I would collect
> the entire <schnappi>-subtree in memory and only then release it):
>
> from cElementTree import iterparse
> clear_flag = True
> for event, elem in iterparse("data.xml", ("start", "end")):
>     if event == "start" and elem.tag == "schnappi":
>         # start collecting elements
>         clear_flag = False
>     if event == "end" and elem.tag == "schnappi":
>         clear_flag = True
>         # do something with elem
>     # unless we are collecting elements, clear()
>     if clear_flag:
>         elem.clear()
>
> This gave me the desired behaviour, but:
>
> * It looks *very* ugly
> * It's twice as slow as version which sees 'end'-events only.
>
> Now, there *has* to be a better way. What am I missing?

the iterparse/clear approach works best if your XML file has a
record-like structure.  if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in:

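    for event, elem in iterparse("data.xml"):
        # only "end" events are reported by default, so the record
        # is complete when we see it here
        if elem.tag == "record":
            # deal with schnappi subrecords
            for schnappi in elem.findall(".//schnappi"):
                process(schnappi)
            # release the record's contents once we're done with it
            elem.clear()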
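
for reference, here is a self-contained sketch of the record-based loop that can be run as is.  it assumes the standard library's xml.etree.ElementTree (the module that cElementTree later became) and a made-up sample document; the schnappi_texts helper and the SAMPLE data are illustrative names, not part of any library:

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib successor of cElementTree

# hypothetical sample data: toplevel records with schnappi subrecords
SAMPLE = b"""<data>
  <record id="1"><schnappi>a</schnappi><other/><schnappi>b</schnappi></record>
  <record id="2"><group><schnappi>c</schnappi></group></record>
</data>"""

def schnappi_texts(source):
    """Yield the text of every schnappi element, one record at a time."""
    for event, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == "record":
            # the whole record subtree is available at its "end" event
            for schnappi in elem.findall(".//schnappi"):
                yield schnappi.text
            # drop the record's children, text and attributes
            elem.clear()

print(list(schnappi_texts(io.BytesIO(SAMPLE))))  # ['a', 'b', 'c']
```

only the record elements are cleared here; the (now empty) record elements themselves stay attached to the root, which is usually cheap enough to ignore.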

the collect flag approach isn't that bad ("twice as slow" doesn't
really say much: "raw" cElementTree is extremely fast compared
to the Python interpreter, so everything you end up doing in
Python will slow things down quite a bit).

to make your application code look a bit less convoluted, put the
logic in a generator function:

    # in library
    def process(filename, annoying_animal):
        clear = True
        start = "start"; end = "end"
        for event, elem in iterparse(filename, (start, end)):
            if elem.tag == annoying_animal:
                if event is start:
                    clear = False
                else:
                    yield elem
                    clear = True
            if clear:
                elem.clear()

    # in application
    for subelem in process(filename, "schnappi"):
        pass  # do something with subelem

(I've reorganized the code a bit to cut down on the operations.
also note the "is" trick; iterparse returns the event strings you
pass in, so comparing on object identities is safe)
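
a runnable sketch of the same generator idea, for experimenting.  it assumes the standard library's xml.etree.ElementTree instead of cElementTree, uses a made-up sample document, and compares events with == rather than "is" to stay on the safe side across implementations; the subtrees name is just for illustration:

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib successor of cElementTree

# hypothetical sample data with schnappi elements at different depths
SAMPLE = (b"<zoo><cage><schnappi><name>one</name></schnappi></cage>"
          b"<schnappi><name>two</name></schnappi><noise>x</noise></zoo>")

def subtrees(source, tag):
    """Yield each complete element named `tag`; clear everything else."""
    clear = True
    for event, elem in ET.iterparse(source, ("start", "end")):
        if elem.tag == tag:
            if event == "start":
                clear = False   # start collecting the subtree
            else:
                yield elem      # subtree is complete; hand it out
                clear = True    # cleared once the caller asks for more
        if clear:
            elem.clear()

# use each subtree while the generator is suspended, before it is cleared
names = [e.findtext("name") for e in subtrees(io.BytesIO(SAMPLE), "schnappi")]
print(names)  # ['one', 'two']
```

note that the yielded element is cleared as soon as the loop asks for the next one, so anything you want to keep has to be copied out inside the loop body.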

an alternative is to use the lower-level XMLParser class (which
is similar to SAX, but faster), but that will most likely result in
more and trickier Python code...
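
to give an idea of what that looks like, here is a rough sketch using the target interface of XMLParser as found in today's xml.etree.ElementTree (an assumption; the details may differ from the cElementTree release discussed here).  the SchnappiText class is a hypothetical helper that only collects character data, not full subtrees:

```python
import xml.etree.ElementTree as ET  # assumption: stdlib successor of cElementTree

class SchnappiText:
    """Parser target that collects the character data inside each `tag`."""

    def __init__(self, tag):
        self.tag = tag
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # one string per completed match

    def start(self, tag, attrib):
        if tag == self.tag:
            self.depth += 1

    def data(self, text):
        if self.depth:
            self.chunks.append(text)

    def end(self, tag):
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def close(self):
        # whatever the target returns here is returned by parser.close()
        return self.results

parser = ET.XMLParser(target=SchnappiText("schnappi"))
parser.feed(b"<zoo><schnappi>a<b>b</b></schnappi><x>skip</x>"
            b"<schnappi>c</schnappi></zoo>")
texts = parser.close()
print(texts)  # ['ab', 'c']
```

no tree is built at all in this variant, so there is nothing to clear, but all the bookkeeping (depth tracking, text joining) ends up in Python code.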

</F>





