lxml/ElementTree and .tail

Thu Nov 16 06:50:10 EST 2006

Thanks for the comments and thoughts.  I must admit that I have an  
overwhelming feeling of having just stepped into the middle of a  
complex, heated conversation without having heard the preamble.

(FYI, this reply is only an attempt to help those who come  
afterwards -- I'm not looking to advocate much of anything here.)

Fredrik's invocation of the "infoset" term led me to a couple of  
quick searches that clarified the state of play.  Here he sets the  
stage for the .tail behaviour that I originally posted about:

http://effbot.org/zone/element-infoset.htm

And it looks like there have been tussles over other mismatches in  
expectations before, specifically around how namespaces are handled:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/31b2e9f4a8f7338c
http://nixforums.org/ntopic43901.html

From what I can see, more than a few people have stumbled over  
ElementTree's API because of preexisting expectations, which others  
have (probably correctly) bucketed as "implementation details".  This  
comes as quite a shock to those of us who have stumbled and who have,  
lo these many years, come to view those details as the only standard  
that matters (perhaps simply because they have been so consistent in  
our experience).

Which, in my view, is just fine -- different strokes for different  
folks, and all that.  When I originally started poking around the  
python xml world, I was somewhat confused as to why 4suite/Domlette  
existed, as it seemed pretty clear that ElementTree had crystallized  
a lot of mindshare, and has a very attractive API to boot.   
Thankfully, I can now see 4suite/Domlette's appeal, and am very glad  
it's around, as it seems to have all of those comfortable  
implementation details that I've been looking for. :-)

As for the infoset vs. "sequence of piggies" nut: if ElementTree's  
infoset approach is technically correct, then wouldn't it also be  
correct to use a .head attribute instead of a .tail attribute?  Example:

<a>first<b>middle</b>last</a>

might be represented as:

<Element a: head='', text='last'>
     <Element b: head='first', text='middle'>

If I'm wrong, just chalk it up to the fact that this is the first  
time I've ever looked at the Infoset spec, and I'm simply confused.   
If that IS a technically-valid way to represent the above xml  
fragment . . . then I guess I'll make sure to tread more carefully in  
the future around tools that work in infoset terms.  For me, it turns  
out that sequences of piggies really are important, at least in  
contexts where XML is merely a means to an end (either because of the  
attractiveness of the toolsets or because we must cope with what  
we're provided as input) and where consistency with existing tools  
(like those that adhere to DOM level 2/3) and expectations are  
critical.  I think this is what Paul was nodding towards with his  
original response to Stefan's response.
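
For anyone who comes afterwards, here is a minimal sketch of how the
fragment above is actually stored, and of what removing <b> does to the
trailing text.  It assumes the standard-library xml.etree.ElementTree
module; as far as I can tell lxml.etree behaves the same way for this
case, but I have only checked the stdlib module here.  The names "parts"
and "after_removal" are just illustrative.

```python
import xml.etree.ElementTree as ET

# Parse the fragment used in the example above.
a = ET.fromstring("<a>first<b>middle</b>last</a>")
b = a[0]

# Text before the first child lives on the parent as .text;
# text after a child's end tag lives on that child as .tail.
parts = (a.text, b.text, b.tail)
print(parts)

# Removing <b> from its parent takes its .tail along with it,
# so "last" disappears from <a> as well.
a.remove(b)
after_removal = ET.tostring(a, encoding="unicode")
print(after_removal)

# The trailing text is still attached to the detached Element.
print(repr(b.tail))
```

So anyone who wants "last" to survive the removal has to move b.tail
somewhere else (onto the previous sibling's .tail, or onto the parent's
.text) before calling remove().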

Cheers,

- Chas

On Nov 16, 2006, at 5:11 AM, Fredrik Lundh wrote:

> Paul Boddie wrote:
>
>>> Yes, it is. Just look at the API. It's an attribute of an  
>>> Element, isn't it?
>>> What other API do you know where removing an element from a data  
>>> structure
>>> leaves part of the element behind?
>>
>> I guess it depends on what you regard an element to be...
>
> Stefan said "Element", not "element".
>
> "Element" is a class in the "ElementTree" module, which can be used to
> *represent* an XML element in an XML infoset, including all the data
> *inside* the XML element, and any data *between* that XML element and
> the next one (which is always character data, of course).
>
> It's not very difficult, really; especially if you, as Stefan said,
> think in infoset terms rather than "a sequence of little piggies" terms.
>
> </F>