lxml/ElementTree and .tail
Chas Emerick
cemerick at snowtide.com
Thu Nov 16 06:50:10 EST 2006
Thanks for the comments and thoughts. I must admit that I have an
overwhelming feeling of having just stepped into the middle of a
complex, heated conversation without having heard the preamble.
(FYI, this reply is only an attempt to help those that come
afterwards -- I'm not looking to advocate much of anything here.)
Fredrik's invocation of the "infoset" term led me to a couple of
quick searches that clarified the state of play. Here he sets the
stage for the .tail behaviour that I originally posted about:
http://effbot.org/zone/element-infoset.htm
And it looks like there have been tussles over other mismatches in
expectations before, specifically around how namespaces are handled:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/31b2e9f4a8f7338c
http://nixforums.org/ntopic43901.html
From what I can see, more than a few people have stumbled over
ElementTree's API because of their preexisting expectations, which
others have (probably correctly) bucketed as "implementation
details". This comes as quite a shock to those of us who stumbled
and who have, lo these many years, come to view those details as the
only standard that matters (perhaps simply because those details have
been so consistent in our experience).
Which, in my view, is just fine -- different strokes for different
folks, and all that. When I originally started poking around the
Python XML world, I was somewhat confused as to why 4suite/Domlette
existed, as it seemed pretty clear that ElementTree had captured
a lot of mindshare, and has a very attractive API to boot.
Thankfully, I can now see the appeal of 4suite/Domlette, and am very
glad it's around, as it seems to have all of those comfortable
implementation details that I've been looking for. :-)
As for the infoset vs. "sequence of piggies" nut: if ElementTree's
infoset approach is technically correct, then wouldn't it also be
correct to use a .head attribute instead of a .tail attribute? Example:
<a>first<b>middle</b>last</a>
might be represented as:
<Element a: head='', text='last'>
<Element b: head='first', text='middle'>
If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.
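For comparison, here is a minimal sketch of how ElementTree itself
stores that fragment, assuming Python's standard-library
xml.etree.ElementTree module (the variable names are only for
illustration). The text before <b> lives in a.text, and the text
after </b> lives in b.tail:

```python
import xml.etree.ElementTree as ET

# Parse the fragment from the example above.
a = ET.fromstring("<a>first<b>middle</b>last</a>")
b = a[0]  # the only child of <a>

# .text is the character data before the first child (or the whole
# content if there are no children); .tail is the character data
# between an element's end tag and the next sibling or parent end tag.
print(repr(a.text), repr(a.tail))  # 'first' None
print(repr(b.text), repr(b.tail))  # 'middle' 'last'
```

So the .tail layout attaches "last" to b, where my hypothetical
.head layout would attach "first" to b instead.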
If that IS a technically-valid way to represent the above xml
fragment . . . then I guess I'll make sure to tread more carefully in
the future around tools that work in infoset terms. For me, it turns
out that sequences of piggies really are important, at least in
contexts where XML is merely a means to an end (either because of the
attractiveness of the toolsets or because we must cope with what
we're provided as input) and where consistency with existing tools
(like those that adhere to DOM Level 2/3) and expectations is
critical. I think this is what Paul was nodding towards with his
original response to Stefan's response.
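For those who come afterwards, here is a rough sketch of the
practical difference, again assuming the standard-library
xml.etree.ElementTree module. The remove_keep_tail function is a
hypothetical helper of my own naming, not part of ElementTree or
lxml; it just moves the trailing text onto the previous sibling (or
the parent's .text) before removing the element, which is closer to
what a DOM user would expect:

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Hypothetical helper: remove child but keep the text after it."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        # No previous sibling: the text joins the parent's leading text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise it joins the previous sibling's tail.
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + tail
    child.tail = None
    parent.remove(child)

SOURCE = "<a>first<b>middle</b>last</a>"

# Plain removal: the tail belongs to the Element, so it goes too.
plain = ET.fromstring(SOURCE)
plain.remove(plain[0])
print(ET.tostring(plain, encoding="unicode"))  # <a>first</a>

# DOM-style removal: only the element goes, "last" stays behind.
kept = ET.fromstring(SOURCE)
remove_keep_tail(kept, kept[0])
print(ET.tostring(kept, encoding="unicode"))  # <a>firstlast</a>
```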
Cheers,
- Chas
On Nov 16, 2006, at 5:11 AM, Fredrik Lundh wrote:
> Paul Boddie wrote:
>
>>> Yes, it is. Just look at the API. It's an attribute of an
>>> Element, isn't it?
>>> What other API do you know where removing an element from a data
>>> structure leaves part of the element behind?
>>
>> I guess it depends on what you regard an element to be...
>
> Stefan said "Element", not "element".
>
> "Element" is a class in the "ElementTree" module, which can be used to
> *represent* an XML element in an XML infoset, including all the data
> *inside* the XML element, and any data *between* that XML element and
> the next one (which is always character data, of course).
>
> It's not very difficult, really; especially if you, as Stefan said,
> think in infoset terms rather than "a sequence of little piggies" terms.
>
> </F>