lxml/ElementTree and .tail
Chas Emerick
cemerick at snowtide.com
Sat Nov 18 12:40:05 EST 2006
On Nov 18, 2006, at 11:29 AM, Fredrik Lundh wrote:
> Chas Emerick wrote:
>
>>> and keep patting ourselves on the back, while the rest of the world
>>> is busy routing around us, switching to well-understood XML subsets
>>> or other serialization formats, simpler and more flexible data
>>> models, simpler APIs, and more robust code. and Python ;-)
>>
>> That's flatly unrealistic. If you'll remember, I'm not one of "those
>> people" that are specification-driven -- I hadn't even *heard* of
>> Infoset until earlier this week!
>
> The rant wasn't directed at you or anyone special, but I don't really
> think you got the point of it either. Which is a bit strange, because
> it sounded like you *were* working on extracting information from
> messy documents, so the "it's about the data, dammit" way of thinking
> shouldn't be news to you.
No, it's not any kind of news at all, and I'm very sympathetic to
your specific perspective (and have advocated it in other contexts
and circumstances, where appropriate). And yes, we are in fact
ensuring that we get from the HTML/XHTML/text/PDF/etc serialization
we have to consume to a uniform, normalized, and "clean" data model
in as few steps as possible. However, in those few steps, we have to
recognize the functional reality of how each data representation is
used out in the world in order to translate it into a uniform model
for our own purposes. In concrete terms, that means that an end tag
in an XHTML serialization means that that element is closed, done,
fini. Any other representation of that serialization doesn't
correspond properly with the intent of that HTML document's author.
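The point of contention can be shown in a few lines. A minimal sketch using the stdlib xml.etree.ElementTree (lxml.etree exposes the same API): text that follows an element's end tag is stored on that element's .tail attribute, not on the parent, even though the end tag has "closed" the element.

```python
import xml.etree.ElementTree as ET

# Text following </b> is attached to the <b> element's .tail,
# not stored as a child of the enclosing <p>.
p = ET.fromstring("<p>one <b>two</b> three</p>")
b = p.find("b")

print(repr(p.text))  # 'one '   -- text before the first child
print(repr(b.text))  # 'two'    -- text inside <b>
print(repr(b.tail))  # ' three' -- text after </b>, hung on <b>
```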
> And the routing around is not unrealistic, it's a *fact*; JSON and
> POX are killing the full XML/Schema/SOAP stack for communication,
> XHTML is pretty much dead as a wire format, people are apologizing in
> public for their use of SOAP, AJAX is quickly turning into AJAJ, few
> people care about the more obscure details of the XML 1.0 standard
> (when did you last see a conditional section? or even a DTD?), dealing
> with huge XML data sets is still extremely hard compared to just
> uploading the darn thing to a database and doing the crunching in SQL,
> and nobody uses XML 1.1 for anything.
>
> Practicality beats purity, and the Internet routes around damage,
> every single time.
I agree 100% -- but I would have thought that that's a point I would
have made. The model that ET uses seems like a "purified"
representation of a mixed-content serialization, exactly because it
is geared to an ideal rather than the practical realities of mixed
content and expectations thereof.
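To be fair to the model, the "purified" factoring is at least lossless; a short sketch with the stdlib ElementTree showing that both document-order text and the original mixed-content serialization are recoverable from the .text/.tail split:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>one <b>two</b> three</p>")

# Document-order text can be reassembled from .text and .tail ...
assert "".join(p.itertext()) == "one two three"

# ... and serializing round-trips the mixed content unchanged.
assert ET.tostring(p, encoding="unicode") == "<p>one <b>two</b> three</p>"
```

The disagreement in the thread is not about whether the data survives, but about where the tree hangs it.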
For what it's worth, our current effort is directed towards providing
significant stores/feeds of XML/PDF/HTML/text/etc in something that
can be dropped into an RDBMS. Perhaps that's the source of the
impedance between us: you view Infoset as a functional replacement
for serialization-dependent XML, whereas we are focussed on what
could be broadly described as a translation from one to the other.
>> overwhelming majority of the developers out there care for nothing
>> but the serialization, simply because that's how one plays nicely
>> with others.
>
> The problem is if you only stare at the serialization, your code
> *won't* play nicely with others. At the serialization level, it's
> easy to think that CDATA sections are different from other text, that
> character references are different from ordinary characters, that you
> should somehow be able to distinguish between <tag></tag> and <tag/>,
> that namespace prefixes are more important than the namespace URI,
> that an &nbsp; in an XHTML-style stream is different from a U+00A0
> character in memory, and so on. In my experience, serialization-only
> thinking (at the receiving end) is the single most common cause for
> interoperability problems when it comes to general XML interchange.
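The equivalences listed in that passage are easy to check. A sketch with the stdlib ElementTree (any conforming XML parser behaves the same way):

```python
import xml.etree.ElementTree as ET

# A character reference and a CDATA section containing the literal
# character both parse to the same text.
a = ET.fromstring("<t>A&#160;B</t>")
b = ET.fromstring("<t><![CDATA[A\xa0B]]></t>")
assert a.text == b.text == "A\xa0B"

# <tag/> and <tag></tag> are indistinguishable after parsing.
assert ET.fromstring("<tag/>").text == ET.fromstring("<tag></tag>").text

# Namespace prefixes don't survive parsing either; only the URI matters.
x = ET.fromstring('<a:t xmlns:a="urn:example"/>')
y = ET.fromstring('<t xmlns="urn:example"/>')
assert x.tag == y.tag == "{urn:example}t"
```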
I agree with all of that. I would again refer to the pervasive view
of what end tags mean -- that's what I was primarily referring to
with the term 'serialization'.
> (By the way, did ET fail to *read* your XML documents? I thought your
> complaint was that it didn't put the things it read in a place where
> you expected them to be, and that you didn't have time to learn how
> to deal with that because you had more important things to do, at the
> time?)
No, it doesn't put things in the right places, so I consider that a
failure of the model. I don't see why I should have spent time
learning how to deal with that when another very comprehensive
library is available that does meet expectations. *shrug*
Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed, so there's a much greater comfort level in working with a
library that explicitly supports the model that we expect (and was
assumed when the HTML [now XHTML] documents in question were authored).
- Chas