lxml/ElementTree and .tail
Chas Emerick
cemerick at snowtide.com
Sat Nov 18 12:40:05 EST 2006
On Nov 18, 2006, at 11:29 AM, Fredrik Lundh wrote:
> Chas Emerick wrote:
>
>>> and keep patting ourselves on the back, while the rest of the world
>>> is busy routing around us, switching to well-understood XML subsets
>>> or other serialization formats, simpler and more flexible data
>>> models, simpler APIs, and more robust code. and Python ;-)
>>
>> That's flatly unrealistic. If you'll remember, I'm not one of "those
>> people" that are specification-driven -- I hadn't even *heard* of
>> Infoset until earlier this week!
>
> The rant wasn't directed at you or anyone special, but I don't really
> think you got the point of it either. Which is a bit strange, because
> it sounded like you *were* working on extracting information from
> messy documents, so the "it's about the data, dammit" way of thinking
> shouldn't be news to you.
No, it's not any kind of news at all, and I'm very sympathetic to
your specific perspective (and have advocated it in other contexts
and circumstances, where appropriate). And yes, we are in fact
ensuring that we get from the HTML/XHTML/text/PDF/etc serialization
we have to consume to a uniform, normalized, and "clean" data model
in as few steps as possible. However, in those few steps, we have to
recognize the functional reality of how each data representation is
used out in the world in order to translate it into a uniform model
for our own purposes. In concrete terms, that means that an end tag
in an XHTML serialization means that that element is closed, done,
fini. Any other representation of that serialization doesn't
correspond properly with the intent of that HTML document's author.
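The point of contention can be shown in a few lines. A minimal sketch using the stdlib xml.etree.ElementTree (lxml.etree exposes the same API): text that follows an element's end tag is stored on that element's .tail attribute, not on the parent, even though the end tag has "closed" the element.

```python
import xml.etree.ElementTree as ET

# Text following </b> is attached to the <b> element's .tail,
# not stored as a child of the enclosing <p>.
p = ET.fromstring("<p>one <b>two</b> three</p>")
b = p.find("b")

print(repr(p.text))  # 'one '   -- text before the first child
print(repr(b.text))  # 'two'    -- text inside <b>
print(repr(b.tail))  # ' three' -- text after </b>, hung on <b>
```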
> And the routing around is not unrealistic, it's a *fact*; JSON and
> POX are killing the full XML/Schema/SOAP stack for communication,
> XHTML is pretty much dead as a wire format, people are apologizing in
> public for their use of SOAP, AJAX is quickly turning into AJAJ, few
> people care about the more obscure details of the XML 1.0 standard
> (when did you last see a conditional section? or even a DTD?), dealing
> with huge XML data sets is still extremely hard compared to just
> uploading the darn thing to a database and doing the crunching in SQL,
> and nobody uses XML 1.1 for anything.
>
> Practicality beats purity, and the Internet routes around damage,
> every single time.
I agree 100% -- but I would have thought that that's a point I would
have made. The model that ET uses seems like a "purified"
representation of a mixed-content serialization, exactly because it
is geared to an ideal rather than the practical realities of mixed
content and expectations thereof.
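To be fair to the model, the "purified" factoring is at least lossless; a short sketch with the stdlib ElementTree showing that both document-order text and the original mixed-content serialization are recoverable from the .text/.tail split:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>one <b>two</b> three</p>")

# Document-order text can be reassembled from .text and .tail ...
assert "".join(p.itertext()) == "one two three"

# ... and serializing round-trips the mixed content unchanged.
assert ET.tostring(p, encoding="unicode") == "<p>one <b>two</b> three</p>"
```

The disagreement in the thread is not about whether the data survives, but about where the tree hangs it.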
For what it's worth, our current effort is directed towards providing
significant stores/feeds of XML/PDF/HTML/text/etc in something that
can be dropped into an RDBMS. Perhaps that's the source of the
impedance between us: you view Infoset as a functional replacement
for serialization-dependent XML, whereas we are focussed on what
could be broadly described as a translation from one to the other.
>> overwhelming majority of the developers out there care for nothing
>> but the serialization, simply because that's how one plays nicely
>> with others.
>
> The problem is if you only stare at the serialization, your code
> *won't* play nicely with others. At the serialization level, it's
> easy to think that CDATA sections are different from other text, that
> character references are different from ordinary characters, that you
> should somehow be able to distinguish between <tag></tag> and <tag/>,
> that namespace prefixes are more important than the namespace URI,
> that an &nbsp; in an XHTML-style stream is different from a U+00A0
> character in memory, and so on. In my experience, serialization-only
> thinking (at the receiving end) is the single most common cause for
> interoperability problems when it comes to general XML interchange.
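The equivalences listed in that passage are easy to check. A sketch with the stdlib ElementTree (any conforming XML parser behaves the same way):

```python
import xml.etree.ElementTree as ET

# A character reference and a CDATA section containing the literal
# character both parse to the same text.
a = ET.fromstring("<t>A&#160;B</t>")
b = ET.fromstring("<t><![CDATA[A\xa0B]]></t>")
assert a.text == b.text == "A\xa0B"

# <tag/> and <tag></tag> are indistinguishable after parsing.
assert ET.fromstring("<tag/>").text == ET.fromstring("<tag></tag>").text

# Namespace prefixes don't survive parsing either; only the URI matters.
x = ET.fromstring('<a:t xmlns:a="urn:example"/>')
y = ET.fromstring('<t xmlns="urn:example"/>')
assert x.tag == y.tag == "{urn:example}t"
```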
I agree with all of that. I would again refer to the pervasive view
of what end tags mean -- that's what I was primarily referring to
with the term 'serialization'.
> (By the way, did ET fail to *read* your XML documents? I thought your
> complaint was that it didn't put the things it read in a place where
> you expected them to be, and that you didn't have time to learn how
> to deal with that because you had more important things to do, at the
> time?)
No, it doesn't put things in the right places, so I consider that a
failure of the model. I don't see why I should have spent time
learning how to deal with that when another very comprehensive
library is available that does meet expectations. *shrug*
Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed, so there's a much greater comfort level in working with a
library that explicitly supports the model that we expect (and was
assumed when the HTML [now XHTML] documents in question were authored).
- Chas