lxml/ElementTree and .tail

Chas Emerick cemerick at snowtide.com
Thu Nov 16 07:58:38 EST 2006


On Nov 16, 2006, at 7:25 AM, Fredrik Lundh wrote:

>> If I'm wrong, just chalk it up to the fact that this is the first
>> time I've ever looked at the Infoset spec, and I'm simply confused.
>
> the Infoset spec *is* the essence of XML; if you don't realize that an
> XML document is just a serialization of a very simple data model,
> you're bound to be fighting with XML all the time.

The principle and the practice diverge significantly in our neck of
the woods.  The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML.  Of
course, as you say, those documents are still serializations of a
simple data model, but the types of manipulations we do happen to
butt up very uncomfortably against the way ET does things.

> but ET doesn't implement the Infoset spec as it is, of course: it uses a
> *simplified* model, carefully optimized for the large percentage of all
> XML formats that simply doesn't use mixed content.  if you're doing
> document-style processing, you sometimes need to add an extra assignment
> or two, but unless you're doing *only* document-style processing, ET's
> API gives you a net win.  (and even if you're doing only document-style
> processing, ET's speed and memory footprint gives you a net win over
> most competing technologies).
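
For readers following along, the "extra assignment or two" can be
illustrated with a minimal sketch using the standard library's
xml.etree.ElementTree.  The sample markup and variable names are made up
for illustration; the point is only that text following a child element
is stored on that child's .tail rather than as a separate text node.

```python
import xml.etree.ElementTree as ET

# Build <p>Hello <b>big</b> world</p> by hand (illustrative markup).
p = ET.Element("p")
p.text = "Hello "        # text before the first child lives on the parent
b = ET.SubElement(p, "b")
b.text = "big"           # text inside <b>
b.tail = " world"        # text after </b>: the "extra assignment"

serialized = ET.tostring(p, encoding="unicode")
print(serialized)

# Parsing mixed content yields the same text/tail layout.
parsed = ET.fromstring(serialized)
print(repr(parsed.text), repr(parsed.find("b").tail))
```

In a DOM-style model the string " world" would be its own child node of
<p>; here it travels with the <b> element.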

Yeah, documents are all we do -- XML just happens to be a pleasant
intermediate format, and something we need to consume.  The notion of
a nicely formatted XML document is entirely foreign to the work that
we do -- in fact, our current focus is (in part) dragging decidedly
unstructured data out of those XHTML documents (among other source
formats) and putting it into a reasonable, useful structure.

I took some time last night to bang out some functions that squeezed
ET's model (via lxml) into doing what we need, and it ended up
requiring a lot more B&D than I like.  At that point, I swung over to
4Suite, which dropped into place quite nicely.
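
To give a flavor of the kind of bookkeeping involved (this is a
hypothetical sketch, not the functions mentioned above), here is what
removing an element from mixed content might look like with the standard
library's xml.etree.ElementTree.  The helper name remove_keep_tail and
the sample markup are assumptions made for illustration.  Because the
text after an element is stored on that element's .tail, the helper has
to move that text somewhere else before removing the element.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, keeping the text that followed it.

    The child's tail is appended to the previous sibling's tail, or to
    the parent's text if the child is the first subelement.
    """
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)

doc = ET.fromstring("<p>one <i>two</i> three <b>four</b> five</p>")
remove_keep_tail(doc, doc.find("b"))
result = ET.tostring(doc, encoding="unicode")
print(result)
```

Without the tail handling, the text " five" would leave the tree along
with the <b> element.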

*shrug* I guess we're just in the minority with regard to our API  
requirements -- we happen to live in the corner cases.  I'm certainly  
glad to have made the detour on a different path for a bit though.

- Chas


