lxml/ElementTree and .tail

Thu Nov 16 08:29:18 EST 2006

On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote:

> Chas Emerick wrote:
>
>> The principle and the practice diverge significantly in our neck of
>> the woods.  The current project involves consuming and making sense
>> of extraordinarily (and typically unnecessarily) complex XHTML.
>
> wasn't your original complaint that ET didn't do the "right thing" when
> you removed elements from a mixed-content tree? (something that can be
> trivially handled with a 2-line helper function)

Yes, that was the initial issue, but the delta between Elements and  
DOM-style elements leads to other issues.  There's no doubt that the  
needed helpers are simple, but all things being equal, not having to  
carry them around anywhere we're doing DOM manipulations is a big plus.
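
For reference, a helper of the kind being discussed might look roughly like the sketch below. It assumes the standard-library xml.etree.ElementTree module; the name remove_keep_tail is hypothetical and not part of ElementTree or lxml. It is somewhat longer than two lines because it also covers the case where the removed element is the first child, so its tail has to be folded into the parent's text.

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child from parent, keeping the text that followed it.

    Hypothetical helper: in ElementTree, text after an element's end tag
    is stored on that element as .tail, so a plain parent.remove(child)
    would drop it along with the element.
    """
    index = list(parent).index(child)
    if child.tail:
        if index == 0:
            # No preceding sibling: the tail joins the parent's own text.
            parent.text = (parent.text or "") + child.tail
        else:
            # Otherwise it joins the tail of the preceding sibling.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
    parent.remove(child)


# Mixed content: text, an element, more text, another element, more text.
root = ET.fromstring("<p>Hello <b>bold</b> world <i>x</i> end</p>")
remove_keep_tail(root, root.find("b"))
result = ET.tostring(root, encoding="unicode")
```

With a DOM-style API the trailing text would be a separate node and would survive the removal without any helper, which is the delta in question.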

> why mutate the tree if all you want is to extract information from it?
> doesn't sound very efficient to me...

Because we're far from doing anything that is regular or one-off in  
nature.  We're systematizing the extraction of data from functionally  
unstructured content, and it's flatly necessary to normalize the  
XHTML into something that can be easily consumed by the processes  
we've built that can do that content->data extraction/conversion from  
plain text, XML, PDF, and now XHTML.

Remember, corner cases. :-)

- Chas