lxml/ElementTree and .tail
Chas Emerick
cemerick at snowtide.com
Thu Nov 16 08:29:18 EST 2006
On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote:
> Chas Emerick wrote:
>
>> The principle and the practice diverge significantly in our neck of
>> the woods. The current project involves consuming and making sense
>> of extraordinarily (and typically unnecessarily) complex XHTML.
>
> wasn't your original complaint that ET didn't do the "right thing" when
> you removed elements from a mixed-content tree? (something that can be
> trivially handled with a 2-line helper function)
Yes, that was the initial issue, but the delta between Elements and
DOM-style elements leads to other issues. There's no doubt that the
needed helpers are simple, but all things being equal, not having to
carry them around everywhere we do DOM manipulations is a big plus.
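For anyone following along, here is a minimal sketch of the kind of helper under discussion, assuming the standard xml.etree.ElementTree API. The name remove_keep_tail is hypothetical (it is not part of ET or lxml), and this is just one way to do it: the removed element's .tail is attached to the previous sibling's tail, or to the parent's .text if there is no previous sibling.

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child from parent without losing the text that follows it.

    Hypothetical helper, not part of ElementTree or lxml.
    """
    tail = child.tail or ""
    index = list(parent).index(child)
    if index > 0:
        # Text after the removed element now follows the previous sibling.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        # No previous sibling: the text joins the parent's leading text.
        parent.text = (parent.text or "") + tail
    parent.remove(child)


root = ET.fromstring("<p>Hello <b>bold</b> world</p>")
remove_keep_tail(root, root.find("b"))
print(ET.tostring(root, encoding="unicode"))
```

A plain parent.remove(child) would drop " world" along with the b element, which is the mixed-content surprise that started this thread.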
> why mutate the tree if all you want is to extract information from it?
> doesn't sound very efficient to me...
Because we're far from doing anything that is regular or one-off in
nature. We're systematizing the extraction of data from functionally
unstructured content, and it's flatly necessary to normalize the
XHTML into something that can be easily consumed by the processes
we've built that can do that content->data extraction/conversion from
plain text, XML, PDF, and now XHTML.
Remember, corner cases. :-)
- Chas