[Python-Dev] Integrate lxml into the stdlib?

Stefan Behnel stefan_ml at behnel.de
Thu Mar 5 21:34:41 CET 2009


Guido van Rossum wrote:
> On Thu, Mar 5, 2009 at 11:22 AM, Stefan Behnel wrote:
>> I'm happy to see you jump onto this.
> 
> I'm not sure why you say that -- all I am doing is advising *against* inclusion.

I understand that. What is a discussion worth where everyone just nods in
agreement? :)


>> Guido van Rossum wrote:
> There's *waaaay* too much stuff in the XML world to ever hope to have
> comprehensive support in the stdlib.

Definitely. But lxml was born because some Dutch guy thought that there was
way too little easy-to-master XML support in the overall Python world.

http://codespeak.net/lxml/intro.html

There is some space to look for a trade-off here.


>> It's good that (c)ElementTree is part of the stdlib, and it's also good
>> that there is a rather smooth upgrade path towards lxml.
> 
> And yet it worries me that lxml claims to be "mostly compatible" with
> ElementTree. What's keeping it from being completely (backwards)
> compatible?

The underlying tree model. An Element in lxml.etree knows its parent,
which isn't the case in ET. That's the main difference. Most people call
that a feature in lxml, but it's fundamental and it does have the
implication that you can't keep the same Element object in more than one place.
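
A minimal sketch of that difference, using only the stdlib ElementTree so it
runs anywhere. The helper name count_placements is made up for illustration;
the lxml part is optional and only runs if lxml happens to be installed.

```python
import xml.etree.ElementTree as ET


def count_placements(etree_module):
    """Append one Element object to two parents and report where it ended up.

    Hypothetical helper: works with any module that offers the ElementTree
    API (Element, SubElement, append).
    """
    root = etree_module.Element("root")
    a = etree_module.SubElement(root, "a")
    b = etree_module.SubElement(root, "b")
    shared = etree_module.Element("shared")
    a.append(shared)
    b.append(shared)  # second placement of the very same object
    return len(a), len(b)


# ElementTree has no parent pointers, so the object sits under both parents.
print("ElementTree:", count_placements(ET))

try:
    from lxml import etree as lxml_etree
except ImportError:
    lxml_etree = None

if lxml_etree is not None:
    # In lxml the second append moves the element, so "a" ends up empty.
    print("lxml.etree: ", count_placements(lxml_etree))
```

With ElementTree the result is one child under each parent; with lxml the
element has exactly one parent, so only the last placement survives.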

Some other (minor) differences are described here:

http://codespeak.net/lxml/dev/compatibility.html


>> But lxml is by
>> itself becoming more and more a critical dependency of web related packages
>> and applications, simply because it provides everything in one tool.
> 
> That depends on how XML-centric your thinking is. Personally I *don't*
> like putting everything in XML, and so far I have been able to keep my
> code 99% XML-free.

That's totally fine. I used Python for years without ever feeling the need
to deploy any of the dbm databases in my projects. Nor curses, nor tk. And
lxml.objectify only supports pickle because one of the developers thought
it was a good idea to pickle trees. And yet all of these modules are part
of the stdlib, and I bet there are a whole lot of applications by now that
wouldn't work without them.


>> I would expect that even if lxml itself was in the stdlib, Linux
>> distributions would (want to) build it against their system libraries.
>> Static builds would only be required on MacOS-X and Windows.
> 
> And that in itself is one of the main arguments against inclusion in
> the stdlib, since it adds a whole new level of complexity to the
> compatibility matrix. E.g. assume that some newer version of libxml2
> has a new feature.

That happens. So far, I have managed to keep lxml backwards compatible over
more than three years of libxml2 releases. However:

> You can wrap that feature with an API in lxml, but
> now you require that newer libxml2 version as a dependency. Since the
> distros don't support that they either are prevented from providing
> the corresponding newer version of Python or you will have to make the
> lxml code conditional on the presence or absence of that API. The
> latter is preferable, but now it means that Python users can't rely on
> that API being present even if they have the right version of Python.
> It's a mess. Requiring a 3rd party download makes this cleaner,
> because you decouple the libxml2/lxml versioning from the Python
> version.

A good example is actually (once again) parsing broken HTML. libxml2
handles this a lot better since 2.6.21, so if you use 2.6.20, you will
simply not get the same results as with a later version.
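
As a rough sketch, user code that cares about this would have to check the
library version at runtime. The helper name below is made up for
illustration; lxml.etree.LIBXML_VERSION is the version tuple lxml reports for
the libxml2 it is running against, and that part only runs if lxml is
installed.

```python
# libxml2 version from which, as stated above, broken HTML is handled better.
IMPROVED_HTML_RECOVERY = (2, 6, 21)


def has_improved_html_recovery(libxml_version):
    """Hypothetical helper: is this libxml2 version tuple new enough?"""
    return tuple(libxml_version) >= IMPROVED_HTML_RECOVERY


try:
    from lxml import etree
except ImportError:
    etree = None

if etree is not None:
    # The same Python and lxml versions can give different answers here,
    # depending on which libxml2 is installed on the system.
    print(etree.LIBXML_VERSION, has_improved_html_recovery(etree.LIBXML_VERSION))
```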

I do see the point you are making here. Even if lxml gets mature and
static, that doesn't necessarily apply to the external libraries it uses.
However, I should note that exactly the same argument also applies to
sqlite3 and gdbm, which, again, are in the stdlib today, with sqlite3 being
a fairly recent addition.
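
For sqlite3 this is easy to see from the stdlib itself: the module reports
the version of the SQLite library it runs against separately from the Python
version. A small sketch (the function names are made up for illustration):

```python
import sqlite3
import sys


def describe_versions():
    """Return the Python version and the SQLite library version side by side."""
    return {
        "python": tuple(sys.version_info[:3]),
        "sqlite_library": sqlite3.sqlite_version_info,
    }


def sqlite_at_least(minimum):
    """Hypothetical guard for code relying on a newer SQLite feature."""
    return sqlite3.sqlite_version_info >= tuple(minimum)


print(describe_versions())
```

Two machines with the same Python release can print different SQLite library
versions, which is the compatibility matrix being discussed.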

Stefan


