[Python-Dev] Integrate lxml into the stdlib? (was: Integrate BeautifulSoup into stdlib?)

Thu Mar 5 20:35:39 CET 2009

On Thu, Mar 5, 2009 at 11:22 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> I'm happy to see you jump onto this.

I'm not sure why you say that -- all I am doing is advising *against* inclusion.

> Guido van Rossum wrote:
>> No matter how beautiful and fast lxml is, it has one downside where it
>> comes to installing it into the stdlib: it is based on large, complex
>> 3rd party libraries, libxml2 and libxslt.
>
> I actually had a recent discussion with other lxml developers and we were
> fast to agree that that would be the main problem. lxml itself is only some
> 18K lines of Cython code (which translates into 180K lines of C code) and
> less than 7K lines of Python code, but libxml2 and libxslt add up to about
> 230K lines of C code just by themselves. That is definitely far from
> trivial and it's hard to guarantee that bugs in these libraries will never
> lead to security holes in a Python release, for example.
>
> Still, it does provide an awful lot of things that the stdlib currently
> fails to deliver in one way or another, some even completely. XPath, XSLT,
> XML validation and (above all) real-world HTML parsing come to mind. I
> definitely stopped counting the posts on c.l.py about HTMLParser not being
> able to parse a specific web page.

There's *waaaay* too much stuff in the XML world to ever hope to have
comprehensive support in the stdlib. Heck, XmlPlus hasn't even been
incorporated into the stdlib.

> It's good that (c)ElementTree is part of the stdlib, and it's also good
> that there is a rather smooth upgrade path towards lxml.

And yet it worries me that lxml claims to be "mostly compatible" with
ElementTree. What's keeping it from being completely (backwards)
compatible?
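
For illustration, the upgrade path under discussion is usually written as a
guarded import. This is a minimal sketch, assuming the calling code sticks to
the subset of the API that both libraries share; the sample document and
variable names are made up for the example.

```python
# Prefer lxml when it is installed, otherwise fall back to the stdlib.
try:
    from lxml import etree  # third-party package, may be absent
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib ElementTree

# Only calls common to both implementations are used below.
root = etree.fromstring("<doc><item>a</item><item>b</item></doc>")
items = [el.text for el in root.findall("item")]
```

Anything outside the shared subset (XPath beyond the ElementTree path
language, XSLT, validation) would need an lxml-only code path.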

> But lxml is by
> itself becoming more and more a critical dependency of web related packages
> and applications, simply because it provides everything in one tool.

That depends on how XML-centric your thinking is. Personally I *don't*
like putting everything in XML, and so far I have been able to keep my
code 99% XML-free.

> And
> even if I wasn't the author of lxml, I would have a hard time feeling happy
> if a real-world HTML parser was added to the stdlib that provides a totally
> different interface than the best (and fastest) XML library that the stdlib
> currently has.

That sounds like a completely different argument and one you should
have with the proponents of inclusion of that other parser. I can only
assume you're talking about html5lib or BeautifulSoup. I have no
knowledge of either of these, and prefer to stay out of that discussion.

>> Instead, let's hope Linux distros pick it up (and if anyone knows how
>> to encourage that, let us know).
>
> At least all Debian based distros (such as Ubuntu) have it available. Not
> the latest, greatest version, but that will come. That said, it's never
> been a real problem to EasyInstall lxml directly from PyPI onto any decent
> Linux distribution. MacOS-X is a far more tricky target here, not to say a
> word about Windows (C-compiler? anyone?).
>
> I would expect that even if lxml itself was in the stdlib, Linux
> distributions would (want to) build it against their system libraries.
> Static builds would only be required on MacOS-X and Windows.

And that in itself is one of the main arguments against inclusion in
the stdlib, since it adds a whole new level of complexity to the
compatibility matrix. E.g. assume that some newer version of libxml2
has a new feature. You can wrap that feature with an API in lxml, but
now you require that newer libxml2 version as a dependency. Since the
distros don't support that, either they are prevented from providing
the corresponding newer version of Python or you will have to make the
lxml code conditional on the presence or absence of that API. The
latter is preferable, but now it means that Python users can't rely on
that API being present even if they have the right version of Python.
It's a mess. Requiring a 3rd party download makes this cleaner,
because you decouple the libxml2/lxml versioning from the Python
version.
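
To make the "conditional on the presence or absence of that API" case
concrete, here is a rough sketch of version-gated feature detection. The
feature names and version numbers are invented for illustration and do not
describe any real libxml2 or lxml API.

```python
# Hypothetical sketch: gate an optional wrapper feature on the version of
# the underlying C library found at runtime. All names and versions here
# are invented for the example.

# Minimum library version (major, minor, patch) each feature needs.
FEATURE_REQUIREMENTS = {
    "basic_parsing": (2, 6, 0),
    "shiny_new_feature": (2, 7, 3),
}


def has_feature(name, library_version):
    """Return True if library_version is new enough for the named feature."""
    return library_version >= FEATURE_REQUIREMENTS[name]


def use_shiny_new_feature(library_version):
    """Use the optional feature, or fail clearly if the library is too old."""
    if not has_feature("shiny_new_feature", library_version):
        raise NotImplementedError(
            "shiny_new_feature needs library version >= %d.%d.%d"
            % FEATURE_REQUIREMENTS["shiny_new_feature"]
        )
    return "feature result"
```

Callers then have to probe with has_feature() before every use, whatever
Python version they run, which is exactly the uncertainty described above.
With lxml as a separate download, the version tuple would come from
something like lxml.etree.LIBXML_VERSION rather than from the Python release.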

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)