[Python-Dev] Integrate lxml into the stdlib? (was: Integrate BeautifulSoup into stdlib?)

Stefan Behnel stefan_ml at behnel.de
Thu Mar 5 20:22:50 CET 2009


Hi Guido,

I'm happy to see you jump onto this.

Guido van Rossum wrote:
> No matter how beautiful and fast lxml is, it has one downside where it
> comes to installing it into the stdlib: it is based on large, complex
> 3rd party libraries, libxml2 and libxslt.

I actually had a recent discussion with other lxml developers and we were
quick to agree that this would be the main problem. lxml itself is only some
18K lines of Cython code (which translates into 180K lines of C code) and
less than 7K lines of Python code, but libxml2 and libxslt add up to about
230K lines of C code just by themselves. That is definitely far from
trivial and it's hard to guarantee that bugs in these libraries will never
lead to security holes in a Python release, for example.

Still, it provides an awful lot of things that the stdlib currently
delivers only partially or not at all. XPath, XSLT, XML validation and
(above all) real-world HTML parsing come to mind. I stopped counting the
posts on c.l.py about HTMLParser not being able to parse a specific web
page a long time ago.

It's good that (c)ElementTree is part of the stdlib, and it's also good
that there is a rather smooth upgrade path towards lxml. But lxml itself is
increasingly becoming a critical dependency of web-related packages and
applications, simply because it provides everything in one tool. And even
if I weren't the author of lxml, I would have a hard time feeling happy if
a real-world HTML parser were added to the stdlib with a totally different
interface from the best (and fastest) XML library that the stdlib
currently has.
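
As a rough illustration of that upgrade path, here is a minimal sketch of
the usual "try lxml, fall back to ElementTree" import. The sample document
and the helper names (SAMPLE, entry_texts, entry_ids_above) are made up for
this example and are not part of either library. The xpath() call with a
keyword variable is lxml-only; the fallback branch emulates it by hand.

```python
# Prefer lxml.etree when it is installed, otherwise use the stdlib
# ElementTree, which shares the basic API.
try:
    from lxml import etree  # third-party, assumed installed from PyPI
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback
    HAVE_LXML = False

# Hypothetical sample document, for illustration only.
SAMPLE = "<feed><entry id='1'>first</entry><entry id='2'>second</entry></feed>"


def entry_texts(xml_text):
    """Return the text of every <entry>, using only the shared API."""
    root = etree.fromstring(xml_text)
    return [entry.text for entry in root.findall("entry")]


def entry_ids_above(xml_text, minimum):
    """Return the ids of entries whose numeric id exceeds `minimum`."""
    root = etree.fromstring(xml_text)
    if HAVE_LXML:
        # Full XPath 1.0 (comparisons, variables) is available in lxml only.
        return [str(v) for v in
                root.xpath("entry[@id > $minimum]/@id", minimum=minimum)]
    # The stdlib supports only a limited path syntax, so filter in Python.
    return [e.get("id") for e in root.findall("entry")
            if int(e.get("id")) > minimum]
```

Code written against the shared subset (fromstring, findall, .text, .get)
should run unchanged on either implementation; only the features named in
this mail, such as full XPath, need the lxml branch.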


> Instead, let's hope Linux distros pick it up (and if anyone knows how
> to encourage that, let us know).

At least all Debian-based distros (such as Ubuntu) have it available. Not
the latest, greatest version, but that will come. That said, it's never
been a real problem to easy_install lxml directly from PyPI onto any decent
Linux distribution. Mac OS X is a far trickier target here, not to mention
Windows (C compiler, anyone?).

I would expect that even if lxml itself was in the stdlib, Linux
distributions would (want to) build it against their system libraries.
Static builds would only be required on Mac OS X and Windows.

Stefan


