Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?

Stefan Behnel stefan_ml at behnel.de
Tue Dec 30 08:26:37 EST 2008


Bruno Desthuilliers wrote:
>> However, what makes it really useful is that it does a good job of
>> handling the "broken" html that is so commonly found on the web.
> 
> BeautifulSoup ?
> http://pypi.python.org/pypi/BeautifulSoup/3.0.7a
> 
> possibly with ElementSoup ?
> http://pypi.python.org/pypi/ElementSoup/rev452

It's actually debatable whether BeautifulSoup is any better than
lxml/libxml2 at parsing broken HTML, since lxml tends to tidy things up
pretty well. The only major difference is in encoding detection, for which
you can also use a separate tool such as chardet:

http://chardet.feedparser.org/
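
A minimal sketch of that combination, assuming both lxml and chardet are
installed. The helper names (decode_html, extract_links) and the sample
markup are made up for illustration, and the stdlib fallbacks exist only so
the snippet still runs when either package is missing:

```python
from html.parser import HTMLParser

try:
    import chardet  # third-party encoding detector
except ImportError:  # assumption: use a fixed fallback encoding without it
    chardet = None

try:
    import lxml.html  # third-party parser built on libxml2
except ImportError:  # assumption: use the stdlib parser without it
    lxml = None


def decode_html(raw, fallback="utf-8"):
    """Decode raw bytes, letting chardet guess the encoding when available."""
    guess = chardet.detect(raw).get("encoding") if chardet else None
    try:
        return raw.decode(guess or fallback, errors="replace")
    except LookupError:  # chardet named a codec Python does not know
        return raw.decode(fallback, errors="replace")


class _HrefCollector(HTMLParser):
    """Stdlib stand-in, used only when lxml is missing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def extract_links(text):
    """Return every <a href> value, tolerating unclosed or misnested tags."""
    if lxml is not None:
        doc = lxml.html.fromstring(text)
        return [str(href) for href in doc.xpath("//a/@href")]
    collector = _HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs


# Deliberately broken markup: unclosed <p> and <a>, no closing </html>.
broken = b'<html><body><p>Unclosed <a href="/one">one<p><a href="/two">two</body>'
print(extract_links(decode_html(broken)))
```

Decoding first and parsing second keeps the two concerns separate, so the
encoding guess can be swapped out without touching the parsing step.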

Stefan


