HTML Parsing

Sun Jun 29 05:23:09 EDT 2008

Stefan Behnel <stefan_ml at behnel.de>:

> disappearedng at gmail.com wrote:
>> I am trying to build my own web crawler for an experiment and I don't
>> know how to use the HTTP protocol from Python.
>>
>> Also, are there any open-source HTML parsing engines available in
>> Python? That would be great.
> 
> Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
> BeautifulSoup and threadable, all of which should be helpful for your
> crawler.

You should mention its powerful features like XPath and CSS selector
support and its easy API here, too ;)
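
For a crawler, link extraction might look roughly like the sketch below.
It assumes lxml is installed; extract_links and the example.com URLs are
made-up names for illustration, not part of lxml. The input is
deliberately broken (unclosed tags) to show that the parser copes with it.

```python
import lxml.html


def extract_links(html_text, base_url):
    """Return the absolute link targets found in (possibly broken) HTML."""
    # fromstring() tolerates missing closing tags and similar breakage.
    doc = lxml.html.fromstring(html_text)
    # Resolve relative hrefs such as "/about" against the page URL.
    doc.make_links_absolute(base_url)
    # XPath query: the href attribute of every <a> element, in document order.
    return doc.xpath("//a/@href")


# Hypothetical page content with unclosed <p>, <div>, <body> and <html> tags.
broken = (
    "<html><body><p>Unclosed paragraph<p>"
    "<a href='/about'>About</a>"
    "<div><a href='http://example.org/x'>X</a>"
)
links = extract_links(broken, "http://example.com/")
print(links)
```

Fetching the page is left out here; urllib.request.urlopen from the
standard library is one option. The CSS counterpart of the XPath query
would be doc.cssselect("a"), which in recent lxml releases needs the
separate cssselect package.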

-- 
Freedom is always the freedom of dissenters.
                                      (Rosa Luxemburg)