HTMLParser fragility

Walter Dörwald walter at livinglogic.de
Thu Apr 6 09:50:00 EDT 2006


Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this 
>> is, there's a lot of malformed HTML out there. Real browsers have to be 
>> written to cope gracefully with this, but HTMLParser does not. 
> 
> There are two solutions to this:
> 
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
> 
> 2. Use something more forgiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.
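
As a rough sketch of the wrapper route, assuming the third-party lxml
package (one of the libxml2 wrappers) is installed. The sample markup and
the helper name extract_links are made up for illustration:

```python
# Sketch only: assumes lxml, a third-party wrapper around libxml2, is installed.
from lxml import html

# Deliberately malformed: unclosed <p>, <b> and <a> tags, unquoted attribute.
BROKEN = "<p>First<p>Second <b>bold <a href=page.html>link"


def extract_links(markup):
    """Return the href of every <a> element in possibly malformed HTML.

    extract_links is a hypothetical helper, not part of lxml.
    """
    # libxml2's HTML parser recovers from errors instead of raising.
    doc = html.fromstring(markup)
    return [a.get("href") for a in doc.iter("a")]


print(extract_links(BROKEN))
```

The parser builds a tree even though none of the tags are closed, so the
scraping code can walk the result without tidying the source first.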

Bye,
   Walter Dörwald



