HTMLParser fragility
Walter Dörwald
walter at livinglogic.de
Thu Apr 6 09:50:00 EDT 2006
Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this
>> is, there's a lot of malformed HTML out there. Real browsers have to be
>> written to cope gracefully with this, but HTMLParser does not.
>
> There are two solutions to this:
>
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
>
> 2. Use something more forgiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/
You can also use the HTML parser from libxml2 or any of the available
wrappers for it.
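
A minimal sketch of that approach, assuming lxml (one of the libxml2
wrappers) is installed. The helper name extract_links and the sample markup
are made up for illustration. If lxml is missing, the sketch falls back to
the standard library's html.parser, which in current Python 3 releases is
itself lenient about broken markup, so it still runs:

```python
from html.parser import HTMLParser

try:
    # lxml is a third-party wrapper around libxml2; assumed to be installed.
    from lxml import etree
except ImportError:
    etree = None


class _LinkCollector(HTMLParser):
    """Fallback: collect href values with the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the href of every <a> element, tolerating malformed markup."""
    if etree is not None:
        # etree.HTML() uses libxml2's HTML parser, which recovers from
        # errors such as unclosed or stray tags instead of raising.
        root = etree.HTML(html)
        return [a.get("href") for a in root.iter("a")
                if a.get("href") is not None]
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical input: unclosed <p>, <b> and <a>, plus a stray </i>.
broken = ('<p><a href="a.html">one</a><p>unclosed <b>bold '
          '<a href="b.html">two</i></body>')
print(extract_links(broken))
```

With lxml you get a complete element tree back, so the usual tree
navigation (iter(), xpath() and so on) works on the repaired document
rather than only on a stream of tag events.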
Bye,
Walter Dörwald