HTMLParser fragility

Rene Pijlman reply.in.the.newsgroup at my.address.is.invalid
Wed Apr 5 06:45:44 EDT 2006


Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this 
>is, there's a lot of malformed HTML out there. Real browsers have to be 
>written to cope gracefully with this, but HTMLParser does not. 

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
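
A minimal sketch of the second option, for illustration only. It assumes
the current bs4 packaging of BeautifulSoup (pip install beautifulsoup4),
which differs from the 2006 release linked above. If bs4 is not installed,
it falls back to the standard library's html.parser, which in current
Python 3 no longer raises on malformed markup. The names extract_links and
MESSY are made up for this example.

```python
# Collect link targets from malformed HTML without raising a parse error.
# Assumption: bs4 (the modern BeautifulSoup package) may or may not be
# installed; the fallback uses only the standard library.
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None

# Hypothetical input: unclosed <p>, mis-nested <b>, unquoted attribute,
# and no closing tags at the end of the document.
MESSY = (
    '<html><body><p>Unclosed paragraph'
    '<a href="http://example.com/a">first<b>bold</a>'
    '<a href=http://example.com/b>second'
)


class _LinkCollector(HTMLParser):
    """Fallback parser: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag, in document order."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)]
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for link in extract_links(MESSY):
        print(link)
```

Either path walks the broken markup to the end and returns both links
instead of stopping at the first malformed tag.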

-- 
René Pijlman


