HTMLParser fragility

John J. Lee jjl at pobox.com
Mon Apr 10 15:13:21 EDT 2006


"Lawrence D'Oliveiro" <ldo at geek-central.gen.new_zealand> writes:

> I've been using HTMLParser to scrape Web sites. The trouble with this 
> is, there's a lot of malformed HTML out there. Real browsers have to be 
> written to cope gracefully with this, but HTMLParser does not. Not only 
> does it raise an exception, but the parser object then gets into a 
> confused state after that so you cannot continue using it.
[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and adds heuristics of
its own for coping with badly nested markup, which gives it extra
robustness, of a sort.
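
For what it's worth, here is a minimal sketch of scraping links with the
sgmllib approach. The class name LinkCollector, the helper extract_links
and the sample markup are made up for illustration. It assumes Python 2
for sgmllib; where sgmllib is not available (it was removed in Python 3),
the sketch falls back to html.parser.HTMLParser purely as a stand-in so
that it still runs.

```python
# Sketch only: LinkCollector, extract_links and BROKEN are hypothetical names.
try:
    from sgmllib import SGMLParser as BaseParser  # Python 2
    HAVE_SGMLLIB = True
except ImportError:
    # Assumption: sgmllib is absent (Python 3), so use html.parser as a stand-in.
    from html.parser import HTMLParser as BaseParser
    HAVE_SGMLLIB = False


class LinkCollector(BaseParser):
    """Collect href values from <a> tags without caring about nesting."""

    def __init__(self):
        BaseParser.__init__(self)
        self.links = []

    def unknown_starttag(self, tag, attrs):
        # sgmllib calls this for every start tag that has no
        # start_<tag> or do_<tag> method defined on the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    if not HAVE_SGMLLIB:
        # html.parser reports start tags through handle_starttag instead.
        handle_starttag = unknown_starttag


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.links


# Deliberately malformed: unclosed <a> and <b>, plus a stray </i>.
BROKEN = '<p>unclosed <a href="a.html">one<b><a href=\'b.html\'>two</p></i>'
print(extract_links(BROKEN))
```

The point of the design is that the handler only looks at start tags, so
unclosed or misnested elements never matter to it.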


John



