HTMLParser fragility
John J. Lee
jjl at pobox.com
Mon Apr 10 15:13:21 EDT 2006
"Lawrence D'Oliveiro" <ldo at geek-central.gen.new_zealand> writes:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser object then gets into a
> confused state after that so you cannot continue using it.
[...]
sgmllib.SGMLParser (or htmllib.HTMLParser, which is built on it) is
more tolerant of malformed markup than HTMLParser.HTMLParser.
BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.
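
Whichever parser you pick, one way around the "confused state" problem
is to use a fresh parser object per document and keep whatever was
collected before a failure. A rough sketch only, assuming a current
Python 3, where the HTMLParser module is named html.parser and sgmllib
and htmllib are no longer in the standard library. LinkCollector,
scrape_links and the MALFORMED sample are hypothetical names made up
for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(documents):
    """Parse each document with its own parser object.

    If parsing one document raises, the links gathered so far from that
    document are kept and the parser is simply discarded, so a failure
    cannot leak into the next document.
    """
    links = []
    for text in documents:
        parser = LinkCollector()
        try:
            parser.feed(text)
            parser.close()
        except Exception:
            # Assumption: any parser error just means "stop here".
            pass
        links.extend(parser.links)
    return links


# Badly nested, unclosed and stray tags, as found in the wild.
MALFORMED = '<p><a href="/one">first<b>unclosed <a href="/two">second</p></i>'

print(scrape_links([MALFORMED]))
```

The point of the design is only that no parser object outlives the
document it was created for, so there is nothing to recover.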
John