Ask how to use HTMLParser

Sun Jan 10 03:16:28 EST 2010

On Fri, 08 Jan 2010 11:44:48 +0800, Water Lin wrote:

> I am a new guy to use Python, but I want to parse a html page now. I
> tried to use HTMLParse. Here is my sample code:
> ----------------------
> from HTMLParser import HTMLParser

Note that HTMLParser only tokenises HTML; it doesn't actually *parse* it.
You just get a stream of tag, text, entity, text, tag, ..., not a parse
tree.
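
To make that concrete, here is a minimal sketch that just records the callbacks in the order they arrive. It uses the Python 3 module name (html.parser); in Python 2 the import is the one quoted above. The TokenLogger class and its tokens attribute are names made up for this illustration, not part of the library.

```python
# Python 3 spelling; in Python 2 this was "from HTMLParser import HTMLParser".
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Hypothetical helper: records each callback as a (kind, value) pair."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


logger = TokenLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()  # flush any text still buffered

# The result is a flat sequence of events; nothing records that <b> is
# nested inside <p>.
for token in logger.tokens:
    print(token)
```

Any nesting has to be reconstructed by your own code, e.g. by keeping a stack of open tags.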

In particular, if an element has its start and/or end tags omitted, you
won't get any notification about the start and/or end of the element;
you have to figure that out yourself from the fact that you're getting a
tag which wouldn't be allowed outside or inside the element.

E.g. if the document omits </p> tags and you get a <p> tag while you
are (or *thought* that you were) already within a paragraph, you can
infer the omitted </p> tag.
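
A rough sketch of that inference, assuming the only case you care about is a <p> arriving while a paragraph is still open (in real HTML other block-level tags also end a paragraph, and this ignores them). ParagraphTracker and its attributes are hypothetical names for this example, and the import is the Python 3 one.

```python
from html.parser import HTMLParser


class ParagraphTracker(HTMLParser):
    """Hypothetical helper: collects paragraph text, inferring missing </p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.current = None  # list of text chunks while inside a <p>, else None

    def _close_paragraph(self):
        if self.current is not None:
            self.paragraphs.append("".join(self.current).strip())
            self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A <p> while already in a paragraph implies an omitted </p>.
            self._close_paragraph()
            self.current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def close(self):
        super().close()
        # End of input also ends any paragraph that is still open.
        self._close_paragraph()


tracker = ParagraphTracker()
tracker.feed("<p>First<p>Second</p><p>Third")
tracker.close()
print(tracker.paragraphs)
```

The same idea extends to other elements with optional end tags (<li>, <td>, and so on), but you have to encode the rules for each one yourself.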

If you want an actual parser, look at BeautifulSoup. It also does
a good job of handling invalid HTML (which seems to be far more
common than valid HTML).