intolerant HTML parser

Sat Feb 6 23:33:12 EST 2010

On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote:

> I generate some HTML and I want to include in my unit tests a check
> for syntax.  So I am looking for a program that will complain at any
> syntax irregularities.
> 
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax.  I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.

HTMLParser is a tokeniser, not a parser. It treats the data as a
stream of tokens (tags, entities, PCDATA, etc.); it doesn't know anything
about the HTML DTD. For all it knows, the above example could be perfectly
valid (the "b" element might allow both its start and end tags to be
omitted).
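
To see the tokeniser behaviour for yourself, here is a minimal sketch that
records the events HTMLParser reports for the example above. It assumes
Python 3, where the class lives in html.parser (in Python 2 it was
HTMLParser.HTMLParser); the names TokenLogger and tokenize are made up for
illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TokenLogger(HTMLParser):
    """Hypothetical helper: records each token in the order it is seen."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("data", data))


def tokenize(markup):
    """Feed markup to the tokeniser and return the recorded token list."""
    logger = TokenLogger()
    logger.feed(markup)
    logger.close()
    return logger.tokens


# The mis-nested tags come back as a flat stream and no error is raised.
for token in tokenize('<p>a<b>b</p></b>'):
    print(token)
```

The end tags arrive as ("end", "p") followed by ("end", "b"), exactly in
source order. Nothing in the class checks them against the open elements,
which is why no complaint is raised.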

Does the validation need to be done in Python? If not, you can use
"nsgmls" to validate any SGML document for which you have a DTD. OpenSP
includes nsgmls (some packages install it as "onsgmls") along with the
various HTML DTDs.
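
If you still want to drive it from a unit test, one option is to shell out
to the validator. The following is only a rough sketch under several
assumptions: the executable is on PATH as "onsgmls" or "nsgmls", the
document carries a DOCTYPE, and the SGML catalog for that DTD can be found
(for example via the SGML_CATALOG_FILES environment variable). The function
names are hypothetical, and treating every line on stderr as a complaint is
my simplification, since warnings are reported there too.

```python
import shutil
import subprocess


def find_validator(names=("onsgmls", "nsgmls")):
    """Return the path of the first validator found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_command(executable, html_path):
    """Build the argument list; -s suppresses the normal parse output,
    so only diagnostics are printed."""
    return [executable, "-s", html_path]


def validation_messages(html_path):
    """Run the validator on a file and return its diagnostic lines.

    Assumption: an empty list means the validator had nothing to report.
    """
    executable = find_validator()
    if executable is None:
        raise RuntimeError("no onsgmls/nsgmls executable found on PATH")
    result = subprocess.run(
        build_command(executable, html_path),
        capture_output=True,
        text=True,
    )
    return [line for line in result.stderr.splitlines() if line.strip()]
```

A test could then write the generated HTML to a temporary file and assert
that validation_messages(path) returns an empty list.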