intolerant HTML parser

Nobody nobody at nowhere.com
Sat Feb 6 23:33:12 EST 2010


On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote:

> I generate some HTML and I want to include in my unit tests a check
> for syntax.  So I am looking for a program that will complain at any
> syntax irregularities.
> 
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax.  I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.

HTMLParser is a tokeniser, not a parser. It treats the data as a
stream of tokens (tags, entities, PCDATA, etc); it doesn't know anything
about the HTML DTD. For all it knows, the above example could be perfectly
valid (the "b" element might allow both its start and end tags to be
omitted).
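
To see the tokeniser behaviour for yourself, here is a minimal sketch. It assumes Python 3, where the module lives at html.parser rather than HTMLParser; the TokenLogger class and the tokenize helper are names made up for this illustration. It records the events the tokeniser emits for the example above and shows that no error is raised.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Hypothetical helper: records each token event as a tuple."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("data", data))


def tokenize(markup):
    """Feed markup to the tokeniser and return the recorded events."""
    logger = TokenLogger()
    logger.feed(markup)
    logger.close()
    return logger.tokens


if __name__ == "__main__":
    # The mis-nested example is reported as a flat stream of tokens;
    # nothing checks that </p> arrives before </b>.
    for token in tokenize("<p>a<b>b</p></b>"):
        print(token)
```

The end tags come out in the order they appear in the input, because nothing in the tokeniser tracks which elements are open.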

Does the validation need to be done in Python? If not, you can use
"nsgmls" to validate any SGML document for which you have a DTD. OpenSP
includes nsgmls along with the various HTML DTDs.
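
If you still want to drive the check from a Python unit test, one option is to shell out to the validator. The following is only a sketch under several assumptions: the executable is on PATH as either "nsgmls" or "onsgmls" (the name varies between installations), the document carries a DOCTYPE that the validator can resolve through its catalog, and errors are written to stderr. The build_command and validate helpers are hypothetical names. The -s option suppresses the normal output so only error messages remain, and -c names a catalog file.

```python
import shutil
import subprocess


def build_command(path, executable="nsgmls", catalog=None):
    """Build the argument list for a validation-only run."""
    cmd = [executable, "-s"]      # -s: suppress normal output, keep errors
    if catalog is not None:
        cmd += ["-c", catalog]    # -c: catalog mapping public ids to DTD files
    cmd.append(path)
    return cmd


def validate(path, catalog=None):
    """Return the validator's error lines, or None if it is not installed.

    Assumption: messages go to stderr, so an empty list means
    that no problems were reported.
    """
    executable = shutil.which("nsgmls") or shutil.which("onsgmls")
    if executable is None:
        return None
    result = subprocess.run(
        build_command(path, executable, catalog),
        capture_output=True,
        text=True,
    )
    return result.stderr.splitlines()
```

In a test you would write the generated HTML to a temporary file, call validate() on it, and assert that the returned list is empty.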