Validating HTML from Python

Martin von Loewis loewis at informatik.hu-berlin.de
Wed Sep 26 12:39:55 EDT 2001


Magnus Lycka <magnus at thinkware.se> writes:

> Primarily, I just need something that complains unless a
> string is a correct HTML 4.0 document. It seems that it should
> be possible to do this in a simpler way than by writing a
> script that uploads the string to w3c's validator and
> then parses the resulting file for error messages...

I recommend using nsgmls. To make this work, you need a doctype
declaration with a proper public identifier, but I guess this is part
of any valid HTML document anyway. You also need the DTD, and it must
be registered in the catalog. On a typical Linux system with nsgmls
installed, it will find the DTD automatically.

nsgmls will output its error messages in an easy-to-parse format on
stderr.
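
As a rough illustration of what that could look like from Python, here
is a minimal sketch. It assumes that nsgmls is on the PATH, that it
reads the document from standard input when no file is named, that
"-s" suppresses the normal parse output, and that each message has the
shape prog:file:line:col:type: message. The names SgmlError,
parse_nsgmls_errors and validate_html are made up for this example;
check the assumptions against your own nsgmls installation.

```python
import re
import subprocess
from typing import List, NamedTuple


class SgmlError(NamedTuple):
    """One diagnostic line reported by nsgmls (hypothetical helper type)."""
    file: str
    line: int
    column: int
    kind: str      # single-letter message type such as "E"; "" if absent
    message: str


# Assumed message shape: prog:file:line:col:type: message
_ERROR_RE = re.compile(
    r"^[^:]+:(?P<file>.*?):(?P<line>\d+):(?P<col>\d+):"
    r"(?:(?P<kind>[A-Z]):)?\s*(?P<msg>.*)$"
)


def parse_nsgmls_errors(stderr_text: str) -> List[SgmlError]:
    """Turn nsgmls stderr output into a list of SgmlError records.

    Lines that do not match the assumed shape are skipped.
    """
    errors = []
    for raw in stderr_text.splitlines():
        m = _ERROR_RE.match(raw)
        if m is None:
            continue
        errors.append(
            SgmlError(
                file=m.group("file"),
                line=int(m.group("line")),
                column=int(m.group("col")),
                kind=m.group("kind") or "",
                message=m.group("msg"),
            )
        )
    return errors


def validate_html(document: str, nsgmls: str = "nsgmls") -> List[SgmlError]:
    """Run nsgmls on a document string and return the reported problems.

    Assumes nsgmls reads standard input when no file argument is given
    and that "-s" suppresses the normal output, leaving only messages.
    """
    proc = subprocess.run(
        [nsgmls, "-s"],
        input=document,
        capture_output=True,
        text=True,
    )
    return parse_nsgmls_errors(proc.stderr)
```

An empty list from validate_html(some_string) would then mean that
nsgmls reported nothing in the assumed format, which is not quite the
same as "valid": a missing DTD or catalog entry may produce messages
of a different shape, so it is worth looking at the raw stderr while
setting things up.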

> Secondly, I suppose it would be useful to use SAX etc
> to check that the content of the files follow some of
> my expectations, but that's not my primary concern.

If you install PyXML, you get xmlproc, which is a validating XML
parser. Unfortunately, it won't validate arbitrary SGML, so it could
be used only to validate XHTML.

> This doesn't have to be a "pure python" solution as
> long as it's simple to access from python, and works
> on Windows and Linux.

I think getting an nsgmls binary for Windows is feasible, although
configuring the catalogs may be tricky.

Regards,
Martin



