[Python-Dev] htmllib vs. HTMLParser

Guido van Rossum guido at python.org
Mon Oct 27 11:52:53 EST 2003


> Over in the Web SIG, it was noted that the HTML parser in htmllib has
> handlers for HTML 2.0 elements, and it should really support HTML 4.01, the
> current version.  I'm looking into doing this.
> 
> We actually have two HTML parsers: htmllib.py and the more recent
> HTMLParser.py.  The initial check-in comment for 2001/05/18 for
> HTMLParser.py reads:
> 
>       A much improved HTML parser -- a replacement for sgmllib.  The API is
>       derived from but not quite compatible with that of sgmllib, so it's a
>       new file.  I suppose it needs documentation, and htmllib needs to be
>       changed to use this instead of sgmllib, and sgmllib needs to be
>       declared obsolete.  But that can all be done later.
> 
> sgmllib only handles those bits of SGML needed for HTML, and anyone doing
> serious SGML work is going to have to use a real SGML parser, so deprecating 
> sgmllib is reasonable.  HTMLParser needs no changes for HTML 4.01; only
> htmllib needs to get a bunch more handler methods.
> 
> Should I try to do this for 2.4?

I'm unclear on what you plan to do -- repeal sgmllib and rewrite
htmllib to use HTMLParser internally for a backwards-compatible
interface?
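
For illustration only, here is a rough sketch of what such a
compatibility layer could look like. It is not the actual htmllib
design. It assumes Python 3's html.parser module (the successor of
HTMLParser.py), and the names CompatHTMLParser, LinkCollector and
collect are hypothetical. The idea is that HTMLParser's generic
handle_starttag/handle_endtag events get routed to the per-element
start_<tag>, end_<tag> and do_<tag> methods that sgmllib-style
subclasses define.

```python
from html.parser import HTMLParser


class CompatHTMLParser(HTMLParser):
    """Hypothetical shim: routes HTMLParser events to sgmllib-style methods."""

    def __init__(self):
        super().__init__()
        self.events = []  # record of tags that had no dedicated handler

    def handle_starttag(self, tag, attrs):
        # sgmllib convention: start_<tag>(attrs) for container elements,
        # do_<tag>(attrs) for elements without an end tag.
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_start", tag))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_end", tag))


class LinkCollector(CompatHTMLParser):
    """Example subclass written in the old per-element handler style."""

    def __init__(self):
        super().__init__()
        self.links = []

    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

    def end_a(self):
        pass

    def do_br(self, attrs):
        self.events.append(("br",))


def collect(html):
    """Parse html and return (links, events) from a LinkCollector."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.events
```

Existing subclasses would keep their start_a/end_a style methods
unchanged, and supporting more HTML 4.01 elements would only mean
adding more such methods.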

> (I can't find an explanation of how the API differs between the two modules
> but can figure it out by inspecting the code, and will try to keep the
> htmllib module backward-compatible.)

That would be required for a few releases, yes.

I'm okay with deprecating sgmllib faster than htmllib.

--Guido van Rossum (home page: http://www.python.org/~guido/)
