[Python-Dev] HTMLParser and HTML5

Fri Jul 29 23:35:07 CEST 2011

On Fri, 29 Jul 2011 13:34:13 -0700
Brett Cannon <brett at python.org> wrote:
> On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
> 
> > On Jul 29, 2011, at 3:00 PM, Matt wrote:
> >
> > > I don't see any real reason to drop a decent piece of code (HTMLParser,
> > > that is) in favor of a third party library when only relatively minor
> > > updates are needed to bring it up to speed with the latest spec.
> >
> >
> > I am not really one to throw stones here, as Twisted contains a lenient
> > pseudo-XML parser which I still maintain - one which decidedly does *not*
> > agree with HTML5's requirements for dealing with invalid data, but instead
> > follows a bunch of ad-hoc guesses of my own.
> >
> > My impression of HTML5 is that HTMLParser would require significant
> > modifications and possibly a drastic re-architecture in order to really do
> > HTML5 "right"; especially the parts that the html5lib authors claim make
> > HTML5 streaming-unfriendly, i.e. subtree reordering when encountering
> > certain types of invalid data.
> >
> 
> We could also have the code live side-by-side for a while (or indefinitely
> if that was really desired), either by bringing html5lib in as a separate
> module or by having the relevant classes live in htmllib under different names.

Unless html5lib is better in some fundamental ways which are difficult
to fix in htmllib, I'm not sure there's any point in adding it to the
stdlib.

We don't really do users a service if we keep adding alternative APIs
for common functionality.

Regards

Antoine.