[Python-Dev] HTMLParser and HTML5

Fri Jul 29 21:00:01 CEST 2011

On Fri, Jul 29, 2011 at 11:03 AM, Glyph Lefkowitz
<glyph at twistedmatrix.com> wrote:

>
> On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:
>
> > Joao S. O. Bueno, 29.07.2011 13:22:
> >> On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
> >>> Brett Cannon, 28.07.2011 23:49:
> >>>>
> >>>> On Thu, Jul 28, 2011 at 11:25, Matt wrote:
> >>>>>
> >>>>> - What policies are in place for keeping parity with other HTML
> >>>>> parsers (such as those in web browsers)?
> >>>>
> >>>> There aren't any beyond "it would be nice".
> >>>> [...]
> >>>> It's more of an issue of someone caring enough to do the coding work to
> >>>> bring the parser up to spec for HTML5 (or introduce new code to live
> >>>> beside the HTML4 parsing code).
> >>>
> >>> Which, given that html5lib readily exists, would likely be a lot more work
> >>> than anyone who is interested in HTML5 handling would want to invest.
> >>>
> >>> I don't think we need a new HTML5 parsing implementation only to have it in
> >>> the stdlib. That's the old sunny Java way of doing it.
> >>
> >> I disagree.
> >> Having proper html parsing out of the box is part of the "batteries
> >> included" thing.
> >
> > Well, you can easily prove me wrong by implementing this.
>

As far as the issue described in my initial message goes, there is a patch
and tests for the patch.

>
> Please don't implement this just to prove Stefan wrong :).
>
> The thing to do, if you want html parsing in the stdlib, is to
> _incorporate_ html5lib, which is already a perfectly good, thoroughly tested
> HTML parser, and simply deprecate HTMLParser and friends.  Implementing a
> new parser would serve no purpose I can see.
>

I don't see any real reason to drop a decent piece of code (HTMLParser, that
is) in favor of a third party library when only relatively minor updates are
needed to bring it up to speed with the latest spec. As far as structure
goes, HTML4 and HTML5 are practically identical. The differences between the
two that are applicable to HTMLParser involve the way the specs deal with
special element types and broken syntax. For what it's worth, the rules
HTML4 does define are in many cases ignored by browsers in favor of more
lenient rules in the spirit of Postel's Law. HTML5 simply standardized what
browsers actually do.

Deprecating HTMLParser in favor of a newer/better/faster HTML library is a
bad thing for everybody that's already using HTMLParser, whether directly or
indirectly. html5lib does not have an interface compatible with HTMLParser,
so code would largely need to be rewritten from scratch to gain the benefits
of HTML5's support for broken code. Developers using HTMLParser would be
permanently stuck using a library that throws exceptions for perfectly valid
HTML. Keep in mind that these are solved problems: all of the thinking on
how to handle broken code has been done for us by the folks at the WHATWG.
It's simply a matter of updating our existing code with these new rules.
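
To make the interface point concrete, here is a minimal sketch of the kind of
code that is written against HTMLParser: a subclass that overrides the
handle_* callbacks. The class name LinkCollector and the helper collect_links
are made up for illustration, and the import uses the Python 3 module path
(html.parser). How the parser reacts to malformed markup depends on the
Python version in use, so treat the sample input as an assumption rather
than a guarantee.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example subclass: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Feed markup to a fresh LinkCollector and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Neither <a> nor <p> is ever closed in this input.
    print(collect_links('<a href="/one">one<p><a href="/two">two'))
```

Code like this depends on the callback names and on the shape of attrs, and
that dependency is what would have to be rewritten when moving to a library
with a different interface.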

While I agree that there are merits to dropping support for the old code,
doing so does not solve the problems that folks are having right now (namely
incorrect parser output or exceptions). It would be better to patch the
obvious issues stemming from HTML4 support now, and to leave anything that
goes beyond parity with browsers for later or implement it as an opt-in
feature (i.e., enabled by a parameter).
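
As a rough illustration of what "opt-in by parameter" could look like, here
is a sketch built on top of HTMLParser. Everything specific in it is an
assumption: the class name CheckedParser, the lenient keyword, and the
end-tag check itself are invented for the example and are not part of
HTMLParser or of the patch mentioned above. The default keeps the stricter
behavior, so existing callers see no change unless they ask for it.

```python
from html.parser import HTMLParser


class CheckedParser(HTMLParser):
    """Hypothetical wrapper; `lenient` is NOT a real HTMLParser parameter."""

    def __init__(self, lenient=False):
        super().__init__()
        self.lenient = lenient   # opt-in switch, off by default
        self.open_tags = []      # start tags seen and not yet closed
        self.ignored = []        # stray end tags skipped in lenient mode

    def handle_starttag(self, tag, attrs):
        # Simplification: void elements such as <br> are not special-cased.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        elif self.lenient:
            # New behavior, only reached when the caller opted in.
            self.ignored.append(tag)
        else:
            # Default behavior: complain about the mismatch.
            raise ValueError("unexpected end tag: </%s>" % tag)
```

With this shape, CheckedParser().feed("<b>x</i>") raises ValueError, while
CheckedParser(lenient=True).feed("<b>x</i>") records the stray tag and
carries on.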

Matt