[Python-Dev] Changes in html.parser may cause breakage in client code

Ezio Melotti ezio.melotti at gmail.com
Fri Apr 27 07:23:13 CEST 2012


Hi,

On 26/04/2012 22.10, Vinay Sajip wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
>
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).

html.parser doesn't use any private _name, so I was considering only the 
documented names to be part of the public API.  Several methods are marked 
with an "# internal" comment, but that's not visible unless you read 
the source code.

> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>
>      import html.parser as Parser
>
>      data = '<select name="stuff">'
>
>      m = Parser.tagfind.match(data, 1)
>      print('%r ->  %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>
> gives different results on 3.2 and 3.3:
>
>      $ python3.2 tagfind.py
>      '[a-zA-Z][-.a-zA-Z0-9:_]*' ->  'select'
>      $ python3.3 tagfind.py
>      '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' ->  'select '
>
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
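
For reference, the difference can be reproduced without touching 
html.parser at all.  This is only a sketch: the two patterns are copied 
from the output quoted above, and the names tagfind_32/tagfind_33 are 
made up for the illustration.

```python
import re

# Patterns copied from the quoted output; the variable names are
# hypothetical and do not exist in html.parser.
tagfind_32 = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
tagfind_33 = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

m32 = tagfind_32.match(data, 1)
m33 = tagfind_33.match(data, 1)

# The old pattern stops right after the tag name.
old_name = data[1:m32.end()]

# The new pattern also consumes whitespace after the name, so slicing
# up to m.end() picks up the trailing space.
new_whole = data[1:m33.end()]

# The tag name itself is available as group 1 of the new pattern.
new_name = m33.group(1)

print(repr(old_name), repr(new_whole), repr(new_name))
```

Code that slices up to m.end() sees the extra whitespace; code that 
reads group 1 does not.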

Django shouldn't override parse_starttag (internal and undocumented), 
but just use handle_starttag (public and documented).
I see two possible reasons why it's overriding parse_starttag:
  1) Django is working around an HTMLParser bug.  In this case the bug 
may since have been fixed (breaking the now-useless workaround), and you 
may be able to use the original parse_starttag and get the correct 
result.  If it is indeed working around a bug and the bug is still 
present, you should report it upstream.
  2) Django is implementing an additional feature.  Depending on what 
exactly the code is doing you might want to open a new feature request 
on the bug tracker. For example the original parse_starttag sets a 
self.lasttag attribute with the correct name of the last tag parsed.  
Note however that both parse_starttag and self.lasttag are internal and 
shouldn't be used directly (but lasttag could be exposed and documented 
if people really think that it's useful).
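
As a rough sketch of what I mean by using only the documented API, a 
subclass can collect tags through handle_starttag and handle_endtag 
without overriding parse_starttag or touching tagfind.  The class name 
TagCollector and its attributes are made up for this example; I don't 
know what Django's subclass actually needs, so this may not cover it.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that relies only on documented handlers."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag, attrs) tuples
        self.end_tags = []    # list of tag names

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name, attrs a list of (name, value) pairs.
        self.start_tags.append((tag, attrs))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


parser = TagCollector()
parser.feed('<select name="stuff"><option value="1">One</option></select>')
parser.close()

print(parser.start_tags)
print(parser.end_tags)
```

The tag names passed to the handlers carry no trailing whitespace, so 
start and end tags can be compared directly.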

> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?

I'm not sure that reverting the regex, deprecating all the exposed 
internal names, and adding/using internal _names instead is a good idea at 
this point.  This would cause more breakage, and it would require an 
extensive renaming.  I can add notes to the documentation/docstrings to 
specify what's private and what's not, though.
OTOH, if this specific fix is not released yet I can still do something 
to limit/avoid the breakage.

Best Regards,
Ezio Melotti

> Regards,
>
> Vinay Sajip
>


