[issue7311] Bug on regexp of HTMLParser

Tue Apr 5 20:51:55 CEST 2011

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

With 3.2 the situation is more complicated because there is a strict and a non-strict mode.
The strict mode uses:
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

and the tolerant mode uses:
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

This means that the strict mode doesn't allow valid non-ASCII chars, and that tolerant mode is a little too permissive.
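To make the difference concrete, here is a small sketch that runs both regexes quoted above against unquoted attribute values. The sample strings and variable names are made up for illustration and are not taken from the patch or the test suite.

```python
import re

# The two regexes exactly as quoted above (3.2, before the patch).
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Hypothetical input: an unquoted value ending in a non-ASCII char (e with acute).
non_ascii = ' title=caf\u00e9>'
strict_value = attrfind.match(non_ascii).group(3)
tolerant_value = attrfind_tolerant.match(non_ascii).group(3)
print(ascii(strict_value))    # strict stops before the non-ASCII char
print(ascii(tolerant_value))  # tolerant keeps the whole value

# Hypothetical input: an unquoted value containing stray quote and '<' chars.
messy = ' href=a"b<c>'
strict_messy = attrfind.match(messy).group(3)
tolerant_messy = attrfind_tolerant.match(messy).group(3)
print(strict_messy)    # strict stops at the first char outside its class
print(tolerant_messy)  # tolerant accepts anything up to '>' or whitespace
```

The strict regex silently truncates the first value, while the tolerant one accepts everything in the second up to the closing '>'.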

The attached patch changes the strict regex to be more permissive and leaves the tolerant regex unchanged. The differences between the two are now so small that the tolerant version could be removed, except that re.search is used instead of re.match when the tolerant regex is used.
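The search/match distinction matters because match only succeeds at the given position, whereas search skips ahead to the first place where the pattern fits. A minimal sketch, using the tolerant regex quoted above and a made-up input string (this is not how HTMLParser itself calls the regex, just the two re calls side by side):

```python
import re

attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Hypothetical input: a junk character sits in front of the attribute.
text = '* id=main'

matched = attrfind_tolerant.match(text)   # anchored at position 0
found = attrfind_tolerant.search(text)    # scans forward past the junk

print(matched)                            # match fails on the leading '*'
print(found.group(1), found.group(3))     # search still finds the attribute
```

So even with identical patterns, switching between the two calls changes which inputs are accepted.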

----------
nosy: +r.david.murray
Added file: http://bugs.python.org/file21545/issue7311-3.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7311>
_______________________________________