[issue41748] HTMLParser: parsing error
STINNER Victor
report at bugs.python.org
Wed Sep 9 10:12:10 EDT 2020
STINNER Victor <vstinner at python.org> added the comment:
HTMLParser.check_for_whole_start_tag() uses the locatestarttagend_tolerant regular expression to find the end of the start tag. This regex cuts the string at the first comma (",") when the comma is immediately followed by an attribute name, instead of treating the comma as the first character of that attribute name:
* '<div id="test" , color="blue">' => '<div id="test" , color="blue"': OK!
* '<div id="test" ,color="blue">' => '<div id="test" ,' => BUG
The regex is quite complex:
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')
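The two cases above can be reproduced without going through HTMLParser. This is a minimal sketch: the pattern is copied from this message rather than imported from html.parser (other Python versions may ship a different pattern), and the helper name start_tag_prefix is only illustrative.

```python
import re

# Pattern copied verbatim from the message above (not imported from html.parser).
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)


def start_tag_prefix(text):
    """Return the part of text that the pattern accepts as the start tag (hypothetical helper)."""
    match = locatestarttagend_tolerant.match(text)
    return match.group() if match else None


for tag in ('<div id="test" , color="blue">', '<div id="test" ,color="blue">'):
    print(repr(tag), "=>", repr(start_tag_prefix(tag)))
```

In the second case the match stops right after the comma, so the character that follows is "c" rather than the closing ">".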
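The problem is this part of the regex:
(?:\s*,)* # possibly followed by a comma
It consumes the comma, so the comma is not seen as part of the next attribute name. The next attribute name cannot start right after the comma either, because its lookbehind requires a quote, whitespace or "/" as the preceding character.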
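To illustrate the mechanism, the two sub-patterns involved can be tried in isolation on the failing input. This is only a sketch: the fragments are copied from the regex above, while the constant names TRAILING_COMMAS and ATTR_NAME are made up for the example.

```python
import re

# Two fragments copied from the pattern above; the constant names are illustrative.
TRAILING_COMMAS = re.compile(r"(?:\s*,)*")
ATTR_NAME = re.compile(r"""(?<=['"\s/])[^\s/>][^\s/=>]*""")

tag = '<div id="test" ,color="blue">'
after_value = tag.index('"test"') + len('"test"')

# The comma group greedily consumes the whitespace and the comma after "test".
consumed = TRAILING_COMMAS.match(tag, after_value)

# An attribute name starting right after the comma is rejected:
# the lookbehind sees ',' instead of a quote, whitespace or '/'.
name_after_comma = ATTR_NAME.match(tag, consumed.end())

# Starting at the comma itself works, because the preceding character is a space.
name_at_comma = ATTR_NAME.match(tag, tag.index(","))

print(repr(consumed.group()), name_after_comma, repr(name_at_comma.group()))
```

Since the rest of the pattern can still succeed with zero further attributes, the overall match ends after the comma instead of backtracking to give the comma to the attribute name.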
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41748>
_______________________________________