[issue1486713] HTMLParser : A auto-tolerant parsing mode
kxroberto
report at bugs.python.org
Wed Nov 16 11:16:52 CET 2011
kxroberto <kxroberto at users.sourceforge.net> added the comment:
The old patch already warned about the majority of real cases, except for missing whitespace between attributes.
"The tolerant regex will match both":
locatestarttagend_tolerant: The main and frequent issue on the web here is missing whitespace between attributes (with enclosed values). There is also the newly tolerated comma between attributes, which I have not seen anywhere so far (the old warning mechanism and attrfind.match would already have raised it as a "junk chars ..." event).
Both issues can easily be warned about (also/already) at almost no cost by the slightly extended regex below (when the 2 new capturing, i.e. non-pseudo, regex groups are compared against None in check_for_whole_start_tag).
Alternatively, missing whitespace could be warned about (possibly multiple times) at attrfind time.
attrfind_tolerant: I see no point in the old "strict" attrfind (my guess is that the difference affects 0.000% of real cases). attrfind_tolerant could become the only attrfind.
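As a rough illustration of warning "at attrfind time", the sketch below walks the attribute part of a start tag with attrfind_tolerant and flags every attribute name that is not preceded by whitespace. The helper name parse_attrs and the warning text are made up for this example and are not part of HTMLParser; the sketch also assumes the caller has already isolated the text between the tag name and the closing '>'. Because finditer uses search semantics, junk such as a comma between attributes is skipped silently here.

```python
import re

# Copied from this comment: the tolerant attribute matcher.
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def parse_attrs(attr_text):
    """Hypothetical helper: return (attrs, warnings) for the text that
    sits between the tag name and the closing '>' of a start tag."""
    attrs, warnings = [], []
    for m in attrfind_tolerant.finditer(attr_text):
        name, value = m.group(1), m.group(3)
        # The leading \s* consumed nothing, so no whitespace separates
        # this attribute from whatever came before it.
        if m.start(1) == m.start():
            warnings.append('space missing before attribute %r' % name)
        # Strip the surrounding quotes from enclosed values.
        if value and len(value) >= 2 and value[0] in '\'"' \
                and value[0] == value[-1]:
            value = value[1:-1]
        attrs.append((name, value))
    return attrs, warnings


attrs, warnings = parse_attrs(' a="b,+"c="d"e=f')
print(attrs)
print(warnings)
```

With this approach the warning fires once per offending attribute, which matches the "multiple times" remark above.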
--
import re

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                 # possibly followed by a comma
       )?
    )
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
attrfind_tolerant = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')
#s='<abc a="b,+"c="d"e=f>text'
#s='<abc a="b,+" c="d"e=f>text'
s='<abc a="b,+",c="d" e=f>text'
m = locatestarttagend_tolerant.search(s)
print(m.group())
print(m.groups())
#if m.group(1) is not None: self.warning('space missing ...
#if m.group(2) is not None: self.warning('comma between attr...
m = attrfind_tolerant.search(s, 5)
print(m.group())
print(m.groups())
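The two commented-out checks above could be wrapped up roughly as follows. This is only a sketch of the idea, not the actual check_for_whole_start_tag code: the function name start_tag_warnings and the warning strings are invented for the example, and it returns a list instead of calling self.warning. It relies on the two capturing groups staying None unless the "no whitespace" alternative or the comma actually took part in the match.

```python
import re

# Same pattern as in this comment; only the two capturing groups matter here.
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # group 1: used when whitespace is missing
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                 # group 2: comma after a value
       )?
    )
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)


def start_tag_warnings(s):
    """Hypothetical helper: list the tolerated irregularities in the
    start tag at the beginning of s."""
    m = locatestarttagend_tolerant.match(s)
    if m is None:
        return []
    warnings = []
    if m.group(1) is not None:
        warnings.append('space missing between attributes')
    if m.group(2) is not None:
        warnings.append('comma between attributes')
    return warnings


for tag in ('<abc a="b" c="d" e=f>text',
            '<abc a="b,+"c="d"e=f>text',
            '<abc a="b", c="d">text',
            '<abc a="b,+",c="d" e=f>text'):
    print(tag, start_tag_warnings(tag))
```

Note that each warning is reported at most once per start tag with this approach, since a group only remembers its last capture.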
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue1486713>
_______________________________________