[issue39833] Bug in html parsing module triggered by malformed input
Evan
report at bugs.python.org
Mon Mar 2 21:16:06 EST 2020
New submission from Evan <ep5880a at student.american.edu>:
Relevant standard library module: C:\Users\User\AppData\Local\Programs\Python\Python38\lib\_markupbase.py
The issue- After parsing over 900 SEC filings using beautifulsoup4, I get this user warning.
UserWarning: unknown status keyword 'ERF' in marked section
warnings.warn(msg)
Followed by a traceback
....
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\__init__.py", line 325, in __init__
    self._feed()
....
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python38\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment
It's probably due to malformed input from one of the docs.
144 lines into _markupbase lib we have:
# Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
def parse_marked_section(self, i, report=1):
    rawdata= self.rawdata
    assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
    sectName, j = self._scan_name( i+3, i )
    if j < 0:
        return j
    if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
        # look for standard ]]> ending
        match= _markedsectionclose.search(rawdata, i+3)
    elif sectName in {"if", "else", "endif"}:
        # look for MS Office ]> ending
        match= _msmarkedsectionclose.search(rawdata, i+3)
    else:
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
    if not match:
        return -1
    if report:
        j = match.start(0)
        self.unknown_decl(rawdata[i+3: j])
    return match.end(0)
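Judging by the UserWarning above, the error() call in the else branch only issues a warning and returns here, so execution falls through to `if not match` with `match` never assigned. `match` should be set to None in that else branch, right before `if not match`.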
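To illustrate, here is a minimal standalone sketch of the same control flow. It is not the actual _markupbase code: the function names, the two regex patterns, and the use of warnings.warn in place of self.error() are all assumptions made for the example. It compares the current behaviour with the suggested change.

```python
import re
import warnings

# Hypothetical stand-ins for the closing patterns used by _markupbase;
# the names and exact regexes are assumptions for this sketch.
STANDARD_CLOSE = re.compile(r']\s*]\s*>')   # matches a "]]>" ending
MS_CLOSE = re.compile(r']\s*>')             # matches a "]>" ending


def find_section_end_buggy(keyword, rawdata, start):
    """Mirrors the reported flow: the else branch never binds `match`."""
    if keyword in {"temp", "cdata", "ignore", "include", "rcdata"}:
        match = STANDARD_CLOSE.search(rawdata, start)
    elif keyword in {"if", "else", "endif"}:
        match = MS_CLOSE.search(rawdata, start)
    else:
        # Stand-in for an error() hook that warns instead of raising.
        warnings.warn('unknown status keyword %r in marked section' % keyword)
    if not match:  # UnboundLocalError when the else branch ran
        return -1
    return match.end(0)


def find_section_end_fixed(keyword, rawdata, start):
    """Same flow with the suggested change: bind `match` to None in else."""
    if keyword in {"temp", "cdata", "ignore", "include", "rcdata"}:
        match = STANDARD_CLOSE.search(rawdata, start)
    elif keyword in {"if", "else", "endif"}:
        match = MS_CLOSE.search(rawdata, start)
    else:
        warnings.warn('unknown status keyword %r in marked section' % keyword)
        match = None
    if not match:
        return -1
    return match.end(0)
```

With an unknown keyword such as 'ERF', the first function raises UnboundLocalError after the warning, while the second returns -1 like any other unterminated section.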
----------
components: Library (Lib)
messages: 363234
nosy: SanJacintoJoe
priority: normal
severity: normal
status: open
title: Bug in html parsing module triggered by malformed input
type: compile error
versions: Python 3.8
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue39833>
_______________________________________