[ python-Bugs-1051840 ] HTMLParser doesn't treat endtags in <script> tags as CDATA
SourceForge.net
noreply at sourceforge.net
Fri Oct 22 01:02:04 CEST 2004
Bugs item #1051840, was opened at 2004-10-21 16:02
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1051840&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
Assigned to: Nobody/Anonymous (nobody)
Summary: HTMLParser doesn't treat endtags in <script> tags as CDATA
Initial Comment:
HTMLParser.HTMLParser in Python 2.3.4 calls
self.handle_endtag() for end tags within script and
style sections, which it should not, because the
content is supposed to be CDATA, as defined in
CDATA_CONTENT_ELEMENTS within HTMLParser. The following
script will demonstrate this problem:
import HTMLParser
class MyHandler(HTMLParser.HTMLParser):
    tags = []
    def handle_starttag(self, tag, attr):
        self.tags.append(tag)
    def handle_endtag(self, tag):
        if tag != self.tags[-1]:
            # this should never happen in a well-formed document
            raise ("Not well-formed, endtag '" + tag +
                   "' doesn't match starttag '" + self.lasttag + "'")
        self.tags.pop(-1)
s = """
<html>
<body>
This page is completely well formed
<script language="javascript">
alert("</a></a>");
</script>
blah blah
</body>
</html>
"""
m = MyHandler()
m.feed(s)
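Under Python 2.3.4 this raises the "Not well-formed" exception, because
handle_endtag() is called for the first </a> inside the script string,
which does not match the open <script> tag. I fixed the bug by changing
the parse_endtag function on line 318 of HTMLParser.py to the following: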
def parse_endtag(self, i):
    rawdata = self.rawdata
    assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
    match = endendtag.search(rawdata, i+1) # >
    if not match:
        return -1
    j = match.end()
    match = endtagfind.match(rawdata, i) # </ + tag + >
    if not match:
        self.error("bad end tag: %s" % `rawdata[i:j]`)
    tag = match.group(1)
    # START BUGFIX
    if self.interesting == interesting_cdata:
        # we're in one of the CDATA_CONTENT_ELEMENTS
        if tag == self.lasttag and tag in self.CDATA_CONTENT_ELEMENTS:
            # it's the end of the CDATA_CONTENT_ELEMENTS tag we are in.
            self.handle_endtag(tag.lower())
            self.clear_cdata_mode() # back to normal mode
        else:
            # we're still inside the CDATA_CONTENT_ELEMENTS tag.
            # throw the tag to handle_data instead.
            self.handle_data(match.group())
    else:
        # we're not in a CDATA_CONTENT_ELEMENTS tag. standard ending:
        self.handle_endtag(tag.lower())
    return j
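The behaviour this change aims for can be checked with a small script that
records parser events instead of raising. What follows is only a sketch under
stated assumptions: it targets Python 3, where the module is named html.parser
rather than HTMLParser, and it assumes a standard library that already treats
<script> content as CDATA. EventRecorder and DOCUMENT are hypothetical names
introduced for illustration; they are not part of the report or the patch.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventRecorder(HTMLParser):
    """Hypothetical helper: collects end tags and character data."""

    def __init__(self):
        super().__init__()
        self.end_tags = []
        self.data = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        # Data may arrive in several chunks, so keep every piece.
        self.data.append(data)


# Same document as in the report.
DOCUMENT = """
<html>
<body>
This page is completely well formed
<script language="javascript">
alert("</a></a>");
</script>
blah blah
</body>
</html>
"""

recorder = EventRecorder()
recorder.feed(DOCUMENT)
recorder.close()  # flush any text still buffered by the parser

# If script content is handled as CDATA, "a" never shows up as an end tag
# and the </a></a> text is delivered through handle_data() instead.
print(recorder.end_tags)
print('alert("</a></a>");' in "".join(recorder.data))
```

Recording events rather than raising keeps the check independent of how the
handler tracks open tags, so the same script can be run before and after a
change to parse_endtag to compare the reported end tags.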