[ python-Bugs-1051840 ] HTMLParser doesn't treat endtags in <script> tags as CDATA
SourceForge.net
noreply at sourceforge.net
Sun Oct 24 02:46:24 CEST 2004
Bugs item #1051840, was opened at 2004-10-21 18:02
Message generated for change (Comment added) made by rhettinger
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1051840&group_id=5470
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser doesn't treat endtags in <script> tags as CDATA
Initial Comment:
HTMLParser.HTMLParser in Python 2.3.4 calls
self.handle_endtag() for end tags within script and
style sections, which it should not, because the
content is supposed to be CDATA, as defined in
CDATA_CONTENT_ELEMENTS within HTMLParser. The following
script will demonstrate this problem:
import HTMLParser

class MyHandler(HTMLParser.HTMLParser):
    tags = []

    def handle_starttag(self, tag, attr):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag != self.tags[-1]:
            #this should never happen in a well formed document
            raise "Not well-formed, endtag '" + tag + \
                  "' doesn't match starttag '" + self.lasttag + "'"
        self.tags.pop(-1)

s = """
<html>
<body>
This page is completely well formed
<script language="javascript">
alert("</a></a>");
</script>
blah blah
</body>
</html>
"""
m = MyHandler()
m.feed(s)
This will raise an exception. I fixed the bug by
changing the parse_endtag function on line 318 of
HTMLParser to the following:
def parse_endtag(self, i):
    rawdata = self.rawdata
    assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
    match = endendtag.search(rawdata, i+1) # >
    if not match:
        return -1
    j = match.end()
    match = endtagfind.match(rawdata, i) # </ + tag + >
    if not match:
        self.error("bad end tag: %s" % `rawdata[i:j]`)
    tag = match.group(1)
    #START BUGFIX
    if self.interesting == interesting_cdata:
        #we're in one of the CDATA_CONTENT_ELEMENTS
        if tag == self.lasttag and tag in self.CDATA_CONTENT_ELEMENTS:
            #it's the end of the CDATA_CONTENT_ELEMENTS tag we are in.
            self.handle_endtag(tag.lower())
            self.clear_cdata_mode() #back to normal mode
        else:
            #we're still inside the CDATA_CONTENT_ELEMENTS tag.
            #throw the tag to handle_data instead.
            self.handle_data(match.group())
    else:
        #we're not in a CDATA_CONTENT_ELEMENTS tag. standard ending:
        self.handle_endtag(tag.lower())
    return j
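For anyone checking this report against a newer interpreter, here is a minimal sketch of the same test using Python 3's html.parser module (the renamed successor of HTMLParser). The names TagRecorder and SAMPLE are made up for illustration, and the script assumes a current CPython 3 release, where script content is expected to reach handle_data() untouched rather than trigger handle_endtag().

```python
from html.parser import HTMLParser

# Same page as in the report: the script body contains stray "</a>" end tags.
SAMPLE = """
<html>
<body>
This page is completely well formed
<script language="javascript">
alert("</a></a>");
</script>
blah blah
</body>
</html>
"""


class TagRecorder(HTMLParser):
    """Hypothetical checker: keeps a stack of open tags and logs callbacks."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of start tags not yet closed
        self.end_tags = []    # every tag passed to handle_endtag, in order
        self.data = []        # every chunk passed to handle_data

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)
        if not self.open_tags or self.open_tags[-1] != tag:
            raise ValueError("end tag %r does not match the open tag" % tag)
        self.open_tags.pop()

    def handle_data(self, data):
        self.data.append(data)


def check(source):
    """Parse source and return the recorder so callers can inspect it."""
    parser = TagRecorder()
    parser.feed(source)
    parser.close()
    return parser


if __name__ == "__main__":
    result = check(SAMPLE)
    print("end tags seen:", result.end_tags)
    print("script text kept as data:", "</a></a>" in "".join(result.data))
```

If a ValueError is raised, the parser in use still reports end tags from inside the script element, which is the behaviour described above.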
----------------------------------------------------------------------
>Comment By: Raymond Hettinger (rhettinger)
Date: 2004-10-23 19:46
Message:
Logged In: YES
user_id=80475
Fred, what do you think?
----------------------------------------------------------------------
Comment By: Luke Bradley (neptune235)
Date: 2004-10-22 18:52
Message:
Logged In: YES
user_id=178561
<i>Although a fix may be worthwhile, as this happens a lot in
practice, HTMLParser is following the letter of the law in
throwing exceptions on pages that aren't strictly valid.
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data</i>
Well you're right, I'll be damned!
Hmm. I want to use HTMLParser to access other people's pages
on the net, and I can't fix their bad HTML. The problem here
is I'm not sure how to handle this at the level of my
Handler without inadvertently changing their javascript by
doing something like:
handle_data("</" + tag + ">")
in the handle_endtag event, which lowercases the tag. Is
there a way to access the literal string of the endtag in my
handler, I wonder? If not, it might be useful to add some
functionality to HTMLParser that allows us to handle invalid
HTML at the level of our handler without sacrificing
HTMLParser's commitment to standards compliance.
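One possible way to get at the literal end tag from a handler, sketched here for Python 3's html.parser and using only the public getpos() method. It rests on two assumptions: that the handler keeps its own copy of the source text, and that getpos() reports the position where the construct being handled starts. The second appears to hold in current CPython but is not spelled out in the documentation. The class name RawEndTagRecorder and its parse() method are hypothetical.

```python
from html.parser import HTMLParser


class RawEndTagRecorder(HTMLParser):
    """Hypothetical helper: recovers each end tag exactly as it was written."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Absolute index at which each line of the source begins, so that the
        # (line, column) pair from getpos() can be turned into a string index.
        self.line_starts = [0]
        for line in source.split("\n")[:-1]:
            self.line_starts.append(self.line_starts[-1] + len(line) + 1)
        self.raw_endtags = []

    def handle_endtag(self, tag):
        # Assumption: getpos() points at the "<" of the end tag being handled.
        lineno, offset = self.getpos()
        start = self.line_starts[lineno - 1] + offset
        end = self.source.find(">", start)
        # Slice the original text, so the case of the tag name is preserved.
        self.raw_endtags.append(self.source[start:end + 1])

    def parse(self):
        self.feed(self.source)
        self.close()
        return self.raw_endtags


if __name__ == "__main__":
    print(RawEndTagRecorder("<p>one</P>\n<DIV>two</Div>").parse())
```

The handler receives the lowercased name as usual, while raw_endtags keeps the original spelling, so text passed on to handle_data() need not be rebuilt from the lowercased tag.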
----------------------------------------------------------------------
Comment By: Richard Brodie (leogah)
Date: 2004-10-22 13:02
Message:
Logged In: YES
user_id=356893
Although a fix may be worthwhile, as this happens a lot in
practice, HTMLParser is following the letter of the law in
throwing exceptions on pages that aren't strictly valid.
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data
----------------------------------------------------------------------
Comment By: Luke Bradley (neptune235)
Date: 2004-10-21 18:04
Message:
Logged In: YES
user_id=178561
oops, I didn't know this would remove indentation. Let me
attach a file.
----------------------------------------------------------------------