HTMLParser fix

Dan Walton dusenetw4 at opti.cgi.net
Mon Aug 26 03:12:49 EDT 2002


I ran into a problem this evening getting HTMLParser to parse an
HTML page with embedded script tags whose contents include end tags.
Take the following HTML, for instance:

<html>
<body>
<script>
<!--
document.write('<h1>testing</h1>');
-->
</script>
</body>
</html>


The HTMLParser module which comes with Python 2.2.1 chops this up
and returns the </h1> as a tag event when it should be part of a data
event.
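
To see which events the parser reports for the page above, a small
event-collecting subclass can help. This is only a sketch, and it assumes
a current Python 3, where the module is imported as html.parser rather
than the HTMLParser module shipped with Python 2.2.1. The names SAMPLE,
EventCollector and collect_events are made up for this illustration.

```python
from html.parser import HTMLParser

# The sample page from this message.
SAMPLE = """<html>
<body>
<script>
<!--
document.write('<h1>testing</h1>');
-->
</script>
</body>
</html>
"""


class EventCollector(HTMLParser):
    """Hypothetical helper: records each parser callback as a (kind, value) pair."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(html):
    """Feed the whole document to the parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for kind, value in collect_events(SAMPLE):
        print(kind, repr(value))
```

With the behaviour described above, an ("end", "h1") entry would show up
in the list. A parser that treats the script body as character data
should instead report the whole document.write(...) line inside data
events, possibly split across more than one handle_data call.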

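The following patch should fix this problem. The diff runs from the
patched file to the original, so the lines marked '<' are the ones the
fix adds: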

98,99d97
<         self.cdata_mode = 0
<         self.cdata = []
126d123
<         self.cdata_mode = 1
130,132d126
<         self.handle_data(''.join(self.cdata))
<         self.cdata_mode = 0
<         self.cdata = []
148,152c142
<             if i < j:
<                 if(self.cdata_mode):
<                     self.cdata.append(rawdata[i:j])
<                 else:
<                     self.handle_data(rawdata[i:j])
---
>             if i < j: self.handle_data(rawdata[i:j])
160a151,152
>                     if k >= 0:
>                         self.clear_cdata_mode()
339,345d330
<         #print('parse_endtag[%s]' % tag)
<         if(self.cdata_mode):
<             if(tag.lower() in self.CDATA_CONTENT_ELEMENTS):
<                 self.clear_cdata_mode()
<             else:
<                 self.cdata.append(rawdata[i:j])
<                 return j







