sgmllib problem & proposed fix.

C. Titus Brown titus at caltech.edu
Fri Dec 17 03:29:20 EST 2004


Hi all,

while playing with PBP/mechanize/ClientForm, I ran into a problem with 
the way htmllib.HTMLParser was handling encoded tag attributes.

Specifically, the following HTML was not being handled correctly:

<option value="Small (6&quot;)">Small (6)</option>

The 'value' attribute was being given the escaped value,
'Small (6&quot;)', not the correct unescaped value, 'Small (6")'.

It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is 
based) does not unescape tag attributes.  However, HTMLParser.HTMLParser 
(the newer, more XHTML-friendly class) does do so.
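
For reference, here is a minimal sketch of the unescaping behavior I 
would expect. It assumes Python 3, where sgmllib and htmllib no longer 
exist and html.parser.HTMLParser is the descendant of 
HTMLParser.HTMLParser, so it shows the desired result rather than 
reproducing the sgmllib bug. The names AttrCollector and collect_attrs 
are made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every (name, value) attribute pair seen."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; values arrive unescaped.
        self.attrs_seen.extend(attrs)


def collect_attrs(markup):
    """Parse markup and return all start-tag attributes in document order."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs_seen


attrs = collect_attrs('<option value="Small (6&quot;)">Small (6)</option>')
print(attrs)
```

The attribute value handed to handle_starttag contains a literal double 
quote, which is the behavior the patch aims to give sgmllib as well.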

My proposed fix is to change sgmllib to unescape tag attribute values 
in the same way that HTMLParser.HTMLParser does.  A context diff to 
sgmllib.py from Python 2.4 is at the bottom of this message.

I'm posting to this newsgroup before submitting the patch because I'm 
not too familiar with these classes and I want to make sure this 
behavior is correct.

One question I had was this: as you can see from the code below, a 
simple string.replace is done to replace encoded strings with their 
unencoded translations.  Should handle_entityref be used instead, as 
with standard HTML text?
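
To make the trade-off concrete, here is a standalone sketch of the 
replace-based approach, written as a plain function rather than a 
method. The second function, unescape_wrong_order, is a hypothetical 
counter-example added only to show why the ampersand replacement has to 
come last.

```python
def unescape(s):
    """Replace the five basic named entities with their characters."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # Must be last
    return s


def unescape_wrong_order(s):
    """Hypothetical counter-example: handles '&amp;' first, which double-unescapes."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


print(unescape('Small (6&quot;)'))
print(unescape('&amp;lt;'))
print(unescape_wrong_order('&amp;lt;'))
```

With the ampersand handled last, the escaped text '&amp;lt;' comes out 
as the literal text '&lt;'; handling it first would turn it into '<'. 
The sketch also shows the limit of this approach: only these five named 
entities are translated, so numeric references such as '&#34;' and 
other named entities pass through unchanged.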

Another question: should this fix, if appropriate, be back-ported to 
older versions of Python?  (I doubt sgmllib has changed much, so it 
should be pretty simple to do.)

thanks for any advice,
--titus

*** /u/t/software/Python-2.4/Lib/sgmllib.py     2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py  2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
               elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                    attrvalue[:1] == '"' == attrvalue[-1:]:
                   attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
               attrs.append((attrname.lower(), attrvalue))
               k = match.end(0)
           if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
       def unknown_charref(self, ref): pass
       def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

   class TestSGMLParser(SGMLParser):


