Py 2.5: Bug in sgmllib

Michael Butscher mbutscher at gmx.de
Sun Oct 22 07:20:35 EDT 2006


Hi,

If I execute the following two lines in Python 2.5 (feeding in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Combining this byte
string with the rest of the unicode string triggers an implicit ASCII
decode, which fails for byte 0xdf.
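
The error itself can be reproduced without sgmllib. The following is a
minimal sketch, assuming Python 2's default behaviour of decoding a byte
string with the ASCII codec whenever it is combined with a unicode string.
The explicit decode below stands in for that implicit step. Note that the
b'' literal and the "except ... as" syntax need Python 2.6 or later, so
this is an illustration rather than code for 2.5 itself.

```python
# Byte string of the kind produced for code point 223 (0xdf).
raw = b'\xdf'

try:
    # Explicit form of the ASCII decode that Python 2 applies implicitly
    # when a byte string is combined with a unicode string.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
else:
    message = None

print(message)
```

The message printed here should name the same codec and byte as the
traceback above.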


Workaround (not thoroughly tested): Override convert_codepoint in a 
derived class with:

    def convert_codepoint(self, codepoint):
        return unichr(codepoint)
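
The idea behind the workaround, returning a unicode character instead of
a byte string, can be sketched independently of sgmllib. Everything below
is hypothetical illustration code: the name decode_charrefs and the
simplified CHARREF pattern are assumptions of this sketch and are not
taken from sgmllib. The small unichr shim is only there so the example
also runs on Python 3, where chr already returns a text character.

```python
import re

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3: chr already returns a text character

# Simplified pattern for decimal character references such as &#223;
# (an assumption for this sketch, not the regex sgmllib uses).
CHARREF = re.compile(u'&#([0-9]+);?')


def decode_charrefs(text):
    """Replace decimal character references in a unicode string."""
    def _convert(match):
        # Return a unicode character, so the substituted result stays
        # unicode and no implicit ASCII decode is needed.
        return unichr(int(match.group(1)))
    return CHARREF.sub(_convert, text)


result = decode_charrefs(u'te&#223;t')
print(repr(result))
```

Because the callback returns text of the same type as the input, the
substituted string can be joined with the rest of the attribute value
without any byte/unicode mixing.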



Is this a bug, or is SGMLParser not meant to be used with unicode strings
(in which case that should be documented)?



Michael


