[Tutor] BeautifulSoup and Python 2.5

Kent Johnson kent37 at tds.net
Wed Mar 7 14:56:09 CET 2007


This seems to be a problem with BeautifulSoup and Python 2.5. I spent 
some time looking at it this morning and tracked down one problem. Below 
is the email I sent to the BeautifulSoup maintainer.

I doubt that either of these problems will matter in practice. I 
suggest you install it by copying BeautifulSoup.py to site-packages 
and go ahead and use it.

Kent

==========================================================

Hi,

BeautifulSoup 3.0.3 has two problems with Python 2.5: a set of 
UnicodeWarnings and one failing test. Running the tests gives this 
output:

................................./Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1654: 
UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
   elif data[:3] == '\xef\xbb\xbf':
/Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1657: 
UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
   elif data[:4] == '\x00\x00\xfe\xff':
/Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1660: 
UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
   elif data[:4] == '\xff\xfe\x00\x00':
.......F...........
======================================================================
FAIL: testQuotedAttributeValues (__main__.QuoteMeOnThat)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "BeautifulSoupTests.py", line 382, in testQuotedAttributeValues
     '<this is="r&#101;ally messed up &amp; stuff"></this>')
   File "BeautifulSoupTests.py", line 19, in assertSoupEquals
     self.assertEqual(str(c(toParse, convertEntities=convertEntities)), rep)
AssertionError: '<this is="really messed up &amp; stuff"></this>' != 
'<this is="r&#101;ally messed up &amp; stuff"></this>'

----------------------------------------------------------------------
Ran 52 tests in 0.208s

FAILED (failures=1)


The UnicodeWarnings seem to be caused by a change in how Python handles 
mixed string comparisons. In Python 2.4, the comparison
   u'' == '\xef\xbb\xbf'
raises
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

In Python 2.5, the same comparison prints a warning but doesn't raise an 
exception. For more information about this change, see the section 
starting "A new warning, UnicodeWarning," on this page:
http://docs.python.org/whatsnew/other-lang.html

The affected code is in UnicodeDammit._toUnicode(). When BeautifulSoup() 
is called with no text data, as happens a few times in the test suite, 
_toUnicode() is called with an empty unicode string and triggers this 
warning.
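
To make the trigger concrete, here is a minimal sketch of BOM sniffing 
that skips the byte comparisons when the input is empty or already 
text. It is not BeautifulSoup's code: sniff_bom and the BOMS table are 
hypothetical names, and it is written for a current Python where byte 
strings are `bytes` (under Python 2.5 the type check would be against 
`str` instead).

```python
# Hypothetical helper, not part of BeautifulSoup: maps a leading byte-order
# mark to a codec name, using the three BOMs from the warnings above.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\x00\x00\xfe\xff', 'utf-32be'),
    (b'\xff\xfe\x00\x00', 'utf-32le'),
]


def sniff_bom(data):
    """Return (encoding, data without BOM), or (None, data) if no BOM applies."""
    # Only non-empty byte strings can carry a BOM. Empty or already-decoded
    # input is returned untouched, so no mixed comparison ever happens.
    if not isinstance(data, bytes) or not data:
        return None, data
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data
```

Checking the type before comparing is an alternative to an early 
return in the caller; either way the empty unicode string never 
reaches the byte comparisons.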

One way to fix this is to have UnicodeDammit.__init__() explicitly check 
for an empty string, set self.unicode to u"", and return early. Here is 
a suggested rewrite of the initial portion of UnicodeDammit.__init__():
     def __init__(self, markup, overrideEncodings=[],
                  smartQuotesTo='xml'):
         self.markup, documentEncoding, sniffedEncoding = \
                      self._detectEncoding(markup)
         self.smartQuotesTo = smartQuotesTo
         self.triedEncodings = []
         # Empty or already-unicode markup needs no decoding, so store
         # it and exit before _toUnicode() can compare it to a BOM.
         if markup == "" or isinstance(markup, unicode):
             self.originalEncoding = None
             self.unicode = unicode(markup)
             return

Note that I have also changed how this works when markup is already 
unicode. The current implementation is incorrect: it returns a value 
from __init__(), which is not allowed.
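
The point about __init__() can be checked in isolation. The sketch 
below uses two made-up classes (ReturnsValue and BareReturn, neither 
taken from BeautifulSoup): returning a value from __init__() raises 
TypeError at instantiation, while storing the result on the instance 
and using a bare return works.

```python
class ReturnsValue(object):
    def __init__(self, markup):
        # Returning anything other than None from __init__ is an error.
        return markup


class BareReturn(object):
    def __init__(self, markup):
        # Store the result on the instance, then exit early.
        self.unicode = markup
        return


try:
    ReturnsValue(u"")
except TypeError:
    failed = True
else:
    failed = False

ok = BareReturn(u"abc")
```

Here `failed` ends up True, and `ok.unicode` holds the string that was 
passed in.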


I don't know enough about the way BeautifulSoup works to figure out the 
second problem, the failure in testQuotedAttributeValues.

Best regards,
Kent


