[Tutor] BeautifulSoup and Python 2.5
Kent Johnson
kent37 at tds.net
Wed Mar 7 14:56:09 CET 2007
This seems to be a problem with BeautifulSoup and Python 2.5. I spent
some time looking at it this morning and tracked down one of the two
problems. Below is the email I sent to the BeautifulSoup maintainer.
I doubt that either problem will actually matter in practice. I suggest
you install it by copying the .py file to site-packages and go ahead and
use it.
Kent
==========================================================
Hi,
BeautifulSoup has a few problems with Python 2.5. Running the tests
gives this output:
................................./Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1654:
UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
elif data[:3] == '\xef\xbb\xbf':
/Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1657:
UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\x00\x00\xfe\xff':
/Users/kent/Desktop/Downloads/BeautifulSoup-3.0.3/BeautifulSoup.py:1660:
UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\xff\xfe\x00\x00':
.......F...........
======================================================================
FAIL: testQuotedAttributeValues (__main__.QuoteMeOnThat)
----------------------------------------------------------------------
Traceback (most recent call last):
File "BeautifulSoupTests.py", line 382, in testQuotedAttributeValues
'<this is="really messed up & stuff"></this>')
File "BeautifulSoupTests.py", line 19, in assertSoupEquals
self.assertEqual(str(c(toParse, convertEntities=convertEntities)), rep)
AssertionError: '<this is="really messed up & stuff"></this>' !=
'<this is="really messed up & stuff"></this>'
----------------------------------------------------------------------
Ran 52 tests in 0.208s
FAILED (failures=1)
The UnicodeWarnings seem to be caused by a change in how Python handles
mixed string comparisons. In Python 2.4, the comparison
    u'' == '\xef\xbb\xbf'
raises
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
    0: ordinal not in range(128)
In Python 2.5, the same comparison prints a warning but doesn't raise an
exception. For more information about this change, see the section
starting "A new warning, UnicodeWarning," on this page:
http://docs.python.org/whatsnew/other-lang.html
The affected code is in UnicodeDammit._toUnicode(). When BeautifulSoup()
is called with no text data, as happens a few times in the test suite,
_toUnicode() is called with an empty unicode string and triggers this
warning.
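As a rough illustration of the comparisons named in the warnings, here is
a minimal standalone sketch. It is not BeautifulSoup's code: sniff_bom is
a hypothetical helper, and the BOM constants are taken from the standard
codecs module rather than spelled out as literals. The point is only that
checking for already-decoded text first avoids comparing a unicode string
with a byte string at all.

```python
import codecs

# unicode on Python 2, str on Python 3
TEXT_TYPE = type(u"")


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    if isinstance(data, TEXT_TYPE):
        # Already decoded text has no byte-order mark to sniff, and
        # returning here skips the mixed unicode/byte comparisons below.
        return None
    if data[:3] == codecs.BOM_UTF8:        # '\xef\xbb\xbf'
        return "utf-8"
    if data[:4] == codecs.BOM_UTF32_BE:    # '\x00\x00\xfe\xff'
        return "utf-32be"
    if data[:4] == codecs.BOM_UTF32_LE:    # '\xff\xfe\x00\x00'
        return "utf-32le"
    return None
```

With this guard in place, sniff_bom(u"") returns None without ever
reaching the byte-string comparisons.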
One way to fix this is to have UnicodeDammit.__init__() explicitly check
for an empty string and just return u"". Here is a suggested rewrite of
the initial portion of UnicodeDammit.__init__():
    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml'):
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == "" or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return
Note that I have also changed the way this works if markup is already
unicode. The current implementation is incorrect: it returns a value,
which is not allowed in __init__().
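To show why returning a value from __init__() is a problem, here is a
small sketch using two made-up classes (ReturnsValue and BareReturn are
hypothetical names, not BeautifulSoup classes). Python raises TypeError
when __init__() returns anything other than None, while a bare return is
fine.

```python
class ReturnsValue(object):
    def __init__(self, markup):
        self.markup = markup
        return markup  # not allowed: __init__() must return None


class BareReturn(object):
    def __init__(self, markup):
        self.markup = markup
        return  # allowed: a bare return yields None


def init_raises(cls, markup):
    """Report whether constructing cls(markup) raises TypeError."""
    try:
        cls(markup)
    except TypeError:
        return True
    return False


returns_value_fails = init_raises(ReturnsValue, u"text")
bare_return_fails = init_raises(BareReturn, u"text")
```

Here returns_value_fails ends up True and bare_return_fails ends up
False, which is why the rewrite above uses a bare return.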
I don't know enough about the way BeautifulSoup works to figure out the
second problem, the failure in testQuotedAttributeValues.
Best regards,
Kent