Another BeautifulSoup crash on bad HTML

John Nagle nagle at
Thu May 15 01:33:14 EDT 2008

    Can't really blame BeautifulSoup for this, but our crawler hit a page
("") with an out-of-range character escape:

  &#82167;

in this text:

  If you provide a name, email address and/or website and choose ‘Remember 
  me&#82167;, these details will be stored as a cookie on your computer.

The author clearly meant "&#8217;" (’), which is a single close quote.

The traceback as BeautifulSoup aborts:

SGMLParser.feed(self, markup or "")
File "/usr/local/lib/python2.5/", line 99, in feed
File "/usr/local/lib/python2.5/", line 181, in goahead
File "/var/www/vhosts/", line 1250, in handle_charref
data = unichr(int(ref))
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Another item in our ongoing saga of "What happens when you parse real-world
HTML".

    A try-block in handle_charref would be appropriate.
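
    Something along these lines is what I have in mind. This is only a
sketch, not BeautifulSoup's actual code: decode_charref is a hypothetical
stand-alone helper, and it is written for Python 3, where chr() plays the
role of unichr() and rejects values above 0x10FFFF rather than above 0xFFFF.
The fallback of passing the raw reference through unchanged is my assumption
about what a crawler would want.

```python
def decode_charref(ref):
    """Decode a decimal character reference such as '8217'.

    Hypothetical helper: returns the character if the reference is
    valid, otherwise returns the original markup text unchanged.
    """
    try:
        return chr(int(ref))
    except (ValueError, OverflowError):
        # Non-numeric or out-of-range reference: keep the raw escape
        # in the output instead of aborting the whole parse.
        return "&#%s;" % ref


if __name__ == "__main__":
    print(repr(decode_charref("8217")))   # a valid reference
    print(decode_charref("99999999"))     # out of range, passed through
```

    With that in place, one bad escape costs one undecoded reference in the
output instead of the whole page.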

				John Nagle
