Python HTML parser chokes on UTF-8 input

Fri Oct 17 11:55:03 EDT 2008

Johannes Bauer wrote:
> Hello group,
> 
> I'm trying to use a htmllib.HTMLParser derived class to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:

    Try BeautifulSoup.  It knows how to detect the encoding of an HTML
file (there are three different ways that information can be expressed),
and will decode the input accordingly.

    This is an ugly problem.  The encoding can be declared inside the
document itself, so sometimes it's necessary to parse part of the file,
discover that the rest of the file has a non-ASCII encoding, and restart
the parse from the beginning.  BeautifulSoup has the machinery for that.
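
    For illustration only, here is a rough sketch of that "parse, discover,
restart" idea using just the standard library of current Python 3.  It is
not BeautifulSoup's actual implementation.  The names EncodingFound,
CharsetSniffer and decode_html are made up for this example, the UTF-8
fallback is an assumption, and only the <meta> declaration is checked (an
HTTP Content-Type header or a byte order mark would need separate handling).

```python
import codecs
from html.parser import HTMLParser


class EncodingFound(Exception):
    """Hypothetical signal: aborts the first pass once a charset is seen."""

    def __init__(self, encoding):
        super().__init__(encoding)
        self.encoding = encoding


class CharsetSniffer(HTMLParser):
    """First-pass parser that only looks for a <meta> charset declaration."""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Short form: <meta charset="utf-8">
        if attrs.get("charset"):
            raise EncodingFound(attrs["charset"].strip())
        # Long form: <meta http-equiv="Content-Type"
        #                  content="text/html; charset=utf-8">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            charset = content.split("charset=", 1)[1].split(";")[0]
            raise EncodingFound(charset.strip(" '\""))


def decode_html(raw, fallback="utf-8"):
    """Return (text, encoding) for the raw bytes of an HTML page."""
    encoding = fallback  # assumption: used when nothing usable is declared
    try:
        # First pass: latin-1 maps every byte to a character, so this
        # decode cannot fail, and the ASCII markup survives unchanged.
        CharsetSniffer().feed(raw.decode("latin-1"))
    except EncodingFound as found:
        try:
            codecs.lookup(found.encoding)  # reject unknown codec names
            encoding = found.encoding
        except LookupError:
            pass
    # Restart from the beginning with the encoding that was discovered.
    return raw.decode(encoding, errors="replace"), encoding


page = ('<html><head><meta charset="utf-8"></head>'
        '<body>Gr\u00fc\u00dfe</body></html>').encode("utf-8")
text, encoding = decode_html(page)
```

    The returned text is an ordinary Unicode string, so it can then be fed
to the real parser from the start without it ever seeing raw UTF-8 bytes.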

				John Nagle