Python HTML parser chokes on UTF-8 input

John Nagle nagle at animats.com
Fri Oct 17 11:55:03 EDT 2008


Johannes Bauer wrote:
> Hello group,
> 
> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:

    Try BeautifulSoup.  It actually understands how to detect the encoding
of an HTML file (there are three different ways that information can be
expressed), and will shift modes accordingly.
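
    For illustration only, here is a rough standard-library sketch of
looking for a declared encoding. It assumes the three places are the HTTP
Content-Type header, an XML declaration, and a <meta> tag, and that the
header wins when several are present; the post does not say which three
are meant. The names declared_encoding and charset_from_header are made
up for this example, and this is not how BeautifulSoup does it internally.

```python
import re
from email.message import Message

# Assumed in-document declarations: an XML declaration at the very start,
# or a <meta> tag carrying a charset (either attribute form).
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)',
                          re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None if absent


def declared_encoding(body, content_type=None):
    """Return the encoding declared for raw bytes `body`, or None.

    Hypothetical helper: checks the HTTP header first, then the start of
    the document itself.
    """
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    head = body[:2048]  # declarations normally appear near the top
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

    The result can then be passed to bytes.decode() before the text is
handed to a parser that expects already-decoded input.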

    This is an ugly problem.  Sometimes, it's necessary to parse part of
the file, discover that the rest of the file has a non-ASCII encoding,
and restart the parse from the beginning.  BeautifulSoup has the
machinery for that.
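
    The restart idea can be sketched with the standard library alone.
This is a minimal illustration under assumptions, not BeautifulSoup's
actual machinery: it uses Python 3's html.parser rather than htmllib, the
names EncodingChanged, RestartingParser and parse are invented for the
example, and it only looks at <meta> tags.

```python
from html.parser import HTMLParser


class EncodingChanged(Exception):
    """Raised mid-parse when the document declares a different encoding."""

    def __init__(self, encoding):
        super().__init__(encoding)
        self.encoding = encoding


class RestartingParser(HTMLParser):
    """Collects text, and aborts if a <meta> tag contradicts `encoding`."""

    def __init__(self, encoding):
        super().__init__()
        self.encoding = encoding
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        declared = attrs.get("charset")
        content = (attrs.get("content") or "").lower()
        if declared is None and "charset=" in content:
            declared = content.split("charset=", 1)[1].split(";")[0].strip()
        if declared and declared.lower() != self.encoding:
            raise EncodingChanged(declared.lower())

    def handle_data(self, data):
        self.text.append(data)


def parse(body, first_guess="ascii"):
    """Parse raw bytes, restarting once if a new encoding is discovered."""
    encoding = first_guess
    for _ in range(2):  # first attempt plus at most one restart
        parser = RestartingParser(encoding)
        try:
            # errors="replace" lets the first pass get as far as the <meta>
            # tag even when the guess is wrong.
            parser.feed(body.decode(encoding, errors="replace"))
            parser.close()
            return encoding, "".join(parser.text)
        except EncodingChanged as change:
            encoding = change.encoding  # throw the partial parse away
    raise ValueError("document declares conflicting encodings")
```

    The first pass decodes as ASCII, hits the <meta> tag, and is thrown
away; the second pass decodes the same bytes with the declared encoding.
An encoding name that Python does not recognise would raise LookupError
here, which a real implementation would need to handle.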

				John Nagle



More information about the Python-list mailing list