Python HTML parser chokes on UTF-8 input
John Nagle
nagle at animats.com
Fri Oct 17 11:55:03 EDT 2008
Johannes Bauer wrote:
> Hello group,
>
> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:
Try BeautifulSoup. It actually understands how to detect the encoding
of an HTML file (there are three different ways that information can be
expressed), and will shift modes accordingly.
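As a rough illustration of what "detecting the encoding" involves, here is a minimal standard-library sketch. It is not BeautifulSoup's code. The three sources it checks (a byte order mark, the charset parameter of the HTTP Content-Type header, and a charset declaration in a meta tag) are my assumption about which three ways are meant, the priority order is likewise an assumption, and sniff_encoding and its parameters are hypothetical names.

```python
import codecs
import re

# Hypothetical helper for illustration; not part of BeautifulSoup or the stdlib.
# UTF-32 byte order marks are deliberately left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches charset=NAME with optional quotes, in a header value or a meta tag.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.I)


def sniff_encoding(raw, content_type=None, default="utf-8"):
    """Guess the encoding of raw HTML bytes from three possible sources."""
    # 1. A byte order mark at the very start of the document.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name

    # 2. The charset parameter of the HTTP Content-Type header, if supplied.
    if content_type:
        match = _CHARSET_RE.search(content_type.encode("ascii", "ignore"))
        if match:
            return match.group(1).decode("ascii").lower()

    # 3. A charset declaration near the top of the document itself.
    match = _CHARSET_RE.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # Nothing declared: fall back to the caller's default.
    return default
```

The returned name can then be passed to raw.decode() before the text is handed to a parser. The result is only a guess, since a page can declare one encoding and actually be in another.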
This is an ugly problem. Sometimes, it's necessary to parse part of
the file, discover that the rest of the file has a non-ASCII encoding,
and restart the parse from the beginning. BeautifulSoup has the
machinery for that.
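The restart idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not how BeautifulSoup implements it: it uses html.parser.HTMLParser from Python 3 (htmllib no longer exists there), and EncodingChanged, TitleParser and parse_title are hypothetical names. The parse starts with a provisional encoding, aborts when a meta tag declares a different one, and starts over from the first byte.

```python
import codecs
from html.parser import HTMLParser


class EncodingChanged(Exception):
    """Raised mid-parse when the document declares a different encoding."""

    def __init__(self, encoding):
        super().__init__(encoding)
        self.encoding = encoding


def _declared_charset(attrs):
    """Return the charset named by a meta tag's attributes, or None."""
    if attrs.get("charset"):
        return attrs["charset"].strip().lower()
    if (attrs.get("http-equiv") or "").lower() == "content-type":
        content = (attrs.get("content") or "").lower()
        _, _, tail = content.partition("charset=")
        return tail.split(";")[0].strip() or None
    return None


def _same_codec(a, b):
    """Compare encoding names after alias resolution (latin-1 == iso-8859-1)."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return True  # unknown name: treat as "no change" and keep parsing


class TitleParser(HTMLParser):
    """Collects the <title> text; optionally aborts on a new meta charset."""

    def __init__(self, encoding, allow_restart=True):
        super().__init__()
        self.encoding = encoding
        self.allow_restart = allow_restart
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and self.allow_restart:
            declared = _declared_charset(dict(attrs))
            if declared and not _same_codec(declared, self.encoding):
                raise EncodingChanged(declared)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(raw, guess="latin-1"):
    """Parse raw bytes, restarting once if the declared encoding differs."""
    parser = TitleParser(guess)
    try:
        parser.feed(raw.decode(guess, "replace"))
    except EncodingChanged as change:
        # Throw away the partial result and re-parse from the beginning.
        parser = TitleParser(change.encoding, allow_restart=False)
        parser.feed(raw.decode(change.encoding, "replace"))
    parser.close()
    return parser.title
```

Note that if the title comes before the meta tag, the first pass has already misread it by the time the declaration is seen. Re-parsing from the start is what corrects that.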
John Nagle