Python HTML parser chokes on UTF-8 input

Terry Reedy tjreedy at udel.edu
Thu Oct 9 23:08:40 EDT 2008


Johannes Bauer wrote:
> Terry Reedy schrieb:
>> Johannes Bauer wrote:
>>> Hello group,
>>>
>>> I'm trying to use a htmllib.HTMLParser derivate class to parse a website
>>> which I fetched via
>>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
>>> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
>>> code is something like this:
>> I believe you are confusing unicode with unicode encoded into bytes with
>> the UTF-8 encoding.  You are having a problem feeding a unicode string, not
>> 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
> 
> I also believe I am. Could you please elaborate further?

I am a unicode neophyte.  My source of info is the first 3 or so 
chapters of the unicode specification.
http://www.unicode.org/versions/Unicode5.1.0/
I recommend that or other sites for other questions.  It took me more 
than one reading of the same topics in different texts to pretty well 
'get it'.

> Do I understand correctly when saying that type 'str' has no associated
> default encoding, but type 'unicode' does?

I am not sure what you mean.  Unicode strings in Python are internally 
stored in UCS-2 or UCS-4 format, depending on how the interpreter was built.
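
A minimal sketch of the distinction between a unicode string and its UTF-8 
encoded bytes.  The sample word is made up for illustration, and the u"" 
literal is assumed to be accepted by the interpreter in use (2.x, or 3.3 
and later):

```python
# A unicode (text) string: a sequence of code points, no encoding involved.
text = u"M\u00fcnchen"

# Encoding turns the text into a byte string; UTF-8 is one possible encoding.
data = text.encode("utf-8")

# The text has 7 code points, but the UTF-8 bytes are 8 long,
# because u-umlaut (U+00FC) takes two bytes in UTF-8.
assert len(text) == 7
assert len(data) == 8

# Decoding with the same encoding recovers the original text.
assert data.decode("utf-8") == text
```

The internal UCS-2 or UCS-4 storage is not visible here; an encoding only 
comes into play when text is converted to or from bytes.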

> Does this mean that really
> the only way of coping with that stuff is doing what I've been doing?

Having two text types in 2.x was necessary as a transition strategy but 
has also been something of a mess.  You did it one way.  Jerry gave you 
an alternative that I could not have explained.  Your choice.  Or use 3.0.
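
For the 3.0 route, a rough sketch, assuming Python 3's html.parser module 
(htmllib itself is not in 3.0).  The TextCollector class and the sample 
markup are hypothetical, and the page is assumed to be UTF-8 encoded: 
decode the fetched bytes once, then feed the parser text only.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser subclass that gathers the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser with text content as a str, never bytes.
        self.chunks.append(data)


# Stand-in for the bytes that read() would return from the server.
raw = "<p>Gr\u00fc\u00dfe</p>".encode("utf-8")

parser = TextCollector()
parser.feed(raw.decode("utf-8"))  # decode first: the parser expects text
parser.close()

collected = "".join(parser.chunks)
```

In a real program the encoding would come from the response headers or the 
page itself rather than being hard-coded.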

..
> Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice,
> do you know approximately when it will be ready?

For my current purposes, it is ready enough.  Developers *really* hope 
to get 3.0 final out by mid-December.  The schedule was pushed back 
because a) the outside world has not completely and cleanly switched to 
unicode text and b) some people who just started with the release 
candidate have found import bugs that earlier testers did not.  It still 
needs more testing from more different users (hint, hint).

Terry Jan Reedy



