Python HTML parser chokes on UTF-8 input

Thu Oct 9 18:13:36 EDT 2008

Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm trying to use a class derived from htmllib.HTMLParser to parse a website
>> which I fetched via
>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
>> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
>> code is something like this:
> 
> I believe you are confusing unicode with unicode encoded into bytes with
> the UTF-8 encoding.  You are having a problem feeding a unicode string, not
> 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.

I also believe I am. Could you please elaborate further?

Do I understand correctly when saying that type 'str' has no associated
default encoding, but type 'unicode' does? Does this mean that really
the only way of coping with that stuff is doing what I've been doing?
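
For illustration, here is a minimal sketch of the distinction between text
and encoded bytes. It is written in Python 3 syntax, where `str` is text and
`bytes` is a byte string (roughly what Python 2 calls `unicode` and `str`).
The sample word is made up.

```python
# Text (a sequence of code points) versus the same text encoded as UTF-8.
text = "M\u00fcnchen"          # 7 characters; \u00fc is the letter u-umlaut
data = text.encode("utf-8")    # 8 bytes; the umlaut becomes 0xc3 0xbc

# A byte string carries no encoding of its own; the reader must name one.
roundtrip = data.decode("utf-8")

print(len(text), len(data), roundtrip == text)
```

The byte string is only meaningful as text once an encoding is named, which
is why the conversion has to be spelled out somewhere.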

>> This is incredibly ugly IMHO, as I would really like the parser to just
>> accept UTF-8 input.
> 
> To me, code that works is prettier than code that does not.
> 
> In 3.0, text strings are unicode, and I believe that is what the parser
> now accepts.

Well, yes, I suppose working code is nicer than non-working code.
However I am sure you will agree that explicit encoding conversions are
cumbersome and error-prone.

>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
>> ordinal not in range(128)
> 
> When you do not bother to specify some other encoding in an encoding
> operation, sgmllib or something deeper in Python tries the default
> encoding, which does not work.  Stop being annoyed and tell the
> interpreter what you want.  It is not a mind-reader.
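
The error quoted above can be reproduced in isolation. This is a sketch in
Python 3 syntax, where the decode has to be written out; in Python 2 the
same ASCII decode happens implicitly whenever a byte string is mixed with a
unicode string. The single byte is taken from the error message; the choice
of Latin-1 for the successful decode is an assumption for illustration only.

```python
raw = b"\xfc"  # the byte named in the error message

try:
    raw.decode("ascii")  # what falling back to the ASCII default amounts to
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming an encoding that matches the data succeeds.
# Assumption: Latin-1, where 0xfc is u-umlaut. A lone 0xfc is not valid UTF-8.
as_latin1 = raw.decode("latin-1")

print(message)
print(as_latin1 == "\u00fc")
```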

How do I tell the interpreter to parse the strings I pass to it as
unicode? The way I did it, or is there some better way?
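
One way to be explicit, sketched here for Python 3, where
`html.parser.HTMLParser` takes the place of `htmllib`: decode the response
bytes once, at the boundary, and feed the parser text only. The `TitleParser`
class, the `parse_title` helper, and the sample page are hypothetical. In real
code the charset would come from the Content-Type header or a meta tag rather
than a hard-coded default.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical subclass that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(body, charset="utf-8"):
    """Decode the raw bytes once, then hand the parser text only."""
    parser = TitleParser()
    parser.feed(body.decode(charset))
    parser.close()
    return parser.title


# Stand-in for the bytes returned by the HTTP response's read().
page = "<html><head><title>Gr\u00fc\u00dfe</title></head></html>".encode("utf-8")
print(parse_title(page))
```

Keeping the decode in one place means the parser subclass never has to know
which encoding the server used.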

>> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
>> input - which should (again, IMHO) be the absolute standard for such a
>> new language.
> 
> The first version of Python came out in 1989, I believe, years before
> unicode.  One of the features of the new 3.0 version is that it uses
> unicode as the standard for text.

Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice,
do you know approximately when it will be ready?

Regards,
Johannes

-- 
"Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit,
verlästerung von Gott, Bibel und mir und bewusster Blasphemie."
         -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik
                         <48d8bf1d$0$7510$5402220f at news.sunrise.ch>