Python HTML parser chokes on UTF-8 input

Fri Oct 10 03:03:08 EDT 2008

On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:

> Terry Reedy schrieb:
>> I believe you are confusing unicode with unicode encoded into bytes
>> with the UTF-8 encoding.  You are having a problem feeding a unicode
>> string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded
>> byte string.
> 
> I also believe I am. Could you please elaborate further?
> 
> Do I understand correctly when saying that type 'str' has no associated
> default encoding, but type 'unicode' does?

`str` doesn't know an encoding.  Its content could be any byte data
anyway.  And `unicode` doesn't know an encoding either; it holds unicode
characters.  How they are represented internally is not the business of
the programmer.  If you want to operate on unicode characters you have to
decode a byte string (`str`) with the appropriate encoding.  If you want
to feed `unicode` to something that expects bytes and not unicode
characters you have to encode it again.
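
A minimal sketch of that decode/encode round trip.  It is written for
Python 3, where the byte string type is called `bytes` and the text type
is called `str` (in the Python 2 discussed here those are `str` and
`unicode`).  The sample data and the variable names are made up for
illustration.

```python
# Bytes as they might arrive from a file or a socket: UTF-8 encoded text.
raw = b"Gr\xc3\xbc\xc3\x9fe"      # hypothetical sample data

# Decode with the encoding you know the data to be in.
text = raw.decode("utf-8")        # now a sequence of characters

# The two objects are different things with different lengths.
n_bytes = len(raw)                # counts bytes
n_chars = len(text)               # counts characters

# Encode again before handing the text to something that expects bytes.
out = text.encode("utf-8")
```

The byte string is longer than the decoded text because two of the
characters take two bytes each in UTF-8.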

>>> This is incredibly ugly IMHO, as I would really like the parser to
>>> just accept UTF-8 input.

It accepts UTF-8 encoded byte strings (`str`) but not `unicode` objects.

> However I am sure you will agree that explicit encoding conversions are
> cumbersome and error-prone.

But implicit conversions are impossible because the interpreter doesn't
know which encoding to use and refuses to guess.  Implicit and guessed
conversions are error-prone too.
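
To illustrate why a guess is risky, here is a small sketch, again in
Python 3 syntax (`bytes` for byte strings, `str` for text) with made-up
sample data.  Python 3 refuses to mix the two types at all; Python 2
instead tried the ASCII codec and raised `UnicodeDecodeError` on
non-ASCII bytes.

```python
raw = b"\xc3\xa9"                 # one character, UTF-8 encoded

# Mixing bytes and text is rejected instead of guessing an encoding.
try:
    raw + "x"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# A wrong guess does not fail; it silently yields the wrong characters.
wrong = raw.decode("latin-1")     # two unrelated characters
right = raw.decode("utf-8")       # the one character that was meant
```

The wrong guess decodes without any error, so the mistake only shows up
later as garbled text.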

Ciao,
	Marc 'BlackJack' Rintsch