encode() question

Omari Norman omari at smileystation.com
Mon Aug 6 15:28:22 EDT 2007


On Tue, Jul 31, 2007 at 09:53:11AM -0700, 7stud wrote:
> s1 = "hello"
> s2 = s1.encode("utf-8")
> 
> s1 = "an accented 'e': \xc3\xa9"
> s2 = s1.encode("utf-8")
> 
> The last line produces the error:
> 
> ---
> Traceback (most recent call last):
>   File "test1.py", line 6, in ?
>     s2 = s1.encode("utf-8")
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 17: ordinal not in range(128)
> ---
> 
> The error is a "decode" error, and as far as I can tell, decoding
> happens when you convert a regular string to a unicode string.  So, is
> there an implicit conversion taking place from s1 to a unicode string
> before encode() is called?  By what mechanism?

Yep. You are trying to encode a regular (byte) string. The problem is
that regular strings are already encoded, so it generally makes no
sense to call .encode() on them.

.encode()ing a string can be handy if you want to convert it from one
encoding to another. In that case, though, Python first converts the
string to Unicode, and to do that it has to know how the string is
currently encoded. Unless you tell it otherwise, Python assumes the
string is ASCII. Your string had a byte that is out of ASCII's range,
hence the error: Python was trying to *decode* the string, assumed it
was ASCII, and that assumption failed.
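To make the mechanism concrete, here is a small sketch in today's
Python 3, where the bytes/text split that old Python did implicitly is
explicit (Python 3's bytes behave like the plain strings above). This
reproduces the same UnicodeDecodeError and then shows the explicit
decode-then-encode dance:

```python
# Bytes already encoded as UTF-8 (like a Python 2 plain string).
raw = b"an accented 'e': \xc3\xa9"

# Decoding with the wrong codec reproduces the error from the post:
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position 17 ...

# Tell Python the real encoding first, then re-encode as needed:
text = raw.decode("utf-8")        # bytes -> Unicode text
latin = text.encode("latin-1")    # Unicode -> bytes in another encoding
print(latin)
```

The point is the same in either Python: converting between encodings is
always decode (with the *actual* source encoding) followed by encode,
never a bare .encode() on already-encoded bytes.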

This is all very confusing; I'd highly recommend reading this bit about
Unicode. It started me down the difficult road of actually understanding
what is going on here.

http://www.joelonsoftware.com/articles/Unicode.html


-- 
It's another Baseline Boulder Morning.


