encoding problem

Fri Dec 19 07:22:33 EST 2008

digisatori at gmail.com wrote:
> The code snippet below raises a UnicodeDecodeError.
> #!/usr/bin/env python
> #--*-- coding: utf-8 --*--
> s = 'äöü'
> u = unicode(s)
> 
> 
> It seems that the system uses the default encoding (ASCII) to decode the
> UTF-8 encoded string literal, and thus raises the error.

Indeed. You want:

u = unicode(s, 'utf-8') # or : u = s.decode('utf-8')
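
To make the difference concrete, here is a minimal sketch. It assumes that
unicode(s) with no encoding argument behaves like decoding with the default
ASCII codec, and it uses bytes.decode so that it also runs on Python 3. The
names raw and decode_or_none are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` stands in for the byte string s = 'äöü' as it is
# stored when the source file is saved as UTF-8 (hypothetical name).
raw = u'\xe4\xf6\xfc'.encode('utf-8')


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Roughly what unicode(s) attempts when the default encoding is ASCII.
implicit = decode_or_none(raw, 'ascii')
# Roughly what unicode(s, 'utf-8') or s.decode('utf-8') does.
explicit = decode_or_none(raw, 'utf-8')
```

With these definitions, implicit ends up as None (the ASCII codec rejects the
first non-ASCII byte) and explicit holds the three-character text.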

> The question is why the Python interpreter uses the default encoding
> instead of "utf-8", which I explicitly declared in the source.

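Because there's no reliable way for the interpreter to guess how whatever 
is passed to unicode() has been encoded. The coding declaration only tells 
the parser how to read the source file; a str object is just bytes and 
keeps no record of where those bytes came from. For example: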

# re-encode the same text as latin1: s now holds different bytes
s = s.decode("utf-8").encode("latin1")
# should unicode try to use utf-8 here ?
try:
    u = unicode(s)
except UnicodeDecodeError:
    print "would have worked better with: u = unicode(s, 'latin1')"
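
The ambiguity can be shown directly. The following is only a sketch, written
with bytes.decode so that it runs on Python 3 as well; the variable names are
invented for the example. It decodes the same text from two byte
representations with the wrong codec each time.

```python
# -*- coding: utf-8 -*-
text = u'\xe4\xf6\xfc'                   # the three characters ä, ö, ü
utf8_bytes = text.encode('utf-8')        # 6 bytes
latin1_bytes = text.encode('latin-1')    # 3 bytes

# Latin-1 bytes read as UTF-8: the byte sequence is invalid, so this fails.
try:
    latin1_bytes.decode('utf-8')
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False

# UTF-8 bytes read as Latin-1: every byte is valid Latin-1, so this
# succeeds silently but yields six wrong characters instead of three.
mojibake = utf8_bytes.decode('latin-1')
```

The second case is the more dangerous one: a wrong guess does not always
raise, it can also quietly produce garbage, which is one reason to ask the
caller for the encoding instead of guessing.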

NB: IIRC, the ASCII subset is encoded the same way in most common 
encodings (utf-8, latin1, ...), so I'd say it's a sensible default.
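
A quick way to check the point about the ASCII subset, as a sketch only: the
list of encodings below is an arbitrary choice for the example, and UTF-16 is
included to show that the property does not hold for every encoding.

```python
# -*- coding: utf-8 -*-
sample = u'plain ascii text'
ascii_bytes = sample.encode('ascii')

# ASCII-compatible encodings produce exactly the same bytes for ASCII text.
compatible = ['utf-8', 'latin-1', 'cp1252']
same_bytes = all(sample.encode(name) == ascii_bytes for name in compatible)

# UTF-16 uses two bytes per character here (plus a byte order mark),
# so its output differs.
utf16_differs = sample.encode('utf-16') != ascii_bytes
```

Here same_bytes and utf16_differs are both True, so ASCII is a safe default
for the usual 8-bit encodings and UTF-8, but not for every encoding.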