[Python-Dev] "data".decode(encoding) ?!

M.-A. Lemburg mal@lemburg.com
Fri, 11 May 2001 12:07:40 +0200


Fredrik Lundh wrote:
> 
> mal wrote:
> 
> > > I may be being dense, but can you explain what's going on here:
> > >
> > > ->> u'\u00e3'.encode('latin-1')
> > > '\xe3'
> > > ->> u'\u00e3'.encode("latin-1").decode("latin-1")
> > > Traceback (most recent call last):
> > >   File "<input>", line 1, in ?
> > > UnicodeError: ASCII encoding error: ordinal not in range(128)
> >
> > The string.decode() method will try to reuse the Unicode
> > codecs here. To do this, it will have to convert the string
> > to Unicode first and this fails due to the character not being
> > in the ASCII range.
> 
> can you take that again?  shouldn't michael's example be
> equivalent to:
> 
>     unicode(u"\u00e3".encode("latin-1"), "latin-1")
> 
> if not, I'd argue that your "decode" design is broken, instead
> of just buggy...

Well, it is sort of broken, I agree. The reason is that
PyString_Encode() and PyString_Decode() guarantee that the
returned object is a string object. To be able to reuse the
Unicode codecs, I added code which converts the result back to
a string in case the codec returns a Unicode object (which the
Unicode codec's decode function does). That conversion goes
through the default ASCII codec, and this is what's failing.
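
To make this concrete, here is roughly what '\xe3'.decode("latin-1")
does today (a sketch in Python 2.x terms; the intermediate steps are
illustrative, not the actual C-level calls):

    import codecs

    # The string method looks up the codec and calls its decode
    # function, which returns a Unicode object:
    decode = codecs.lookup("latin-1")[1]
    result = decode('\xe3')[0]     # u'\xe3' -- a Unicode object

    # PyString_Decode() then coerces the result back into a string
    # object; str() on a Unicode object goes through the default
    # (ASCII) codec, so this raises the UnicodeError seen above:
    str(result)

Fredrik's unicode('\xe3', "latin-1") spelling never performs that
final coercion, which is why it works.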

Perhaps I should simply remove the restriction and have both
APIs return the codec's return object as-is?! (I would be in
favour of this, but I'm not sure whether anyone already relies
on the string-only guarantee...)
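
With the restriction lifted, Michael's round-trip would then
behave as expected (hypothetical output, assuming the codec's
Unicode result is passed through unchanged):

    >>> u'\u00e3'.encode("latin-1")
    '\xe3'
    >>> u'\u00e3'.encode("latin-1").decode("latin-1")
    u'\xe3'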

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/