[Python-Dev] "data".decode(encoding) ?!

M.-A. Lemburg mal@lemburg.com
Sun, 13 May 2001 18:53:55 +0200


Michael Hudson wrote:
> 
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> > Fredrik Lundh wrote:
> > > can you take that again?  shouldn't michael's example be
> > > equivalent to:
> > >
> > >     unicode(u"\u00e3".encode("latin-1"), "latin-1")
> > >
> > > if not, I'd argue that your "decode" design is broken, instead
> > > of just buggy...
> >
> > Well, it is sort of broken, I agree. The reason is that
> > PyString_Encode() and PyString_Decode() guarantee the returned
> > object to be a string object. To be able to reuse Unicode codecs
> > I added code which converts Unicode back to a string in case the
> > codec returns a Unicode object (which the .decode() method does).
> > This is what's failing.
> 
> It strikes me that if someone executes
> 
> aString.decode("latin-1")
> 
> they're going to expect a unicode string.  AIUI, what's currently
> happening is that the string is converted from a latin-1 8-bit string
> to the 16-bit unicode string I expected and then there is an attempt
> to convert it back to an 8-bit string using the default encoding.  So
> if I'd done a
> 
> sys.setdefaultencoding("latin-1")
> 
> in my sitecustomize.py, then aString.decode("latin-1") would just be
> aString again?  This doesn't seem optimal.

True, and that's why I am proposing to loosen the restriction
that the two APIs return strings only.
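
For illustration, a minimal sketch of the failure mode discussed above,
written in present-day Python 3 terms (bytes standing in for the 8-bit
string, str for the Unicode object). The helper string_only_decode() and
its default_encoding parameter are hypothetical; they only model the
"decode, then coerce back to a string" behaviour and are not the actual
PyString_Decode() implementation.

```python
def string_only_decode(data: bytes, encoding: str,
                       default_encoding: str = "ascii") -> bytes:
    """Hypothetical model of a decode API that must return an 8-bit string."""
    # Step 1: the codec does its job and produces a Unicode object.
    text = data.decode(encoding)
    # Step 2: the API coerces the result back to an 8-bit string using
    # the default encoding. This is the step that goes wrong.
    return text.encode(default_encoding)


latin1_data = "\u00e3".encode("latin-1")

# With an ASCII default encoding, the coercion step raises an error.
try:
    string_only_decode(latin1_data, "latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True

# With a latin-1 default encoding, the call hands back the input unchanged.
round_trip = string_only_decode(latin1_data, "latin-1",
                                default_encoding="latin-1")
```

Either outcome is surprising for a caller who expected a Unicode object,
which is the argument for returning the codec's result as-is.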
 
> > Perhaps I should simply remove the restriction and have both APIs
> > return the codec's return object as-is ?! (I would be in favour of
> > this, but I'm not sure whether this is already in use by someone...)
> 
> Are all the codecs distributed with Python 2.1 unicode-related?  If
> that's the case, PyString_Decode isn't terribly useful, is it?  It
> seems unlikely that it received much use.  Could be wrong of course.

All standard codecs in 2.0 and 2.1 are Unicode-related. I am
planning to write up a bunch of string-to-string codecs next
week, though, which will then be the first non-Unicode-related
codecs in 2.2.
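
As a rough sketch of what string-to-string codecs look like in use, the
example below relies on the codecs module of present-day CPython 3. The
codec names "hex_codec" and "zlib_codec" are the ones shipped there; it
is an assumption that they correspond to the codecs planned here.

```python
import codecs

payload = b"spam and eggs " * 20

# Bytes in, bytes out: no Unicode object is involved at any point.
hexed = codecs.encode(b"spam", "hex_codec")
compressed = codecs.encode(payload, "zlib_codec")

# Decoding reverses the transformation and again yields bytes.
unhexed = codecs.decode(hexed, "hex_codec")
restored = codecs.decode(compressed, "zlib_codec")
```

The module-level codecs.encode() and codecs.decode() functions are used
because they place no restriction on the type the codec returns.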

> OTOH, maybe I'm trying to wedge too much behaviour onto a particular
> operation.  Do we want
> 
> open(file).read().decode("jpeg") -> some kind of PIL object
> 
> to be possible?

This would be possible indeed. Even though some may find this
coding style obscure, I think this technique has the same
usefulness as e.g. piping at OS level.
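
A small sketch of decoding into an arbitrary object, assuming the
codec registry of present-day CPython 3. The codec name "keyvalue" and
the helper functions are hypothetical and exist only for this example;
a real image codec would follow the same pattern but return an image
object instead of a dict.

```python
import codecs


def _decode_keyvalue(data, errors="strict"):
    # Decoder contract: return (object, amount of input consumed).
    text = bytes(data).decode("ascii", errors)
    pairs = dict(item.split("=", 1) for item in text.split(";") if item)
    return pairs, len(data)


def _encode_keyvalue(obj, errors="strict"):
    # Encoder contract: return (encoded data, amount of input consumed).
    text = ";".join(f"{key}={value}" for key, value in obj.items())
    return text.encode("ascii", errors), len(obj)


def _search(name):
    # Hypothetical codec name, used only for this illustration.
    if name == "keyvalue":
        return codecs.CodecInfo(_encode_keyvalue, _decode_keyvalue,
                                name="keyvalue")
    return None


codecs.register(_search)

# The decoder returns a dict rather than a string.
settings = codecs.decode(b"width=640;height=480", "keyvalue")
wire = codecs.encode(settings, "keyvalue")
```

The calls go through codecs.decode() and codecs.encode() rather than
the string methods, since those functions pass the codec's return
object through unchanged.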

I am thinking of these use cases:

"äöü".decode("latin-1") -> Unicode (object construction)
"...jpeg data...".decode("jpeg") -> JpegImage object (ditto)
"äöü".decode("latin-1").encode("cp1252") -> string (recoding data)
"...long data...".encode("gzip") -> string (transfer encoding)
"...gzipped data...".decode("gzip") -> string (transfer decoding)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/