[Python-Dev] Re: Adding .decode() method to Unicode

Guido van Rossum guido@digicool.com
Wed, 13 Jun 2001 10:41:37 -0400


Wow, this almost looks like a real flamefest.  ("Flame" being defined
as the presence of metacomments.)

(In the following, s is an 8-bit string, u is a Unicode string, and e
is an encoding name.)

The original design of the encode() methods of string and Unicode
objects (in 2.0 and 2.1) is asymmetric, and clearly geared towards
Unicode codecs only: to decode an 8-bit string you *have* to use
unicode(s, encoding) while to encode a Unicode string into a specific
8-bit encoding you *have* to use u.encode(e).  8-bit strings also have
an encode() method: s.encode(e) is the same as unicode(s).encode(e).
(This is useful since code that expects Unicode strings should also
work when it is passed ASCII-encoded 8-bit strings.)

I'd say there's no need for s.decode(e), since this can already be
done with unicode(s, e) -- and to me that API looks better since it
clearly states that the result is Unicode.
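
To make the two directions concrete (the particular strings here are
just examples, of course):

  >>> u = u'caf\xe9'
  >>> s = u.encode('utf-8')        # Unicode -> 8-bit string: u.encode(e)
  >>> unicode(s, 'utf-8') == u     # 8-bit string -> Unicode: unicode(s, e)
  1
  >>> 'abc'.encode('utf-8')        # same as unicode('abc').encode('utf-8')
  'abc'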

We *could* have designed the encoding API similarly, making str(u, e)
available: symmetric with unicode(s, e), and a logical extension of
str(u), which uses the default encoding.  But I accept the argument
that u.encode(e) is better because it emphasizes the encoding action,
and because it means no API changes to str().

I guess what I'm saying here is that 'str' does not give enough of a
clue that an encoding action is going on, while 'unicode' *does* give
a clue that a decoding action is being done: as soon as you read
"Unicode" you think "Mmm, encodings..." -- but "str" is pretty
neutral, so u.encode(e) is needed to give a clue.

Marc-Andre proposes (and has partially checked in) changes that
stretch the meaning of the encode() method, and add a decode() method,
to be basically interfaces to anything you can do with the codecs
module.  The return type of encode() and decode() is now determined by
the codec (formerly, encode() always returned an 8-bit string).  Some
new codecs have been added that do things like gzip and base64.
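
For illustration, roughly the kind of usage this enables (assuming the
base64 codec simply wraps base64.encodestring(), which is how I read
the checked-in code):

  >>> 'abc'.encode('base64')       # 8-bit string in, 8-bit string out
  'YWJj\n'
  >>> 'YWJj\n'.decode('base64')    # the new decode() method
  'abc'
  >>> u'abc'.encode('utf-8')       # the old behavior still works
  'abc'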

Initially, I liked this, and even contributed a codec.

But questions keep coming up.

What is the problem being solved?

True, the codecs module has a clumsy interface if you just want to
invoke a codec on some data.  But that can easily be remedied by
adding convenience functions encode() and decode() to codecs.py --
which would have the added advantage that it would work for other
datatypes that support the buffer interface,
e.g. codecs.encode(myPILobject, "base64").
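
Such convenience functions could be as simple as this (just a sketch,
relying on the fact that codecs.lookup(e) returns an (encoder, decoder,
reader, writer) tuple):

  import codecs

  def encode(obj, encoding, errors='strict'):
      # Look up the codec and apply its encoder to whatever object
      # it accepts (8-bit string, Unicode string, buffer, ...).
      encoder = codecs.lookup(encoding)[0]
      return encoder(obj, errors)[0]    # encoder returns (output, length)

  def decode(obj, encoding, errors='strict'):
      decoder = codecs.lookup(encoding)[1]
      return decoder(obj, errors)[0]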

True, the "codec" pattern can be used for other encodings than
Unicode.  But it seems to me that the entire codecs architecture is
rather strongly geared towards en/decoding Unicode, and it's not clear
how well other codecs fit in this pattern (e.g. I noticed that all the
non-Unicode codecs ignore the error handling parameter or assert that
it is set to 'strict').
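
For example, a non-Unicode encode function presumably looks something
like this (my own sketch of the pattern, not necessarily the actual
checked-in code):

  import base64

  def base64_encode(input, errors='strict'):
      # The errors argument is meaningless for base64; it is only
      # asserted to be 'strict' and otherwise ignored.
      assert errors == 'strict'
      output = base64.encodestring(input)
      return (output, len(input))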

Is it really right that x.encode("gzip") and x.encode("utf-8") look
similar, while the former requires an 8-bit string and the latter only
makes sense if x is a Unicode string?

Another (minor) issue is that Unicode encoding names are an IANA
namespace.  Is it wise to add our own names to this?

I'm not forcing a decision here, but I do ask that we consider these
issues before forging ahead with what might be a mistake.  A PEP would
be most helpful to focus the discussion.

--Guido van Rossum (home page: http://www.python.org/~guido/)