[Python-Dev] Adding .decode() method to Unicode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 12 Jun 2001 13:00:40 +0200


> > > str.encode()
> > > str.decode()
> > > uni.encode()
> > > #uni.decode() # still missing
> > 
> > It's not missing. str.decode and uni.encode go through a single codec;
> > that's easy. str.encode is somewhat more confusing, because it really
> > is unicode(str).encode. Now, you are not proposing that uni.decode is
> > str(uni).decode, are you?
> 
> No. uni.decode() will (just like the other methods) directly
> interface to the codecs decoder -- there is no magic conversion
> involved. It is meant to be used by Unicode-Unicode codecs

When invoking "Hallo".encode("utf-8"), two conversions are executed:
first the default decoding into Unicode, then the UTF-8 encoding. Of
course, that is not the intended use (but then, is the intended use
documented anywhere?): instead, people should write
"Hallo".encode("base64"). This is an example I can understand,
although I'm not sure why it is inherently better to write this
instead of writing base64.encodestring("Hallo").
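
A minimal sketch of the two cases, written for present-day Python 3
rather than the interpreter under discussion (an assumption). There the
implicit default decoding no longer exists, so both steps are spelled
out, with ASCII standing in for the default encoding. The codec-registry
route and the base64-module route to the same bytes are shown side by
side; base64.encodebytes is the current name for what is called
base64.encodestring above.

```python
import base64
import codecs

raw = b"Hallo"

# The two conversions, made explicit: first decode the byte string to
# Unicode (ASCII is an assumed stand-in for the default encoding),
# then encode the result as UTF-8.
text = raw.decode("ascii")
utf8_bytes = text.encode("utf-8")

# A bytes-to-bytes conversion through the codec registry ...
via_codec = codecs.encode(raw, "base64")

# ... and the same conversion through the base64 module directly.
via_module = base64.encodebytes(raw)

# Round trip through the registry's decoder.
round_trip = codecs.decode(via_codec, "base64")
```

Both routes produce identical output here, which is the point of the
question: the codec spelling adds no capability over the module call.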

> > If not that, what else would it mean? And if it means something else,
> > it is clearly not symmetric to str.encode, so it is not "missing".
> 
> It is in the sense that strings support this method and Unicode
> currently doesn't.

The rationale for string.encode is weak: it argues that string->string
conversions are frequent enough to justify this API, even though these
conversions have nothing to do with coded character sets.

So far, I can see *no* rationale for unicode.decode.

> There's no need for a PEP. This addition is much too simple
> to require a PEP on its own.

PEP 1 says:

# We intend PEPs to be the primary mechanisms for proposing new
# features, for collecting community input on an issue, and for
# documenting the design decisions that have gone into Python.  The
# PEP author is responsible for building consensus within the
# community and documenting dissenting opinions.

So we have a proposal for a new feature, and we have dissenting
opinions. Who are you to decide that this addition is too simple to
require a PEP on its own?

> As for use cases: I have already given a whole bunch of them
> (Unicode compression, normalization, escaping in various ways).

I was asking for specific examples: Names of specific codecs that you
want to implement, and application code fragments using these specific
codecs. For example, I don't know how I would use Unicode compression
even if I had this proposed feature. I know what XML escaping is, and I
cannot see how this feature would help.

> True, but not all XML text out there is meant for XML parsers to 
> read ;-). Preprocessing of e.g. XML text in Python is a rather common
> thing to do and this is what the direct codec access methods are
> meant for.

Can you give an example of an application which processes XML without
a parser, but still converts character entities (preferably
open-source, so I can study its code)? I wonder whether they get CDATA
sections right... MAL, I really mean that: Please don't make claims
that something is common or useful without giving an *exact* example.
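
To make the CDATA concern concrete, here is a small sketch in
present-day Python 3 (an assumption, as are the sample document and the
use of html.unescape as a stand-in for an entity-converting codec). It
compares blind entity conversion over the raw text with what a real
parser reports.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample: the same characters appear once as ordinary
# escaped text and once inside a CDATA section.
doc = "<root><a>&lt;b&gt;</a><c><![CDATA[&lt;b&gt;]]></c></root>"

# Blind conversion over the whole string, with no parser involved.
naive = html.unescape(doc)

# What a parser sees: entities are resolved in <a>, while the CDATA
# content in <c> is literal text and must stay untouched.
root = ET.fromstring(doc)
parsed_a = root.find("a").text
parsed_c = root.find("c").text
```

The blind pass rewrites the CDATA content as well, so the text of <c>
silently changes from the literal "&lt;b&gt;" to "<b>", and the markup
around <a> is no longer well-formed.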

Regards,
Martin

P.S. This insistence on adding Unicode and string methods makes it
appear as if the author of the codecs module now thinks that its API
sucks.