[Python-Dev] transform() and untransform() methods, and the codec registry

Tue Dec 7 05:46:54 CET 2010

On Sun, Dec 5, 2010 at 5:25 PM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> On Saturday 04 December 2010 09:31:04 you wrote:
>> Alexander Belopolsky writes:
>>  > In fact, once the language moratorium is over, I will argue that
>>  > str.encode() and bytes.decode() should deprecate the encoding argument and
>>  > just do UTF-8 encoding/decoding.  Hopefully by that time most people
>>  > will forget that other encodings exist.  (I can dream, right?)
>>
>> It's just a dream.  There's a pile of archival material, often on R/O
>> media, out there that won't be transcoded any more quickly than the
>> inscriptions on Tutankhamun's tomb.
>
> Not only that: many libraries expect bytes arguments encoded in a specific
> encoding (e.g. the locale encoding). Said differently, only a few libraries
> written in C accept wchar_t* strings.
>

My proposal has nothing to do with the C API.  It only concerns the
Python API of the builtin str type.

> The Linux kernel (or many, or all, UNIX/BSD kernels) only manipulates byte
> strings. The libc only accepts wide characters for a few operations. I don't
> know how to open a file with a unicode path with the Linux libc: you have to
> encode it...
>

Yes, but hopefully the encoding used by the filesystem will be UTF-8.
For Python users, however, encoding details will hopefully be hidden
by the open() call.  Yes, I am aware of the many problems with
divining the filesystem encoding, but instructing application
developers to supply their own fsencoding in
open(filepath.encode(fsencoding)) calls is not very helpful.

> Alexander: you should first patch all UNIX/BSD kernels to use unicode
> everywhere, then patch all libc implementations, and then all libraries
> (written in C). After that, you can have a break.
>

As Martin explained later in this thread with respect to the
transform() method, removing the codec argument from the str.encode()
method does not imply removing the codecs themselves.  If I need a
method to encode strings to, say, the koi8_r encoding, I can easily
access it directly:

>>> from encodings import koi8_r
>>> to_koi8_r = koi8_r.Codec().encode
>>> to_koi8_r('код')
(b'\xcb\xcf\xc4', 3)

More likely, however, I will only need en/decoding to read/write
legacy files, and rather than encoding the strings explicitly before
writing them into a file, I will just open that file with the correct
encoding.
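
To illustrate, here is a minimal sketch of letting open() do the
encoding.  The temporary file stands in for a hypothetical legacy
koi8_r file; the names text, raw and decoded are only for this example.

```python
import os
import tempfile

text = 'код'

# A temporary path stands in for a real legacy file (assumption).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # The file object encodes on write; no explicit str.encode() call.
    with open(path, 'w', encoding='koi8_r') as f:
        f.write(text)
    # Look at the raw bytes that ended up on disk.
    with open(path, 'rb') as f:
        raw = f.read()
    # The file object decodes on read as well.
    with open(path, encoding='koi8_r') as f:
        decoded = f.read()
finally:
    os.remove(path)
```

The bytes on disk should match what the Codec().encode call above
produced, and the program itself only ever handles str.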

Having all encodings accessible in a str method only promotes a
programming style where bytes objects can contain differently encoded
strings in different parts of the program.  Instead, well-written
programs should decode bytes on input, do all processing with the str
type, and encode on output.  When strings need to be passed to char* C
APIs, they should be encoded in UTF-8.  Many C APIs originally designed
for ASCII actually produce meaningful results when given UTF-8 bytes.
(Supporting such usage was one of the design goals of UTF-8.)
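
A rough sketch of that decode-at-the-edges style, assuming binary
input and output streams.  The function name shout and the in-memory
BytesIO streams are made up for the example; a real program would
more likely have files or sockets at each end.

```python
import io

def shout(src, dst, src_encoding):
    """Read encoded text from src, upper-case it, write UTF-8 to dst."""
    # newline='' disables newline translation, so the result does not
    # depend on the platform.
    reader = io.TextIOWrapper(src, encoding=src_encoding, newline='')
    writer = io.TextIOWrapper(dst, encoding='utf-8', newline='')
    for line in reader:
        # All processing happens on str, never on bytes.
        writer.write(line.upper())
    writer.flush()
    # Detach so the wrappers do not close the caller's streams.
    reader.detach()
    writer.detach()

src = io.BytesIO(b'\xcb\xcf\xc4\n')   # 'код\n' in koi8_r
dst = io.BytesIO()
shout(src, dst, 'koi8_r')
result = dst.getvalue()
```

The encoding names appear only where the streams are wrapped; the
processing loop in the middle never needs to know them.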