[Python-Dev] Python-3.0, unicode, and os.environ

Sat Dec 6 16:52:38 CET 2008

Nick Coghlan wrote:
> Toshio Kuratomi wrote:
>>>
>> Nonsense.  A program can do tons of things with a non-decodable
>> filename.  Where it's limited is non-decodable filedata.
> 
> You can't display a non-decodable filename to the user, hence the user
> will have no idea what they're working on. Non-filesystem related apps
> have no business trying to deal with insane filenames.
> 
This is where we disagree.  There are many ways to display the
non-decodable filename to the user because the user is not a machine.
The computer must know the unique sequence of bytes in order to access a
file. The user, OTOH, usually only needs to know that the file exists.
In most GUI-based end-user oriented desktop apps, it's enough to do
str(filename, errors='replace').  For instance, the GNOME file manager
displays:
  "? (Invalid encoding)"
and Konqueror, the KDE file manager just displays:
  "?"

The file can still be displayed this way, accessed via the raw bytes
that the program keeps internally, and operated upon by applications.

For applications in which the user needs more information to
differentiate the files the program has the option to display the raw
byte sequences as if they were the filename.  The *NIX shell and command
line tools have this ability.

$ LANG=en_US.utf8 ls -b
á
í
$ LANG=C ls -b
.
..
\303\241
\303\255
$ mv $'\303\241' $'\303\263'
$ LANG=C ls -b
\303\255
\303\263
$ LANG=en_US.utf8 ls -b
í
ó

> Linux is moving towards a standard of UTF-8 for filenames, and once we
> get to the point where the idea of encoding filenames and environment
> variables any other way is seen as crazy, then the Python 3 approach
> will work seamlessly.
> 
<nod>  With the caveat that I haven't seen movement by Linux and other
Unix variants to enforce UTF-8.  What I have seen are statements by
kernel programmers that having the filesystem use bytes and not know
about encoding is the correct thing to do.

This means that utf-8 will be a convention rather than a necessity for a
very long time and consequently programs will need to worry about the
problems of mixed encoding systems for an equally long time.  (Remember,
encoding is something that can be changed per user and per file.  So on
a multiuser OS, mixed encodings can be out of the control of the system
administrator for perfectly valid reasons.)

> In the meantime, raw bytes APIs will provide an alternative for those
> that disagree with that philosophy.
> 
Oh I agree with the UTF-8 everywhere philosophy.  I just know that
there's tons of real-world systems out there that don't conform to my
expectations for sanity and my code has to account for those :-)

-Toshio

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-dev/attachments/20081206/686029a6/attachment.pgp>