[Python-Dev] PEP 383 (again)

Tue Apr 28 09:30:01 CEST 2009

> > Therefore, when Python encounters path names on a file system
> > that are not consistent with the (assumed) encoding for that file
> > system, Python should raise an error.
>
> This is what happens currently, and users are quite unhappy about it.

We need to keep "users" and "programmers" distinct here.

Programmers may find it inconvenient that they have to spend time figuring
out and deal with platform-dependent file system encoding issues and
errors.  But internationalization and unicode are hard, that's just a fact
of life.

End users, however, are going to be quite unhappy if they get a string of
gibberish for a file name because you decided to interpret some non-Unicode
string as UTF-8-with-extra-bytes.

Or some Python program might copy files from an ISO8859-15 encoded file
system to a UTF-8 encoded file system, and instead of getting an error when
the encodings are set incorrectly, Python would quietly create ISO8859-15
encoded file names, making the target file system inconsistent.

There is a lot of potential for major problems for end users with your
proposals.  In both cases, what should happen is that the end user gets an
error, submits a bug, and the programmer figures out how to deal with the
encoding issues correctly.

> Yes, users can do that (to a degree), but they are still unhappy about
> it. The approach actually fails for command line arguments

As it should: if I give an ISO8859-15 encoded command line argument to a
Python program that expects a UTF-8 encoding, the Python program should tell
me that there is something wrong when it notices that.  Quietly continuing
is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an
escaped unicode string inside Python.  If I understand your proposal
correctly, if it's a output file name and gets passed to Python's open
function, Python will then decode that string and end up with an ISO8859-15
byte sequence, which it will write to disk literally, even if the encoding
for the system is UTF-8.   That's the wrong thing to do.

As is, these interfaces are incomplete - they don't support command
> line arguments, or environment variables. If you want to complete them,
> you should write a PEP.

There's no point in scratching when there's no itch.

Tom

PS:

> Quietly escaping a bad UTF-8 encoding with private Unicode characters is
> > unlikely to be the right thing
>
> And indeed, the PEP stopped using PUA characters.

Let me rephrase this: "quietly escaping a bad UTF-8 encoding is unlikely to
be the right thing"; it doesn't matter how you do it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20090428/cf12222f/attachment.htm>