[Python-Dev] PEP 383 (again)

Tue Apr 28 14:47:40 CEST 2009

On Tue, 28 Apr 2009 at 09:30, Thomas Breuel wrote:
>>> Therefore, when Python encounters path names on a file system
>>> that are not consistent with the (assumed) encoding for that file
>>> system, Python should raise an error.
>>
>> This is what happens currently, and users are quite unhappy about it.
>
> We need to keep "users" and "programmers" distinct here.
>
> Programmers may find it inconvenient that they have to spend time figuring
> out and dealing with platform-dependent file system encoding issues and
> errors.  But internationalization and Unicode are hard, that's just a fact
> of life.

And most programmers won't do it, because most programmers write for
an English-speaking audience and have no clue about Unicode issues.
That is probably slowly changing, but it is still true, I think.

> End users, however, are going to be quite unhappy if they get a string of
> gibberish for a file name because you decided to interpret some non-Unicode
> string as UTF-8-with-extra-bytes.

No, end users expect the gibberish, because they get it all the time
(at least on Unix) when dealing with international filenames.  They
expect to be able to manipulate such files _despite_ the gibberish.
(I speak here as an end user who does this!!)

> Or some Python program might copy files from an ISO8859-15 encoded file
> system to a UTF-8 encoded file system, and instead of getting an error when
> the encodings are set incorrectly, Python would quietly create ISO8859-15
> encoded file names, making the target file system inconsistent.

As will almost all Unix programs, and the Unix OS itself.  On Unix,
you can't make the file system inconsistent by doing this, because
filenames are just byte strings with no NUL bytes.
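
To make the "just byte strings" point concrete, here is a minimal sketch
of the check that matters for a single path component on a typical Unix
file system.  The helper names are hypothetical, and the extra rule about
'/' (the path separator) is an addition for completeness; the paragraph
above mentions only NUL.  No character encoding is consulted anywhere.

```python
def is_valid_unix_name(name: bytes) -> bool:
    """Return True if `name` can serve as one path component.

    Hypothetical helper for illustration: the only checks are
    "not empty", "no NUL byte" and "no '/' separator".
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` happens to decode as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same human-readable name, encoded two different ways.
latin_name = "caf\u00e9.txt".encode("iso8859-15")  # single byte 0xE9 for the accented e
utf8_name = "caf\u00e9.txt".encode("utf-8")        # two bytes 0xC3 0xA9 for the accented e

# Both are acceptable file names, but only one of them is valid UTF-8.
both_acceptable = is_valid_unix_name(latin_name) and is_valid_unix_name(utf8_name)
latin_is_utf8 = is_valid_utf8(latin_name)
```

Under these assumptions the ISO8859-15 name is a perfectly good file
name even though it is not valid UTF-8, which is why such names end up
on disk in the first place.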

How _does_ Windows handle this?  Would a Windows program complain, or
would it happily record the gibberish?  I suspect the latter, but
I don't use Windows so I don't know.

> There is a lot of potential for major problems for end users with your
> proposals.  In both cases, what should happen is that the end user gets an
> error, submits a bug, and the programmer figures out how to deal with the
> encoding issues correctly.

What would actually happen is that the user would abandon the program
that didn't work for one (not written in Python) that did.  If the
programmer was lucky they'd get a bug report, which they wouldn't
be able to do anything about since Python wouldn't be providing the
tools to let them fix it (i.e., there are currently no bytes interfaces
for environ or the command line in Python 3).

>> Yes, users can do that (to a degree), but they are still unhappy about
>> it. The approach actually fails for command line arguments
>
> As it should: if I give an ISO8859-15 encoded command line argument to a
> Python program that expects a UTF-8 encoding, the Python program should tell
> me that there is something wrong when it notices that.  Quietly continuing
> is the wrong thing to do.

Imagine you are on a Unix system, and you have gotten from somewhere a
file whose name is encoded in something other than UTF-8 (I have a
number of those on my system).  Now imagine that you want to run a Python
program against that file, passing the name in on the command line.
You type the program name, the first few (non-mangled) characters, and hit
tab for completion, and your shell automagically puts the escaped bytes
onto the command line.  Or perhaps you cut and paste from an 'ls' listing
into a quoted string on the command line.

Python is now getting the mangled filename passed in on the command
line, and if the Python program can't manipulate that file like any
other file on your disk you are going to be mightily pissed.
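
A minimal sketch of what that looks like from the program's side,
assuming the `surrogateescape` error handler that PEP 383 proposes (it
is available in later Python 3 releases).  The function names and the
sample argument are hypothetical, and this illustrates the mechanism
only, not the interpreter's actual start-up code.

```python
def decode_arg(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw argument, escaping undecodable bytes instead of failing."""
    return raw.decode(encoding, errors="surrogateescape")


def encode_arg(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse decode_arg(): escaped bytes come back out unchanged."""
    return text.encode(encoding, errors="surrogateescape")


# What the shell would hand over for an ISO8859-15 encoded name.
raw_arg = b"caf\xe9.txt"

# Strict decoding is the "raise an error" behaviour argued for above.
try:
    raw_arg.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding yields a str with one lone surrogate standing in for
# the bad byte; it prints as gibberish but still identifies the file.
as_text = decode_arg(raw_arg)
round_tripped = encode_arg(as_text)
```

The point of the round trip is that `round_tripped` equals `raw_arg`,
so the program can still open, rename, or delete the file even though
it cannot display the name properly.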

This is the _reality_ of current Unix systems, like it or not.  The same
apparently applies to Windows, though in that case the mangled names may
be fewer and you tend to pick them from a GUI rather than do
cut-and-paste or tab completion.

> If we follow your approach, that ISO8859-15 string will get turned into an
> escaped unicode string inside Python.  If I understand your proposal
> correctly, if it's an output file name and gets passed to Python's open
> function, Python will then decode that string and end up with an ISO8859-15
> byte sequence, which it will write to disk literally, even if the encoding
> for the system is UTF-8.   That's the wrong thing to do.

Right.  Like I said, that's what most (almost all) Unix/Linux programs
_do_.
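
The two behaviours being contrasted can be sketched side by side.  This
assumes the `surrogateescape` handler from PEP 383; both function names
are hypothetical, and the second one only works if the programmer
already knows the source encoding, which is exactly the information
that is usually missing.

```python
def passthrough_name(raw: bytes, fs_encoding: str = "utf-8") -> bytes:
    """What most Unix tools effectively do: the bytes go out as they came in."""
    text = raw.decode(fs_encoding, errors="surrogateescape")
    return text.encode(fs_encoding, errors="surrogateescape")


def transcode_name(raw: bytes, source_encoding: str,
                   target_encoding: str = "utf-8") -> bytes:
    """Re-encode a name when the source encoding is known for certain."""
    return raw.decode(source_encoding).encode(target_encoding)


# An ISO8859-15 encoded name arriving on a system that assumes UTF-8.
source_name = b"caf\xe9"

copied_as_is = passthrough_name(source_name)
copied_transcoded = transcode_name(source_name, "iso8859-15")
```

Here `copied_as_is` is still the ISO8859-15 byte sequence, while
`copied_transcoded` is the UTF-8 form of the same name.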

Now, in some future world where everyone (including Windows) acts like
we are hearing OS X does and rejects the garbled encoding _at the OS
level_, then we'd be able to trust the file system encoding (for some
definition of trust) and there would be no need for this PEP or any
similar solution.

--David