[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Michael Urman murman at gmail.com
Sat Apr 25 18:18:20 CEST 2009


On Sat, Apr 25, 2009 at 10:00, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> On decoding, there is a guarantee that it decodes successfully. There is
> also a guarantee that the result will re-encode successfully, and yield
> the same byte string.
>
> If you pass a different string into encoding, you still may get
> exceptions. For example, if the filesystem encoding is latin-1,
> passing u"\u20ac" will continue to raise exceptions, even under the
> python-escape error handler - that error handler will only handle
> surrogates.

One angle I've not seen discussed yet is a set of use cases. While the
PEP addresses the need for the python developer to not have to write
insane conditional code that maps between bytes and str depending on
the platform, it doesn't talk about what this allows an application to
provide to a user, and at what risks.

I see two main user-oriented use cases for the resulting Unicode
strings this PEP will produce on all systems: displaying a list of
filenames for the user to select from (an open file dialog), and
allowing a user to edit or supply a filename (a save dialog or a
rename control).

It's clear what this PEP provides for the former. On well-behaved
systems where a simpler filesystemencoding approach would work, the
results are identical; the user can select filenames that are what he
expects to see on both Unix and Windows. On less well-behaved systems,
some characters may appear as junk in the middle of the name (or would
they be invisible?), but should be recognizable enough to choose, or
at least to open sequentially and remember what the last one was. On
particularly poorly behaved systems, the results will be extremely
difficult to read, but no approach is likely to fix this.

What I don't find clear is what the risks are for the latter. On the
less well behaved system, a user may well attempt to use this python
application to fix filenames. Can we estimate a likelihood that edits
to the names would result in a Unicode string that can no longer be
encoded with the python-escape? Will a new name fully provided by a
user on his keyboard (ignoring copy and paste) almost always safely
encode?

-- 
Michael Urman


More information about the Python-Dev mailing list