PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 23 19:27:12 EDT 2009

On 22Apr2009 08:50, Martin v. Löwis <martin at v.loewis.de> wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow
| passing arbitrary bytes - whether these conform to a certain encoding
| or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities
| by embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte strings)
interpret those byte strings into Python3 str objects, with a codec
that will accept arbitrary byte sequences losslessly and is totally
reversible, yes?

And, I hope, that the os.* interfaces silently use it by default.
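
To make "losslessly and totally reversible" concrete, here is a minimal
sketch of the round trip. It assumes the error handler name
"surrogateescape", which is the name this mechanism carries in released
Python 3 (the PEP draft quoted in this message calls it "python-escape"),
and it assumes UTF-8 as the encoding purely for illustration. The helper
names to_str and to_bytes are hypothetical.

```python
def to_str(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode(encoding, errors="surrogateescape")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text, turning each U+DCNN escape back into the byte 0xNN."""
    return text.encode(encoding, errors="surrogateescape")


raw = b"caf\xe9.txt"       # Latin-1 bytes; not valid UTF-8
name = to_str(raw)         # the stray byte 0xE9 becomes U+DCE9
restored = to_bytes(name)  # the original byte string comes back
```

The str in the middle is not meaningful text at the escaped positions; it
is only a carrier that lets the original bytes be recovered.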

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, an application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encode them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something.
Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding.

The user's environment suggests a preferred encoding via the locale
stuff, and apps honouring that will make nice looking byte strings as
filenames for that user. (Some platforms, like MacOSX's HFS filesystems,
_do_ enforce an encoding, and a quite specific variety of UTF-8 it is;
I would say they're not a full UNIX filesystem _precisely_ because they
reject certain byte strings that are valid on other UNIX filesystems.
What will your proposal do here? I can imagine it might cope with
existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings.
If they do it in different work areas they'll be perfectly happy;
if they do it in a shared area doubtless confusion will reign,
but only in the users' minds, not in the filesystem.
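
A small sketch of that point: the stored name is one byte string, and
only the reading of it differs between users. The two encodings here
(Latin-1 and UTF-8) and the helper name view are assumptions chosen for
illustration.

```python
# One filename as it sits on disk: the bytes a Latin-1 user would write.
raw = "r\u00e9sum\u00e9".encode("latin-1")


def view(raw_name: bytes, encoding: str) -> str:
    """Show how a user with the given locale encoding would see the name."""
    return raw_name.decode(encoding, errors="replace")


as_latin1 = view(raw, "latin-1")  # reads cleanly for the Latin-1 user
as_utf8 = view(raw, "utf-8")      # the UTF-8 user gets replacement characters
```

Neither call changes raw; the disagreement exists only in the decoded
views.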

If I'm writing a general purpose UNIX tool like chmod or find, I expect
it to work reliably on _any_ UNIX pathname. It must be totally encoding
blind. If I speak to the os.* interface to open a file, I expect to hand
it bytes and have it behave. As an explicit example, I would be just fine
with python's open(filename, "w") taking a string and encoding it for use,
but _not_ ok with os.open() requiring me to supply a string, cross
my fingers and hope something sane happens when it is turned into bytes
for the UNIX system call.

I'm very much in favour of being able to work in strings for most
purposes, but if I use the os.* interfaces on a UNIX system it is
necessary to be _able_ to work in bytes, because UNIX file pathnames
are bytes.

If there isn't a byte-safe os.* facility in Python3, it will simply be
unsuitable for writing low level UNIX tools. And I very much like using
Python2 for that.
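
For reference, a sketch of the kind of byte-safe use meant here. It
relies on behaviour of released Python 3, where os functions given a
bytes path work in bytes and os.listdir() returns bytes names. It also
assumes a POSIX filesystem that accepts arbitrary bytes in names (so not
HFS). The helper name list_raw_names is hypothetical.

```python
import os
import tempfile


def list_raw_names(directory: bytes) -> list[bytes]:
    """Return directory entries as raw bytes, with no decoding step."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)   # work with a bytes path from here on
    raw_name = b"caf\xe9.txt"  # not valid UTF-8
    # Create the file through the low-level interface, passing bytes.
    fd = os.open(os.path.join(tmp_b, raw_name), os.O_WRONLY | os.O_CREAT, 0o600)
    os.close(fd)
    names = list_raw_names(tmp_b)
    # Hand the listed name straight back to another os.* call.
    size = os.stat(os.path.join(tmp_b, names[0])).st_size
```

No encoding is consulted for the filename itself: the bytes given to
os.open() are the bytes os.listdir() reports.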

Finally, I have a small python program whose whole purpose in life
is to transcode UNIX filenames before transfer to a MacOSX HFS
directory, because of HFS's enforced particular encoding. What approach
should a Python app take to transcode UNIX pathnames under your scheme?
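
To show the kind of transcoding in question, here is a rough sketch
working purely in bytes. It is not the actual program: the source
encoding (Latin-1), the use of Unicode NFD as an approximation of the
decomposed UTF-8 form HFS wants, and the function names are all
assumptions.

```python
import unicodedata


def transcode_name(raw: bytes, src_encoding: str = "latin-1") -> bytes:
    """Re-encode one path component as decomposed (NFD) UTF-8."""
    text = raw.decode(src_encoding)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("utf-8")


def transcode_path(raw_path: bytes, src_encoding: str = "latin-1") -> bytes:
    """Transcode each component of a '/'-separated bytes path."""
    parts = raw_path.split(b"/")
    return b"/".join(transcode_name(part, src_encoding) for part in parts)
```

The caller has to supply the source encoding, since nothing in the
filesystem records it.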

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

The nice thing about standards is that you have so many to choose from;
furthermore, if you do not like any of them, you can just wait for next
year's model.   - Andrew S. Tanenbaum