[Python-Dev] Python-3.0, unicode, and os.environ

Fri Dec 12 02:55:52 CET 2008

Steve Holden writes:
 > Ulrich Eckhardt writes:

 > > What I'd just like some feedback on is the approach to return a
 > > distinct type (neither a byte string nor a Unicode string) from
 > > readdir().

This is presumably unacceptable on the grounds that it will break
existing code that does something more or less useful more or less
some of the time.<wink>

 > If you know what your filesystem produces, you can take the appropriate
 > action to convert it into a type that makes sense to the user.

Unfortunately, even programmers experienced in I18N like Martin, and
those with intuition-that-has-the-force-of-law<wink> like Guido,
express deliberate disbelief on this point.  They say that filesystem
names and environment variable values are text, which is true from the
semantic viewpoint but can't be fully supported by any implementation.

The implementation issue is why you want bytes, but I don't think it
is going to overcome the tide of (semantically-oriented) pragmatism.

 > If you don't, then at least if you have the string in its bytes
 > form you can re-present it to the filesystem to manipulate the
 > file. What are we supposed to do with the "special type"?

Trivially convert it back to bytes and re-present it to the
filesystem, of course.

I gather that the BDFL's line on this thread of discussion is that
forcing programmers to think about encodings every time they call out
to the OS is unacceptable when most programs will work acceptably
almost all of the time with a rather naive approach.  This means that
almost all Python programs will be technically broken for the
foreseeable future, sorry, Ulrich.

And for the same pragmatic reasons, these functions are going to
return strings (i.e., Unicode), not bytes, I expect.  Sorry, Steve.

What needs to be determined here is the best way to provide
reliability to those who will go to the effort of asking for it if
it's available.  I don't think "just return bytes" fits the bill for
the reason above.

What I would like to see is a type that is derived from string (so if
you present it to an API expecting string, it is silently treated as
string), but from which the original bytes can always be extracted on
request.  If the original bytes cannot be sensibly decoded to a
string, then the string field in the object would either contain
something that should normally cause an error in a string API, or some
made-up string (presumably it would attempt to be a more or less
faithful representation of the bytes) at the caller's option.
Probably they'd also contain some metadata useful in guessing
encodings (the read time locale in particular).

These objects probably shouldn't support string-like operations in a
general way (i.e., maintaining both the string representation and the
bytes "correctly").  Rather, using "proper" string operations on them
would use the string content and produce strings.  People who really
want to handle mixed-encoding pathnames and the like would have to
keep collections of these objects and handle them in an ad-hoc way.
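
To make the shape of such an object concrete, here is a toy sketch.
It is only an illustration under stated assumptions, not a proposed
implementation: the class name OSString and its attributes (raw,
encoding) are made up, and it implements only the "made-up string"
option, using the codec's "replace" error handler for bytes that
cannot be decoded.

```python
import locale


class OSString(str):
    """Hypothetical str subclass that remembers its original bytes."""

    def __new__(cls, raw, encoding=None, errors="replace"):
        # Assumption: with no explicit encoding, use the locale's
        # preferred encoding at the time the bytes were read.
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        # With errors="replace", undecodable bytes become U+FFFD,
        # i.e. a made-up but printable string.
        text = raw.decode(encoding, errors)
        self = super().__new__(cls, text)
        self.raw = bytes(raw)      # original bytes, always recoverable
        self.encoding = encoding   # metadata for later guessing
        return self


# A Latin-1 encoded name read under a UTF-8 assumption.
name = OSString(b"caf\xe9.txt", encoding="utf-8")

print(isinstance(name, str))   # accepted wherever a str is expected
print(name.raw)                # the bytes to hand back to the OS
print(type(name.upper()))      # string operations yield plain str
```

Note that the result of name.upper() is an ordinary string that has
lost the bytes, which is the behavior described above: anyone who
needs the bytes has to hold on to the original object.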

Unfortunately, implementing this is way beyond my skills and time
availability.