[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Sat Oct 4 08:57:36 CEST 2008

On Fri, Oct 3, 2008 at 10:14 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 4:54 PM, came the following characters from the
> keyboard of Adam Olsen:
>> On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman <v+python at g.nevcal.com>
>> wrote:
>
> OK, so UTF-8b is not Unicode, either.  It's just garbage.  You can't have it
> both ways.

I've always said UTF-8b wasn't valid.

>>> Seems like attempts to manipulate and transform names are doomed to
>>> failure; the approach of having a bytes level interface seems to be
>>> the correct one, glad that seems to be the approach that Victor is
>>> implementing and Guido is favoring, although it is a pity that it
>>> can't be fully encapsulated into an object in time for 3.0, leaving
>>> us with multiple APIs for file access, and a potential future
>>> translation to an encapsulated object approach.
>>>
>>
>> The bytes object covers 90% of the raw usage.  The other 10% is a
>> lossy decoding to unicode.  I much prefer that to be explicit, so a
>> method call may do, say b.decode('UTF-8', 'replace')?  Or do we need a
>> subtype of bytes, just to reduce that to 5-8 characters?
>>
>
> I don't understand what you mean here... Victor/Guido's plan results in:
>
> Alternative 1:  Windows only programs can use the Python Unicode file
> interfaces, Posix programs can take a chance, and also use them (one stab at
> semi-portability, if people don't need access to weirdly named files).

Windows programs using non-validating unicode APIs will be exposed to
random exceptions when they use a validating unicode API.  Better to
validate everything early, where you can expect the failures.
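
As a rough illustration of "validate early", here is a minimal sketch
in Python 3. The helper name validate_name and the sample names are
hypothetical, and it assumes the ill-formed input is a str carrying a
lone surrogate, which a strict UTF-8 encode rejects:

```python
def validate_name(name: str) -> str:
    """Return name unchanged if it is well-formed Unicode, else raise ValueError."""
    try:
        # Strict error handling is the default; a lone surrogate such as
        # U+D800 cannot be encoded and raises UnicodeEncodeError.
        name.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError(f"ill-formed file name: {name!r}") from exc
    return name


# Well-formed names pass through untouched.
checked = validate_name("report.txt")

# An ill-formed name fails here, at the boundary, rather than at some
# later point where a validating API happens to be called.
try:
    validate_name("bad\ud800name.txt")
    rejected = False
except ValueError:
    rejected = True
```

The point of the sketch is only where the failure happens: once, at
the boundary, instead of at whichever later call happens to validate.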

Posix programs SHOULD take a chance.  It's much easier to deal with
pure unicode, and some things can only be done that way (such as
getting file names from the user through a GUI).

> Alternative 2: Posix only programs can use the Python bytes file interfaces
> and get all the files, but can't necessarily display them, except in lossy
> Unicode or hex, or by pretending they are Latin-1, or whatever they want to
> do, but they can't assume UTF-8, unless it happens to work.  Windows
> programs can use the bytes interface (another stab at semi-portability), if
> people don't need access to files named using Unicode characters not in the
> program's current code page.

Can't display them, can't export them.  'tis fun!

> Alternative 3: Portable programs use the Unicode file interfaces on Windows,
> and the bytes file interfaces on Posix, and deal with the differences, as
> described for Windows only in alternative 1 and Posix only in alternative 2.
>
> Alternative 4: Someone implements an object that does alternative 3 under
> the covers, and everyone will wish Alternatives 1 & 2 didn't even exist.
> The only reasons not to do this seem to be (a) Python 2.6 is already
> released and doesn't have it, (b) Python 3.0 would slip its schedule even
> more, (c) it's a significant chunk of code to implement and get right in a
> hurry.

Nope, not possible.  The closest we can get is "bytes with implicit
conversion to unicode", but (a) implicit conversion is much less
maintainable (Zen of Python, etc.), and (b) it STILL doesn't work.  You
still can't round-trip a bad file name through a unicode API.
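
To make the round-trip problem concrete, here is a minimal sketch in
Python 3. The file name and the helper names lossy_display and
round_trips are made up for illustration; the name is Latin-1 encoded,
so it is not valid UTF-8:

```python
# Hypothetical name as the file system might return it: "caf\xe9.txt"
# is Latin-1, and the byte 0xE9 is not a valid UTF-8 sequence here.
raw_name = b"caf\xe9.txt"


def lossy_display(name: bytes) -> str:
    """Explicit lossy conversion: undecodable bytes become U+FFFD."""
    return name.decode("utf-8", "replace")


def round_trips(name: bytes) -> bool:
    """True if decoding and re-encoding gives back the original bytes."""
    return lossy_display(name).encode("utf-8") == name


text = lossy_display(raw_name)

print(repr(text))                 # the bad byte shows up as '\ufffd'
print(round_trips(raw_name))      # False: the original byte is gone
print(round_trips(b"plain.txt"))  # True: valid UTF-8 survives intact
```

The decoded string is fine for display, but passing it back to the
file system names a different file (or none at all), which is why the
bytes have to be kept around if the file is ever to be opened again.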

You have the file system and the user/libraries, and never the twain shall meet.

-- 
Adam Olsen, aka Rhamphoryncus