[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Adam Olsen rhamph at gmail.com
Tue Sep 30 01:50:32 CEST 2008


On Mon, Sep 29, 2008 at 5:33 PM, James Y Knight <foom at fuhm.net> wrote:
> On Sep 29, 2008, at 7:23 PM, Adam Olsen wrote:
>>
>> An ugly hack, but more correct than UTF-8b or any similar attempt to
>> do "unicode but not quite unicode"; either it's lossy, or it's not
>> unicode.  There's no in between.
>
> Promoting the use of 8859-1 to decode mostly-utf-8 data seems like a very
> poor way forward. I don't see how you can claim it's more correct. It's
> correct in no case except for pure ASCII on a utf-8 system.

It's correct in the sense that it can roundtrip all filenames.  UTF-8b
is lossy, so certain filenames are not roundtripped properly.

It doesn't let you print correctly, but neither would an API that
returns bytes.  8859-1 is just a hack for when you want bytes, but the
API only allows unicode.
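
To make the roundtrip claim concrete, here is a minimal sketch of the
8859-1 hack. The helper names to_text and to_bytes and the sample
filenames are made up for illustration; only the built-in "latin-1" and
"utf-8" codecs are assumed.

```python
# Sketch of the 8859-1 hack: carry raw filename bytes through a str-only API.
# Each byte 0x00-0xFF maps to the code point with the same number, so the
# mapping is reversible for every possible byte string.

def to_text(raw: bytes) -> str:
    """Wrap raw bytes in a str without losing information."""
    return raw.decode("latin-1")

def to_bytes(name: str) -> bytes:
    """Recover the original bytes from a str made by to_text."""
    return name.encode("latin-1")

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
assert to_bytes(to_text(all_bytes)) == all_bytes

# A hypothetical filename that is not valid UTF-8: strict decoding fails.
raw_name = b"caf\xe9-\xff.txt"
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The cost: a name that really is UTF-8 does not print correctly.
utf8_name = "caf\u00e9".encode("utf-8")
mojibake = to_text(utf8_name)  # two Latin-1 characters instead of one e-acute
```

The last two lines show the trade-off: the bytes are preserved, but the
str is only a container for them and is not meant for display.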


> I still like the UTF-8b proposal, but if you want to push against that, I
> don't see any sensible alternative but to move back towards a bytestring
> API. Having two parallel APIs or a mixture of data types is confusing, so,
> just toss the Unicode APIs entirely. That'd be much much nicer than having
> everyone use 8859-1, incorrectly, for their platform encoding.

As a user, I expect all file names to be printable.  That requires
unicode, and any program that creates filenames with arbitrary
bytestrings is just broken.  Not all operating systems enforce this
yet, but an API that returns only bytes would force us to decode
explicitly in the 99% of cases where we'd happily assume the name is
correct unicode.

I'd rather the 1% of cases that need to handle bad file names make an
explicit effort to do so, via alternate byte APIs or (if necessary)
the 8859-1 hack.
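
As a sketch of what the explicit opt-in could look like: in Python 3 as
it later shipped, os.listdir returns bytes names when it is given a
bytes path. The helper name list_names_as_bytes and the file name
plain.txt are hypothetical, and os.fsencode assumes Python 3.2 or later.

```python
import os
import tempfile

def list_names_as_bytes(directory: str) -> list:
    """Return directory entries as raw bytes, with no decoding step."""
    # Passing a bytes path makes os.listdir return bytes entries.
    return os.listdir(os.fsencode(directory))

# Usage: the default str API for the common case, bytes only on request.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "plain.txt"), "w").close()
    text_names = os.listdir(d)         # list of str
    names = list_names_as_bytes(d)     # list of bytes
```

Code that never sees bad file names keeps using the str form; only the
code that must cope with them pays for the extra step.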


> On Windows, the platform-native Unicode strings could simply be encoded into
> utf-8 when entering Python-land, and decoded back to Unicode when leaving
> pythonland, to keep the API consistently bytestring oriented on both
> platforms.


-- 
Adam Olsen, aka Rhamphoryncus

