[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Fri Oct 10 14:16:21 CEST 2008

On Fri, Oct 10, 2008 at 1:31 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/9/2008 11:55 PM, came the following characters from the
> keyboard of Stephen J. Turnbull:
>> The problem that all the proposals face is that they assume that we
>> know where the cleaning up will be done, and that we're in control of
>> the code that will have to do it.
>
>
> I think this is your expression of "Applications that do XXX may need
> modification to handle all files" :)
>
> The object wrapper gives us the right control, but likely forces more
> changes to applications than the other schemes.  BDFL has chosen scheme 2,
> it seems, unless he changes his mind.  It has the advantages that few or no
> code changes are necessary to handle files that have Unicode names, and
> applications that want to handle files with non-Unicode names can, but have
> to work harder.  If Python had come with a file path manipulation object
> from the beginning, (3) might be a better scheme, but, as much as I like and
> wish for scheme (3), scheme (2) has a better migration story, and scheme (1)
> basically only solves some of the problems some of the time, and can cause
> other problems due to data puns (although the chances of doing so are
> somewhat low, and approach zero in my environment, and likely in many
> environments... but then in my environment, and likely in many environments,
> they also don't actually solve any problems either, so I'd be just as well
> off without it).

There's a spectrum of choices, depending on how soon you want the API to fail:
* bytes/unicode distinct APIs.  The unicode API never fails, but it does
  skip undecodable names.
* bytes/unicode automatic.  Returns bytes for invalid names; fails when
  those are concatenated to unicode strings.
* invalid unicode.  Works internally, but fails when exposed to external APIs.
* FilePath object.  I can't see how this differs from invalid unicode.
* transformed unicode.  Works internally, can be round-tripped through
  external APIs, but fails if those external APIs touch the filesystem.
  Also breaks valid file names.
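
To make the second option concrete, here is a rough sketch of where the
failure would show up.  The helper automatic_names() is hypothetical (it is
not an existing API), and it assumes the filesystem encoding is UTF-8 and
that the raw names arrive as bytes:

```python
def automatic_names(raw_names, encoding="utf-8"):
    """Hypothetical helper: decode each name if possible, else keep the bytes."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Invalid name: hand back the original bytes unchanged.
            result.append(raw)
    return result


# b"caf\xe9.txt" is Latin-1 encoded, so it is not valid UTF-8.
names = automatic_names([b"ok.txt", b"caf\xe9.txt"])

prefixed = []
failures = []
for name in names:
    try:
        # Mixing str and bytes raises TypeError in Python 3.
        prefixed.append("dir/" + name)
    except TypeError:
        failures.append(name)
```

The listing itself succeeds; the error only appears later, at whatever
point the application first combines the bytes name with a unicode string.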

Since none of the options eliminate failure (and none can, short of
universally redefining UTF-8 or making the filesystem validate the
encoding), we instead pick the lesser evil.  Although the first option
does skip file names, it turns out to be the least surprising and
least magical.  Indeed, it's the only option that never fails while
listing directory contents!
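
As an illustration only, the skipping behaviour of the first option could
be sketched like this.  unicode_names() is a made-up name, UTF-8 is an
assumed filesystem encoding, and the real listing code may well differ:

```python
def unicode_names(raw_names, encoding="utf-8"):
    """Hypothetical helper: keep only the names that decode cleanly."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Undecodable name: skipped silently, never raised.
            continue
    return names


raw = [b"ok.txt", b"caf\xe9.txt", "caf\u00e9.txt".encode("utf-8")]

byte_view = list(raw)            # the bytes API sees every entry
text_view = unicode_names(raw)   # the unicode API sees only valid names
```

An application that needs every file asks for bytes; one that only asks
for unicode gets a shorter list, but never an exception.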

-- 
Adam Olsen, aka Rhamphoryncus