[Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue

Wed Oct 1 09:54:47 CEST 2008

On Tuesday 30 September 2008, M.-A. Lemburg wrote:
> On 2008-09-30 08:00, Martin v. Löwis wrote:
> >> Change the default file system encoding to store bytes in Unicode is
> >> like introducing a new Python type: <fake Unicode for filename hacks>.
> >
> > Exactly. Seems like the best solution to me, despite your polemics.
>
> Not a bad idea... have os.listdir() return Unicode subclasses that work
> like file handles, ie. they have an extra buffer that holds the original
> bytes value received from the underlying C API.

Why does it have to be a Unicode subclass? In my eyes, a Unicode object 
promises a few things, in particular that it contains a Unicode string. If it 
now suddenly contains bytes without any further meaning, that would be bad.

What I wonder is what the requirements on path handling are. I'll try to list 
the ones I can see:

1. A path received from the system should be preserved, so it can be given to
the system later on. IOW, the internal representation should not lose any
information compared to the one used by the OS.

2. Typical operations like joining two path segments or moving to the parent 
dir should be defined.

3. There must be a way to display the path to the user. IOW, there should be a 
way to turn the path into a string that the user can recognise, according to 
some encoding. Note that this is not always possible, so this can fail.

4. There must be a way to receive a path from the user. That means that there
must be a way to get from a user-entered string to a path. Note that this,
too, isn't always possible and can fail.

5. The conversion between a string and a path should be configurable, with
defaults retrieved from the system. This is so that most operations will just
work and do the thing that the user expects.

6. There should be a way to modify the path data itself. This of course 
requires knowledge about the internals but gives full power to the 
programmer.
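
To make requirements 1, 2 and 6 a bit more concrete, here is a minimal sketch
of such a path type. Everything in it is an assumption for illustration only:
the name RawPath, the method names, the use of bytes as the OS representation
and the POSIX-style "/" separator. It is not a proposal for an actual API.

```python
class RawPath:
    """Hypothetical path type that keeps the exact bytes received from the OS."""

    SEP = b"/"  # assumption: POSIX-style separator

    def __init__(self, raw):
        if not isinstance(raw, bytes):
            raise TypeError("RawPath expects bytes")
        self._raw = raw

    @property
    def raw(self):
        # Requirements 1 and 6: the unmodified OS-level data stays accessible.
        return self._raw

    def join(self, segment):
        # Requirement 2: joining works on bytes, so no decoding is needed.
        return RawPath(self._raw.rstrip(self.SEP) + self.SEP + segment)

    def parent(self):
        # Requirement 2: move to the parent directory, again purely on bytes.
        stripped = self._raw.rstrip(self.SEP) or self._raw
        head, sep, _ = stripped.rpartition(self.SEP)
        return RawPath(head or sep or b".")


# b"caf\xe9.txt" is not valid UTF-8, yet it survives joining untouched.
p = RawPath(b"/home").join(b"caf\xe9.txt")
```

Since nothing here ever decodes the data, a name that is undecodable in the
current locale can still be handed back to the system byte for byte.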

For requirement 3, I would say a lossy conversion to a string would be enough,
i.e. try to convert the path to a Unicode string and use a question mark or
some escaping to mark the parts that can't be decoded. This lets users
recognise the decodable parts of the path, with hopefully just a few
characters left undecoded.
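
A minimal sketch of such a lossy conversion, using the standard codec error
handlers. The function name display_path and the UTF-8 default are assumptions
for illustration; "backslashreplace" on decoding needs Python 3.5 or later.

```python
def display_path(raw, encoding="utf-8", errors="replace"):
    """Turn OS-level path bytes into a string meant for display only.

    errors="replace" marks undecodable bytes with U+FFFD,
    errors="backslashreplace" shows them as \\xNN escapes instead.
    """
    return raw.decode(encoding, errors=errors)


marked = display_path(b"caf\xe9.txt")
escaped = display_path(b"caf\xe9.txt", errors="backslashreplace")
```

The result is for the user's eyes only. It cannot be turned back into the
original path in general, which is why the original data has to be kept
alongside it.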

For requirement 4, a failure to encode a string to a path must result in a
loud failure, i.e. an exception. This is because the user entered a path that
we can't use; any guessing at what the user might have wanted is futile.
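
Sketched in code, assuming a hypothetical helper path_from_user that takes
its default encoding from the system (requirement 5) and lets the caller
override it:

```python
import sys


def path_from_user(text, encoding=None):
    """Encode a user-entered string to path bytes, failing loudly.

    The default encoding comes from the system; errors="strict" makes
    any unencodable character raise UnicodeEncodeError instead of
    being silently replaced.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return text.encode(encoding, errors="strict")


ok = path_from_user("caf\u00e9.txt", encoding="latin-1")
```

With encoding="ascii" the same input raises UnicodeEncodeError, so the caller
has to deal with the problem instead of silently opening a different file.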

Are there any points to add?

Uli
