[Python-Dev] Filename as byte string in python 2.6 or 3.0?

glyph at divmod.com glyph at divmod.com
Mon Sep 29 13:28:51 CEST 2008


On 10:50 am, eckhardt at satorlaser.com wrote:
>On Sunday 28 September 2008, Gregory P. Smith wrote:
>>"broken" systems will always exist.  Code to deal with them must be
>>possible to write in python 3.0.

>>since any given path (not just fs) can have its own encoding it makes
>>the most sense to me to let the OS deal with the errors and not try to
>>enforce bytes vs string encoding type at the python lib. level.

>Actually I'm afraid that that isn't really useful. I, too, would like 
>to kick
>peoples' back in order to get the to fix their systems or use the 
>proper
>codepage while mounting etc, etc, but that is not going to happen soon. 
>Just
>ignoring those broken systems is tempting, but alienating a large group 
>of
>users isn't IMHO worth it.
>
>Instead, I'd like to present a different approach:

>1. For POSIX platforms (using a byte string for the path):
>Here, the first approach is to convert the path to Unicode, according 
>to the
>locale's CTYPE category. Hopefully, it will be UTF-8, but also 
>codepages
>should work. If there is a segment (a byte sequence between two path
>separators) where it doesn't work, it uses an ASCII mapping where 
>possible
>and codepoints from the "Private Use Area" (PUA) of Unicode for the
>non-decodable bytes.

>In order to pass this path to fopen(), each segment would be converted 
>to a
>byte string again, using the locale's CTYPE category except for 
>segments
>which use the PUA where it simply encodes the original bytes.

That's a cool idea, but this encoding hack would need to be clearly 
documented and exposed for when you need to talk to another piece of 
software about pathnames.  Consider a Python implementation of "xargs". 
Right now this can be implemented as a pretty simple for loop which 
eventually invokes 'subprocess.call' or similar.

http://docs.python.org/dev/3.0/library/os.html#process-management 
doesn't say what the type of the arguments to the various 'exec' 
variants are - one presumes they'd have to be bytes.  Not all arguments 
to subprocesses need to be filenames, but when they are they need to be 
encoded appropriately.

Also, consider the following nightmare scenario: a system which has two 
users with incompatible locales.  One wishes to write a "text" (ha ha) 
file with a list of pathnames in it to share with the other.  What 
encoding should that file be in?  How should the other user know how to 
interpret it?  (And of course: what if that user is going to be piping 
that file to "xargs", or the original file came out of "find"?)  I don't 
think that you can do encoding a segment at a time here, at least not at 
the API level; however, the whole file could be written in the py-posix- 
paths encoding which does exactly what you propose.
>2. For win32 platforms, the path is already Unicode (UTF-16) and the 
>whole
>problem is solved or not solved by the OS.

If the "or not solved" part of that is true then this probably bears 
further investigation.  I suspect that the OS *always* provides some 
solution, even if it's the wrong solution, though.

Also, what about MacOS X?
>In the end, both approaches yield a path represented by a Unicode 
>string for
>intermediate use, which provides maximum flexibility. Further, it
>preserves "broken" encodings by simply mapping their byte-values to the 
>PUA
>of Unicode. Maybe not using a string to represent a path would be a 
>good
>idea, too. At least it would make it very clear that the string is not
>completely free-form.

Personally, I plan to use this:

http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html

for all of my file I/O in the future.

For what it's worth, this object _doesn't_ handle unicode properly and 
it's been a thorn in our side for quite a while.  We have plans to 
implement some kind of unicode-friendly API which is compatible with 
2.6; if we have any brilliant ideas I'll let you know, but I doubt 
they'll be in time.

The general idea right now is that we'll keep around the original bytes 
returned from filesystem inspection and provide some context-sensitive 
encoding/decoding APIs for different applications.  The PUA approach 
would allow us to maintain an API compatible with that.  I would not 
actually mind if there were a POSIX-specific module we had to use to get 
every arcane nuance of brokenness of writing pathnames into text files 
to be correct, since Windows needs to come up with _some_ valid unicode 
filename for every file in the system (even if it's improperly decoded).


More information about the Python-Dev mailing list