[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Mon Sep 29 13:16:11 CEST 2008

On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> On Sunday 28 September 2008, Gregory P. Smith wrote:
>> "broken" systems will always exist.  Code to deal with them must be
>> possible to write in python 3.0.
>>
>> since any given path (not just fs) can have its own encoding it makes
>> the most sense to me to let the OS deal with the errors and not try to
>> enforce bytes vs string encoding type at the python lib. level.
> 
> Actually I'm afraid that that isn't really useful. I, too, would like to kick 
> people's backsides in order to get them to fix their systems or use the proper 
> codepage while mounting etc, etc, but that is not going to happen soon. Just 
> ignoring those broken systems is tempting, but alienating a large group of 
> users isn't IMHO worth it.
> 
> Instead, I'd like to present a different approach:
> 
> 1. For POSIX platforms (using a byte string for the path):
> Here, the first approach is to convert the path to Unicode, according to the 
> locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages 
> should work. If there is a segment (a byte sequence between two path 
> separators) where it doesn't work, it uses an ASCII mapping where possible 
> and codepoints from the "Private Use Area" (PUA) of Unicode for the 
> non-decodable bytes.
> In order to pass this path to fopen(), each segment would be converted to a 
> byte string again, using the locale's CTYPE category except for segments 
> which use the PUA where it simply encodes the original bytes.

I'm not sure how this would work. How would you map the private use
code points back to bytes ? Using a special codec that knows about
these code points ? How would the fopen() know to use that special
codec instead of e.g. the UTF-8 codec ?

BTW: Private use areas in Unicode are meant for e.g. company specific
code points. Using them for escaping purposes is likely to cause problems
due to assignment clashes.
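
To make the questions above concrete, here is a minimal sketch of one
possible reading of the proposed scheme. The names PUA_BASE,
decode_segment() and encode_segment() are hypothetical, as is the choice
of U+E000 as the offset and UTF-8 as the locale encoding; none of them
come from the proposal. In this reading, the only signal for "use the
special codec" is the presence of a private use code point in a segment,
which is exactly where a clash with a legitimate private use character
can occur.

```python
PUA_BASE = 0xE000  # hypothetical offset into the Private Use Area


def _is_escape(ch):
    # Only non-ASCII bytes (0x80..0xFF) are ever escaped in this sketch.
    return PUA_BASE + 0x80 <= ord(ch) <= PUA_BASE + 0xFF


def decode_segment(seg, encoding="utf-8"):
    """Decode one path segment; fall back to ASCII + PUA escapes."""
    try:
        return seg.decode(encoding)
    except UnicodeDecodeError:
        return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in seg)


def encode_segment(seg, encoding="utf-8"):
    """Encode one path segment; PUA escapes become the original bytes."""
    if not any(_is_escape(ch) for ch in seg):
        return seg.encode(encoding)
    return b"".join(
        bytes([ord(ch) - PUA_BASE]) if _is_escape(ch) else ch.encode(encoding)
        for ch in seg
    )


def decode_path(path, encoding="utf-8"):
    return "/".join(decode_segment(s, encoding) for s in path.split(b"/"))


def encode_path(path, encoding="utf-8"):
    return b"/".join(encode_segment(s, encoding) for s in path.split("/"))


# A Latin-1 file name on a UTF-8 system survives the round trip ...
broken = b"/tmp/caf\xe9"
roundtrip_ok = encode_path(decode_path(broken)) == broken

# ... but a valid UTF-8 name that really contains U+E0E9 does not,
# because the encoder cannot tell it apart from an escaped byte.
legit = "/tmp/\ue0e9".encode("utf-8")
roundtrip_clash = encode_path(decode_path(legit)) == legit
```

Under these assumptions the first round trip preserves the bytes and the
second one does not: the file that was listed can no longer be opened
under the name that was returned.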

Regarding the subject of file names:

On Unix, it's quite possible to have to deal with 2-3 different file
systems mounted on a machine. Each of those may use a different file name
encoding or not support file name encoding at all.

If the OS doesn't guarantee a consistent file name encoding, then
why should Python try to emulate this on top of the OS ?

I think it's more important to be able to open a file, than to have
a readable file name when printing it to stdout, e.g. I wouldn't be able
to tell whether some Chinese file name makes sense or not, but if I know
that all files in a directory are meant for processing I should be able
to iterate over them regardless of whether they make sense or not.
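
As an illustration of that point, the following is a minimal sketch that
processes every file in a directory without ever decoding a file name.
It assumes an os.listdir() that returns byte string names when it is
given a byte string path, and an os.path that accepts byte strings
throughout; iter_files() and total_size() are hypothetical helper names
used only for this example.

```python
import os


def iter_files(directory):
    """Yield the full byte-string path of every regular file in directory."""
    for name in os.listdir(directory):  # bytes in, bytes out
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            yield path


def total_size(directory):
    """Process all files (here: sum their sizes) regardless of name encoding."""
    return sum(os.path.getsize(path) for path in iter_files(directory))
```

A caller would pass the directory as a byte string, e.g.
total_size(b"/some/dir"). If a name has to be shown to the user, printing
repr(path) is enough for diagnostics and never fails on undecodable bytes.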

> 2. For win32 platforms, the path is already Unicode (UTF-16) and the whole 
> problem is solved or not solved by the OS.
> 
> In the end, both approaches yield a path represented by a Unicode string for 
> intermediate use, which provides maximum flexibility. Further, it 
> preserves "broken" encodings by simply mapping their byte-values to the PUA 
> of Unicode. Maybe not using a string to represent a path would be a good 
> idea, too. At least it would make it very clear that the string is not 
> completely free-form.

-- 
Marc-Andre Lemburg
eGenix.com
