[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Tue Sep 30 08:52:21 CEST 2008

On Tue, Sep 30, 2008 at 12:22 AM, Georg Brandl <g.brandl at gmx.net> wrote:
> Victor Stinner schrieb:
>> Le Monday 29 September 2008 18:45:28 Georg Brandl, vous avez écrit :
>>> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
>>> encoding (if it were UTF-8 otherwise), despite possible surprises when a
>>> such-encoded filename escapes from Python.
>>
>> If I understand this solution correctly, the idea is to change the default
>> file system encoding, right? E.g. if your filesystem is UTF-8, use ISO-8859-1
>> to make sure that UTF-8 conversion will never fail.
>
> No, that was not what I meant (although it is another possibility). As I wrote,
> Martin's proposal that I support here is using the modified UTF-8 codec that
> successfully roundtrips otherwise invalid UTF-8 data.
>
> You seem to forget that (disregarding OSX here, since it already enforces
> UTF-8) the majority of file names on Posix systems will be encoded correctly.
>
>> Let's try with an ugly directory on my UTF-8 file system:
>> $ find
>> .
>> ./têste
>> ./ô
>> ./a?b
>> ./dossié
>> ./dossié/abc
>> ./dir?name
>> ./dir?name/xyz
>>
>> Python3 using encoding=ISO-8859-1:
>>>>> import os; os.listdir(b'.')
>> [b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>>>> files=os.listdir('.'); files
>> ['tÃªste', 'Ã´', 'aÿb', 'dossiÃ©', 'dirÿname']
>>>>> open(files[0]).close()
>>>>> os.listdir(files[-1])
>> ['xyz']
>>
>> Ok, I have unicode filenames and I'm able to open a file and list a directory.
>> The problem is now to display correctly the filenames.
>>
>> For me "unicode" sounds like "text (characters) encoded in the correct
>> charset". In this case, unicode is just a storage for *bytes* in a custom
>> charset.
>
>> How can we mix <custom unicode (bytes encoded in ISO-8859-1)> with <real
>> unicode>? E.g. os.path.join('dossiÃ©', "fichié") : first argument is encoded
>> in ISO-8859-1 whereas the second argument is encoded in Unicode. It's
>> something like that:
>>    str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'
>>
>> Whereas the correct (unicode) result should be:
>>    'dossié/fichié'
>> as bytes in ISO-8859-1:
>>    b'dossi\xe9/fichi\xe9'
>> as bytes in UTF-8:
>>    b'dossi\xc3\xa9/fichi\xc3\xa9'
>
> With the filenames decoded by UTF-8, your files named têste, ô, dossié will
> be displayed and handled correctly. The others are *invalid* in the filesystem
> encoding UTF-8 and therefore would be represented by something like
>
> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look
> pretty when printed, but then, what do other applications do? They e.g. display
> a question mark as you show above, which is not better in terms of readability.
>
> But it will work when given to a filename-handling function. Valid filenames
> can be compared to Unicode strings.
>
> A real-world example: OpenOffice can't open files with invalid bytes in their
> name. They are displayed in the "Open file" dialog, but trying to open fails.
> This regularly drives me crazy. Let's not make Python fail this way too,
> or, even worse, not even display those filenames.

The only way to display that file name would be to transform it into some
other valid Unicode string.  However, as that string is already valid, a
real file could carry exactly that name, and you've just made any such
file impossible to open.  If you extend Unicode instead, then you're
unable to display that extended name[1].
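
To make the collision concrete, here is a rough sketch.  The escape
scheme in it (an undecodable byte 0xNN becomes U+F7NN in the private use
area) and the handler name "pua_escape" are assumptions made up for
illustration, not the codec actually being proposed in this thread.

```python
import codecs

# Hypothetical escape scheme (an assumption for illustration only):
# map each undecodable byte 0xNN to the private use code point U+F7NN.
def pua_escape(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF700 + b) for b in bad), exc.end

codecs.register_error("pua_escape", pua_escape)

raw_invalid = b"dir\xffname"                  # not valid UTF-8
raw_valid = "dir\uf7ffname".encode("utf-8")   # valid UTF-8 containing U+F7FF

name_a = raw_invalid.decode("utf-8", "pua_escape")
name_b = raw_valid.decode("utf-8", "pua_escape")

# Two different on-disk names now decode to the same str, so encoding
# that str back to bytes can reach at most one of the two files.
print(name_a == name_b, raw_invalid != raw_valid)
```

Whatever range the escape targets, the same thing happens as long as
that range is itself encodable as valid UTF-8.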

I think Guido's right on this one.  If I have to choose between
OpenOffice crashing or skipping my file, I'd vastly prefer it skip it.
A warning would be a nice bonus (from Python or from OpenOffice),
telling me there's a buggered file I should go fix.  Renaming the file
is the end solution.
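
A minimal sketch of that skip-and-warn behaviour, assuming strict
decoding with a known encoding.  decodable_names is a hypothetical
helper written for this example, not an existing Python API, and the
byte strings are the ones from Victor's listing above.

```python
import warnings

def decodable_names(raw_names, encoding="utf-8"):
    """Return the names that decode cleanly; warn about and skip the rest.

    Hypothetical helper for illustration: raw_names is an iterable of
    bytes, such as the result of os.listdir(b'.').
    """
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))  # strict: raises on bad bytes
        except UnicodeDecodeError:
            warnings.warn("skipping undecodable file name %r" % (raw,))
    return names

listing = [b"t\xc3\xaaste", b"\xc3\xb4", b"a\xffb",
           b"dossi\xc3\xa9", b"dir\xffname"]
print(decodable_names(listing))
```

The skipped names stay reachable through the bytes API, so nothing is
lost; they just never show up as str.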

[1] You could argue that Unicode should add new scalars to handle all
currently invalid UTF-8 sequences.  They could then be output in their
original forms in UTF-8, or in a mundane form in UTF-16 and UTF-32.
However, I suspect "we don't want to add validation to Linux" will not
be a very persuasive argument.

-- 
Adam Olsen, aka Rhamphoryncus