[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Wed Oct 1 00:18:33 CEST 2008

Adam Olsen wrote:
> On Tue, Sep 30, 2008 at 3:43 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Of the suggestions I've seen so far, I like Marcin's Mono-inspired
>> NULL-escape codec idea the best. Since these strings all come from parts
>> of the environment where NULLs are not permitted, a simple "'\0' in
>> text" check will immediately identify any strings where decoding failed
>> (for applications which care about the difference and want to try to do
>> better), while applications which don't care will receive perfectly
>> valid Python strings that can be passed around and manipulated as if the
>> decoding error never happened.
> 
> It avoids the technical problems, but it's still magical behaviour
> that users have to learn, whereas bytes/unicode polymorphism uses the
> distinctions you should already know about.
> 
> There's also a problem of how to turn it on.  I'm against
> Python automatically changing the filesystem encoding, no matter how
> well intentioned.  Better to let the app do that, which is easy and
> could be done for all apps (not just python!) if someone defined a
> libc encoding of "null-escaped UTF-8".
> 
> On the whole I'm only -0 on it (compared to -1 for UTF-8b).

For the decoding side, you wouldn't need to do it as a codec - you could
do it as a 'nullescape' error handler (since NULLs can't be present in
the byte sequences being decoded, there is no need to worry about
escaping anything when decoding is successful).
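
A minimal sketch of what such an error handler could look like, using
the standard codecs.register_error() hook. The escape scheme here (a NUL
followed by a character whose code point is the offending byte) is only
an assumption for illustration; it is not Mono's scheme or any agreed
format, and the handler name is equally hypothetical.

```python
import codecs


def nullescape_errors(exc):
    """Replace each undecodable byte with NUL + chr(byte) (assumed scheme)."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only covers the decoding direction.
        raise exc
    bad = exc.object[exc.start:exc.end]
    escaped = "".join("\0" + chr(b) for b in bad)
    # Resume decoding right after the bytes that were escaped.
    return escaped, exc.end


codecs.register_error("nullescape", nullescape_errors)

raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "nullescape")
decoding_failed = "\0" in name  # the simple check discussed above
```

Names that decode cleanly pass through untouched, so only strings that
hit the handler ever contain a NUL.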

Converting those NULL escaped strings back into something the filesystem
can understand would obviously need a custom codec, but some kind of
application level handling of bad filenames is going to be needed no
matter how we deal with bad encoding on the input side.
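
As a rough illustration of the reverse direction, here is a plain
function (not a registered codec) that turns an escaped string back into
bytes. It assumes a hypothetical escape scheme in which each bad byte
was stored as a NUL followed by a character whose code point is that
byte, and it assumes a stateless encoding such as UTF-8, since it
encodes one character at a time.

```python
def nullescape_encode(text, encoding="utf-8"):
    """Rebuild the original bytes from a NUL-escaped string (assumed scheme)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\0":
            # The character after the NUL carries the raw byte value.
            if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
                raise ValueError("malformed NUL escape at index %d" % i)
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += ch.encode(encoding)
            i += 1
    return bytes(out)


restored = nullescape_encode("caf\0\xe9.txt")
```

A string with a dangling NUL, or a NUL followed by something that cannot
be a single byte, is rejected rather than guessed at, which is one place
where the application level handling would come in.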

That said, I don't think this is something we (or, more to the point,
Guido) need to make a decision on right now - for 3.0, having
bytes-level APIs that can see everything, and Unicode APIs that ignore
badly encoded filenames is worth trying. If it proves inadequate, then
we can revisit the idea of some kind of implicit escaping mechanism in
the Unicode APIs for 3.1 when there is more time for a proper PEP.
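
For reference, a small sketch of the two API levels side by side:
os.listdir() returns bytes names when given a bytes path and str names
when given a str path. The helper name and the temporary directory are
just for illustration, and how undecodable names show up on the str side
depends on the Python version and platform.

```python
import os
import tempfile


def list_names(directory):
    """Return (bytes_names, str_names) for the same directory."""
    raw = sorted(os.listdir(os.fsencode(directory)))  # bytes in, bytes out
    text = sorted(os.listdir(directory))              # str in, str out
    return raw, text


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    raw_names, text_names = list_names(tmp)
```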

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org