[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 30 05:11:12 CEST 2008


James Y Knight writes:
 > On Sep 29, 2008, at 3:32 AM, Adam Olsen wrote:

 > > UTF-8b doesn't work as intended.  It produces an invalid unicode
 > > object (garbage surrogates) that cannot be used with external APIs or
 > > libraries that require unicode.
 > 
 > I'd be interested to hear more detail on what you expect the practical  
 > ramifications of this to be. It doesn't sound likely to be a problem  
 > to me.

That's because you have a specific use case in mind.  Adam clearly has
in mind passing the filename on to a library which might proceed to
signal an error (to him, unexpected) on garbage surrogates.  He
doesn't want to be surprised by that.
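
To make that concern concrete, here is a minimal sketch.  It assumes
that the 'surrogateescape' error handler in current Python 3 is a fair
stand-in for UTF-8b; that handler is not mentioned in this thread, and
strict_library_call is a hypothetical placeholder for any external API
that insists on well-formed Unicode.

```python
# Sketch only: 'surrogateescape' is used as an assumed stand-in for UTF-8b.
raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
has_lone_surrogate = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def strict_library_call(text):
    """Hypothetical external API that requires well-formed Unicode."""
    return text.encode("utf-8")  # strict error handling by default


try:
    strict_library_call(name)
    library_accepted = True
except UnicodeEncodeError:
    # This is the (to the caller, unexpected) error on garbage surrogates.
    library_accepted = False

# Encoding with the same handler still recovers the original bytes.
round_trip = name.encode("utf-8", "surrogateescape")
```

The round trip works only as long as every consumer of the string
knows about the private convention; the strict call is where it leaks.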

The problem is that all of these hacks involve a private encoding that
looks like something else, and standards-conforming external programs
will be confused by them.  You can't prevent them from leaking unless
you store them as a non-text type, which has huge ramifications.

 > > If you don't need unicode then your
 > > code should state so explicitly, and 8859-1 is ideal there.
 > 
 > But, I *do* want unicode. ALL my filenames are encoded in utf8.  

That's not really what is at issue here.  The point is that in the
exceptional case where you get non-Unicode, and are willing to accept
it, ersatz binary (ISO-8859-1) works fine.  The problem is tagging
this as an exceptional filename that doesn't use the usual encoding;
that should be done by the application, I think.  Most applications
won't need it.
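
A rough sketch of what application-level tagging could look like.  The
function names and the (text, flag) return shape are hypothetical, and
UTF-8 is assumed to be "the usual encoding".

```python
def decode_filename(raw):
    """Return (text, is_exceptional) for a byte-string filename.

    Assumes UTF-8 is the usual encoding (an assumption of this sketch).
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # ISO-8859-1 maps each byte 0x00-0xFF to U+0000-U+00FF, so this
        # decode cannot fail and always round-trips: "ersatz binary".
        return raw.decode("iso-8859-1"), True


def encode_filename(text, is_exceptional):
    """Invert decode_filename, using the tag to pick the encoding."""
    return text.encode("iso-8859-1" if is_exceptional else "utf-8")


good = "na\u00efve.txt".encode("utf-8")
bad = b"caf\xe9.txt"  # not valid UTF-8

good_text, good_flag = decode_filename(good)
bad_text, bad_flag = decode_filename(bad)
```

Both strings here are well-formed Unicode, so they can be passed to
any library; only the tag records that the second one is exceptional.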

 > Except...that one over there. That's the whole point of UTF-8b:  
 > correctly encoded names get decoded correctly and readably, and the  
 > other cases get decoded into something unique that cannot possibly  
 > conflict.

Sure.  But there are lots of other operations besides encoding and
decoding that we do with filenames.  How do you display a filename?
How about concatenating them to make paths?  What do you do when you
want to mix a filename with other, well-formed strings?  If you keep
the filenames internally in UTF-8b, you're going to need what amounts
to a whole string API for dealing with them, aren't you?  If you're
not doing that, how is UTF-8b represented?

And in any case, when you do want to process them as text, the
"something unique" will have to be handled exceptionally.  I don't
think it makes sense to delay that exception; the exception should be
raised as soon as Python fails to make sense of the filename.  What to
do about that exception is a policy matter, as well.  Shouldn't that
policy be decided at the application level, rather than the Python
level?
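
One possible shape for that split, as a sketch only: the decoding
layer raises immediately, and the caller supplies the policy.  All
names below (UndecodableFilename, decode_strict, list_names, and the
two example policies) are hypothetical.

```python
class UndecodableFilename(ValueError):
    """Raised as soon as a filename cannot be decoded."""

    def __init__(self, raw):
        super().__init__("undecodable filename: %r" % (raw,))
        self.raw = raw  # keep the original bytes for the handler


def decode_strict(raw, encoding="utf-8"):
    """Decode or fail at once; no private escape encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        raise UndecodableFilename(raw) from None


def list_names(raw_names, on_error):
    """Decode names; on_error(exc) returns a replacement or None to skip."""
    result = []
    for raw in raw_names:
        try:
            result.append(decode_strict(raw))
        except UndecodableFilename as exc:
            handled = on_error(exc)
            if handled is not None:
                result.append(handled)
    return result


# Two example application-level policies.
def skip(exc):
    return None


def as_latin1(exc):
    return exc.raw.decode("iso-8859-1")


names = [b"ok.txt", b"caf\xe9.txt"]
skipped = list_names(names, skip)
kept = list_names(names, as_latin1)
```

An application that never expects odd filenames can pass skip (or let
the exception propagate); one that must handle them picks its own rule.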

