[issue3187] os.listdir can return byte strings

STINNER Victor report at bugs.python.org
Thu Aug 21 22:55:34 CEST 2008


STINNER Victor <victor.stinner at haypocalc.com> added the comment:

On Thursday 21 August 2008 18:17:47, Guido van Rossum wrote:
> The proper work-around is for the app to pass bytes into os.listdir();
> then it will return bytes.

In my case, I would just like to remove a directory with shutil.rmtree(). I
don't know whether its filenames are bytes or characters :-)
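
A minimal sketch of the work-around quoted above, assuming a Python 3
release that provides os.fsencode() (it did not exist when this was
written); listdir_as_bytes is a hypothetical helper name:

```python
import os
import tempfile


def listdir_as_bytes(path):
    """Return the directory entries as bytes, whatever type `path` has."""
    # Passing a bytes path makes os.listdir() return bytes names,
    # so no decoding of the filenames is attempted.
    return os.listdir(os.fsencode(path))


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    names = listdir_as_bytes(tmp)
    print(names)
```

This only helps when the caller controls the listdir() call, which is not
the case when the listing happens inside shutil.rmtree().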

> It would be nice if open() etc. accepted 
> bytes (as well as strings of course), at least on Unix, but not
> absolutely necessary -- the app could also just know the right encoding.

An invalid filename has no charset. It's just a "raw" byte string. So open(),
unlink(), etc. have to accept byte strings. Maybe not in the Python-level
version, but at least in the low-level (C) version?

> I see two reasonable alternatives for what os.listdir() should return
> when the input is a string and one of the filenames can't be decoded:
> either omit it from the output list;

It's not a good option: rmtree() will fail because the directory is not
empty :-/

> or use errors='replace' in the encoding.

It will also fail because the filenames will be invalid (valid unicode strings
but nonexistent file names :-/).
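
To illustrate the point, here is a small sketch, assuming a UTF-8
filesystem encoding and a made-up Latin-1 filename: decoding with
errors='replace' gives a valid string that no longer maps back to the
bytes stored on disk.

```python
# Hypothetical filename stored on disk as Latin-1; 0xE9 is invalid UTF-8 here.
raw_name = b"caf\xe9.txt"

# The undecodable byte becomes U+FFFD (the replacement character).
decoded = raw_name.decode("utf-8", errors="replace")

# Encoding the result again yields different bytes, so open() or unlink()
# on `decoded` would look for a file that does not exist.
round_trip = decoded.encode("utf-8")

print(repr(decoded))
print(round_trip == raw_name)
```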

> Failing the entire os.listdir() call is not acceptable, and 
> neither is returning a mixture of str and bytes instances.

Ok, I have another suggestion:
 - *by default*, listdir() only returns str and raises an error (TypeError?)
   on an invalid filename
 - add an optional argument (a callback), e.g. "fallback_encoder", to catch
   such errors (similar to "onerror" from shutil.rmtree())

Example of new listdir implementation (pseudo-code):

   charset = sys.getfilesystemencoding()
   dirobj = opendir(path)
   try:
      for bytesname in readdir(dirobj):
          try:
              name = str(bytesname, charset)
          except UnicodeDecodeError:
              name = fallback_encoder(bytesname)
          yield name
   finally:
      closedir(dirobj)

The default fallback_encoder:

   def fallback_encoder(name):
      raise

Keep raw bytes string:

   def fallback_encoder(name):
      return name

Create my custom filename object:

   class Filename:
      ...

   def fallback_encoder(name):
      return Filename(name)

If a callback is overkill, we can just add an option,
e.g. "keep_invalid_filename=True", to ask listdir() to keep the byte string if
the conversion to unicode fails.
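
The callback idea can be tried out without changing os.listdir() itself.
The following is only a sketch under stated assumptions: it lists the
directory as bytes and decodes on top of that, all function names are
hypothetical, and unlike the pseudo-code above the callback receives the
decoding error as a second argument so it can re-raise it explicitly.

```python
import os
import sys


def strict_fallback(raw_name, error):
    """Default policy: refuse undecodable names by re-raising the error."""
    raise error


def keep_bytes_fallback(raw_name, error):
    """Alternative policy: hand back the raw bytes unchanged."""
    return raw_name


def decode_names(raw_names, fallback=strict_fallback, encoding=None):
    """Decode each bytes name, delegating failures to `fallback`."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    for raw_name in raw_names:
        try:
            yield raw_name.decode(encoding)  # strict decoding
        except UnicodeDecodeError as error:
            yield fallback(raw_name, error)


def listdir_with_fallback(path, fallback=strict_fallback):
    """List `path` as bytes, then decode the names with the chosen policy."""
    return list(decode_names(os.listdir(os.fsencode(path)), fallback))
```

Note that keep_bytes_fallback produces exactly the mixture of str and bytes
that the quoted message objects to, which is why it would have to be
requested explicitly rather than being the default.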

In any case, open(), unlink(), etc. have to accept byte strings to be able to
read, copy, and remove invalid filenames. In a perfect world, all filenames would
be valid UTF-8 strings, but in the real world (think of The Matrix :-)), we have
to support such strange cases...
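
As a rough sketch of what bytes support looks like from the caller's side,
assuming a later Python 3 release on a POSIX system where open() and
os.unlink() accept bytes paths (create_and_remove is a hypothetical helper):

```python
import os
import tempfile


def create_and_remove(dir_path, raw_name):
    """Create, then delete, a file addressed purely by a bytes path."""
    raw_path = os.path.join(os.fsencode(dir_path), raw_name)
    with open(raw_path, "wb") as handle:
        handle.write(b"data")
    existed = os.path.exists(raw_path)
    os.unlink(raw_path)
    # Report whether the file existed before and after the unlink.
    return existed, os.path.exists(raw_path)


with tempfile.TemporaryDirectory() as tmp:
    print(create_and_remove(tmp, b"plain.txt"))
```

The name never passes through a str, so the same calls would apply to a
filename that cannot be decoded, provided the filesystem allows it.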

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3187>
_______________________________________

