[Python-Dev] Python-3.0, unicode, and os.environ

Toshio Kuratomi a.badger at gmail.com
Sun Dec 7 19:03:13 CET 2008


glyph at divmod.com wrote:
> 
> On 06:07 am, a.badger at gmail.com wrote:
>> Most apps aren't file managers or ftp clients but when they interact
>> with files (for instance, a file selection dialog) they need to be able
>> to show the user all the relevant files.  So on an app-by-app basis the
>> need for this is high.
> 
> While I tend to agree emphatically with this, the *real* solution here
> is a path-abstraction library.

Why don't you send me some information offlist.  I'm not sure I agree
that a path-abstraction library can work correctly but if it can it
would be nice to have that at a level higher than the file-dialog
libraries that I was envisioning.

[snip]

>> ... but that still
>> doesn't help me identify when someone would expect that asking python
>> for a list of all files in a directory or a specific set of files in a
>> directory should, without warning, return only a subset of them.  In
>> what situations is this appropriate behaviour?
> 
> If you say listdir(unicode) on a POSIX OS, your program is saying "I
> only know how to deal with unicode results from this function, so please
> only give me those.".

No.  (explained below)

>  If your program is smart enough to deal with
> bytes, then you would have asked for bytes, no?

Yes (explained below)

>  Returning only
> filenames which can be properly decoded makes sense.  Otherwise everyone
> needs to learn about this highly confusing issue, even for the simplest
> scripts.
>
os.listdir(unicode) (currently) means that the *programmer* is asking
that the stdlib return the decodable filenames from this directory.  The
question is whether the programmer understood that this is what they
were asking for and whether it is what they most likely want.  I would
make the following statements WRT this:

1) The programmer most likely does not want the decodable filenames and
only the decodable filenames.  If they did, we'd see a lot of python2.x code that
turns pathnames into unicode and discards everything that wasn't
decodable.  No one has given a use case for finding only the *decodable*
subset of files.  If I request to see all *.py files in a directory, I
want to see all of the *.py files in the directory, decodable or not.
If you can show how programmers intend "90%" of their calls to
os.listdir()/glob.glob('*.txt') to show only the decodable subset of the
results, then the foundation of my arguments is gone.  So please, give
examples to prove this wrong.

  - If this is true, a definition of os.listdir(<type 'str'>) that would
better meet programmer expectation would be: "Give me all files in a
directory with the output as str type".  The definition of
os.listdir(<type 'bytes'>) would be "Give me all files in a directory
with the output as bytes type".  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.

2) For the programmer to understand the difference between
os.listdir(<type 'bytes'>) and os.listdir(<type 'str'>) they have to
understand the "highly confusing issue" and what it means for their
code.  So the current method is forcing programmers to understand it
even for the simplest scripts if their environment is not uniform, and
it gives them no clue from the interpreter that there is an issue.

  - Similarly, raising an exception on undecodable values means that the
programmer can ignore the issue in any scripts in sane environments and
will be told that they need to deal with it (via an exception) when
their script runs in a non-sane environment.

3) The usage of unicode vs bytes is easy to miss for someone starting
with py2.x or windows and moving to a multi-platform or unix project.
Even simple testing won't reveal the problem unless the programmer knows
that they have to test what happens when encodings are mixed.  Once
again, this is requiring the programmer to understand the encoding issue
without help from the interpreter.
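
To make the difference between the two behaviours concrete, here is a
minimal sketch.  The helper names decodable_subset and
decode_all_or_raise are made up for illustration and are not stdlib
functions, and they work on a hand-written list of byte strings rather
than a real directory, so treat this as a model of the two policies and
not as the actual os.listdir() implementation.

```python
def decodable_subset(raw_names, encoding="utf-8"):
    """Model of skip-on-failure: keep only the names that decode."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # silently dropped; the caller never learns about it
    return names


def decode_all_or_raise(raw_names, encoding="utf-8"):
    """Model of the proposed behaviour: an undecodable name is an error."""
    return [raw.decode(encoding) for raw in raw_names]


# Hypothetical directory contents as raw bytes.  The second name is
# Latin-1 encoded and is not valid UTF-8.
raw_names = [b"setup.py", b"caf\xe9.py", "na\u00efve.py".encode("utf-8")]

# Policy 1: the caller gets two of the three *.py files and no warning.
skipped = decodable_subset(raw_names)

# Policy 2: the caller is told, with a traceback, that something is off.
try:
    decode_all_or_raise(raw_names)
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

With the first policy the script "works" and quietly loses a file; with
the second the programmer finds out the first time the script meets a
mixed-encoding directory.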

> Skipping undecodable values is good enough that it will work 90% of the
> time.

You and Guido have now made this claim to defend not raising an
exception but I still don't have a use case.

Here are use cases that I see:

* Bill is coding an application for use inside his company.  His company
only uses utf-8.  His code naively uses os.listdir(<type 'str'>).

  - The code does not throw an exception, whether we use the current
os.listdir() or one that could throw an exception, because the system
admins have sanitised the environment.  Bill did not need to understand
the implications of encoding for his code to work, whether his script is
simple or complex.

* Mary is coding an application for use inside her company.  It finds
all html files on a system and updates her company's copyright, privacy
policy, and other legal boilerplate.  Her expectation is that after her
program runs every file will have been updated.  Her environment is a
mixture of different filename encodings due to having many legacy
documents for users in different locales.  Mary's code also naively uses
os.listdir(<type 'str'>).  Her test case checks that the code does the
right thing on many languages but unfortunately doesn't check with
different encodings because she'd have to already understand the
encoding issue to check for that.

  - With the current approach, the code will silently do the wrong thing
in production for years, until someone notices and alerts the company
that something is wrong with certain files in certain locales.  By then,
Mary may no longer be involved with the company and there are thousands
of users who thought they were operating under the old legal terms
instead of the new ones.

  - With exceptions raised, Mary will be alerted of the problem when she
tries to run the code in production for the first time.  She can then do
a little research and fix it to run correctly.  The traceback that's
issued can be googled and the line that it points to will show where the
error is occurring.

* Arthur's company has shipped some of his code in a product.  The code
uses os.listdir(<type 'str'>) to find images and movies in a directory
before deciding whether they contain pornography.  A cron job runs the
code and the messages it prints are sent by cron to the system admins to
take action on.  A customer calls to complain that the code did not
detect that a recently fired employee had a 30 minute pornographic movie
on his office computer.  Arthur has to figure out why.

  - With the current code, Arthur might start with the algorithms that
examine the movies, try to get samples of the pornography from the
company, and look in many wrong places before finding out that the code
that searches for files is not listing all the files in directories.

  - With tracebacks raised, the system admins, at least, will have
received messages from cron stating that the undecodable filenames are
causing errors that need to be addressed.  They can call Arthur's
company when they notice this and Arthur can fix it quickly because the
traceback contains all the necessary information.
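
For a case like Mary's, the fix after seeing the traceback could be as
small as asking for bytes so that nothing is dropped.  The sketch below
is one way to do that; find_html_files is a hypothetical helper name,
and the temporary directory at the bottom is only demo scaffolding with
plain ASCII names, not a reproduction of a mixed-encoding filesystem.

```python
import os
import tempfile


def find_html_files(root):
    """Return every *.html path under root as bytes, decodable or not."""
    root = os.fsencode(root)  # accept str or bytes, always walk as bytes
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Names are bytes here, so compare against a bytes suffix.
            if name.lower().endswith(b".html"):
                matches.append(os.path.join(dirpath, name))
    return matches


# Demo scaffolding: build a small tree and list the html files in it.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("a.html", "notes.txt", os.path.join("sub", "c.html")):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("<p>old terms</p>")
    found = sorted(os.path.basename(p) for p in find_html_files(tmp))
```

Because the paths stay as bytes from listing through to open(), no
decoding step exists that could drop or reject a filename.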

-Toshio
