[Python-Dev] Import and unicode: part two

Thu Jan 20 18:44:39 CET 2011

On Thu, Jan 20, 2011 at 12:51:29PM +0100, Victor Stinner wrote:
> Le mercredi 19 janvier 2011 à 20:39 -0800, Toshio Kuratomi a écrit :
> > Teaching students to write non-portable code (relying on the filesystem
> > encoding, where your solution is "don't upload to pypi anything that has
> > non-ascii filenames") seems like the exact opposite of how you'd want to
> > shape a young student's understanding of good programming practices.
> 
> That was already discussed before: see PEP 3131.
> http://www.python.org/dev/peps/pep-3131/#common-objections
> 
> If the teacher chooses to use non-ASCII, (s)he is responsible for explaining
> the consequences to his/her students :-)
> 
It's not discussed in that PEP section.

The PEP section says this: "People claim that they will not be able to use
a library if to do so they have to use characters they cannot type on their
keyboards."

Whether you can type it at your keyboard or not is not the problem here.
The problem is portability.  The students and professors are sharing code
with each other.  But because of a mixture of operating systems (let alone
locale settings), the code written by one partner is unable to run on the
computer of the other.

If non-ascii filenames without a defined encoding are considered a feature,
python cannot even issue a descriptive error when this occurs.  It can only
say that it could not find the module, not why.  If module names were
restricted to ASCII, python could state that non-ASCII module names are not
allowed as soon as it encounters the import line.
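
To illustrate the kind of error I mean, here is a minimal sketch.  The helper
check_module_name() is hypothetical and is not part of Python's import
machinery; it only shows that an ASCII-only rule can be checked from the name
alone, without looking at the filesystem.

```python
def check_module_name(name):
    """Raise a descriptive ImportError if a dotted name is not pure ASCII.

    Hypothetical helper for illustration only; the real import system
    performs no such check.
    """
    for part in name.split("."):
        try:
            part.encode("ascii")
        except UnicodeEncodeError:
            raise ImportError(
                "module name %r contains non-ASCII characters, "
                "which are not allowed in module names" % part
            ) from None


check_module_name("os.path")  # ASCII only: passes silently

try:
    check_module_name("café")
except ImportError as exc:
    message = str(exc)
    print(message)
```

The point is that the message names the actual cause, whereas a failed lookup
on the filesystem can only report that the module was not found.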

> > > In a school, you can use the same configuration
> > > (encoding) on all computers.
> > > 
> > In a school computer lab perhaps.  But not on all the students' and
> > professors' machines.  How many professors will be cursing python when they
> > discover that the example code that they wrote on their Linux workstation
> > doesn't work when the students try to use it in their windows computer lab?
> 
> Because some students use a stupid or misconfigured OS, Python should
> only accept ASCII names?

Just a note -- you'll get much farther if you refrain from calling names.
It just makes me think that you aren't reading and understanding the issue
I'm raising.  My examples that you're replying to involve two "properly
configured" OS's.  The Linux workstations are configured with a UTF-8
locale.  The Windows OS's use wide character unicode.  The problem occurs in
that the code that one of the parties develops (either the students or the
professors) is developed on one of those OS's and then used on the other OS.

> So, why does Python 3 support non-ASCII
> filenames: it is very well known that non-ASCII filenames are the root of
> many troubles! Should we simply drop unicode support for all filenames?
> And maybe restrict bytes filenames to bytes in [0; 127]? Or better,
> restrict to [32; 126] (U+007f causes some troubles in some terminals).
> 
If you want to argue from the fact that python3 supports non-ascii filenames
in other code, then the logical extension is that the import mechanism should
support importing module names defined by byte sequences.  I happen to think
that import differs from other uses of filenames in a lot of ways, as I've
said three times now.

> I think that in 2011, non-ASCII filenames are well supported on all
> (modern) operating systems. Issues with non-ASCII filenames are OS
> specific and should be fixed by the user (the admin of the computer).
> 
> > Additionally, those other filesystem operations have
> > been growing the ability to take byte values and encoding parameters because
> > unicode translation via a single filesystem encoding is a good default but
> > not a complete solution.
> 
> If you are unable to configure your system to decode/encode filenames
> correctly, you should just avoid non-ASCII characters in the
> module names.
> 
This seems like an argument to only have unicode versions of all filesystem
operations.  Since you've been spearheading the effort to have bytes
versions of things that access filenames, environment variables, etc,
I don't think that you seriously mean that.  Perhaps there is a language
issue here.

> You only give theoretical arguments: did you at least try to use non-ASCII
> module names on your system with Python 3.2? I suppose that it will just
> work and you will never notice that the unicode module name (on "import
> café") is encoded to bytes.
> 
Yes I did, and I got it to fail in a corner case, as I showed twice with the
same example in other posts.  However, I want to make clear here that the issue
is not that I can create a non-ascii filename and then import it.  The issue
is that I can create a non-ascii filename and then try to share it with the
usual tools and it won't work on the recipient's system.  (A tangent is
whether the recipient's system is physically distinct from mine or only has
a different environment on the same physical host.)

> It fails on OSes using filesystem encodings other than UTF-8 (eg.
> Windows)... because of a Python bug, and I just asked if I have to fix
> this bug (or if we should deny non-ASCII names). If the bug is fixed, it
> will work everywhere.
> 
I understand that your patch allows non-ASCII names to work on Windows.  My
issue is that non-ASCII names have ramifications beyond just "works on
Windows" and "works on Linux".  There's also the question of whether it works
when you transfer modules between OS's.
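
A small sketch of one way the transfer can go wrong.  It assumes a tool that
carries the raw filename bytes from a UTF-8 system to a system that interprets
them as ISO-8859-1 (or the reverse); the encodings are chosen for illustration
and are not taken from any particular tool.

```python
name = "café"

# Bytes stored for the filename on a system using UTF-8.
utf8_bytes = name.encode("utf-8")

# Bytes stored for the same name on a system using ISO-8859-1.
latin1_bytes = name.encode("iso-8859-1")

# The two systems do not agree on the bytes for the same module name.
print(utf8_bytes, latin1_bytes)

# UTF-8 bytes reinterpreted as ISO-8859-1 decode without error,
# but to a different name (mojibake).
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# ISO-8859-1 bytes reinterpreted as UTF-8 do not decode at all.
try:
    latin1_bytes.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)
```

In both directions "import café" on the receiving side would look for a name
that is not the one sitting on disk.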

> > Your solution creates modules which aren't portable
> 
> More and more operating systems use a filesystem encoding able to encode
> any Unicode character. ASCII-only always gives you the best portability,
> but I think that today you can start to play with (at least) ISO-8859-1
> characters (café should work on all operating systems). If you don't like
> Unicode issues (I personally love them!), just use ASCII everywhere.
> 
I'd be happy to agree with your enthusiasm for unicode characters if your
patch included a method to preserve portability between operating systems.

> > One of my proposals creates python code which isn't portable.  The other one
> > suffers some of the same disadvantages as your solution in portability but
> > allows for tools that could automatically correct modules.
> 
> __import__('café'.encode('UTF-8')) or
> __import__('café'.encode('ISO-8859-1')) is less portable than
> __import__('café').
> 
Yep, this method is just as unportable as yours, as I said in an analysis in
a previous post.  The other method is the one that's more portable but has
painful drawbacks.

(Also note that your example above ignores one of the differences between
import and open() that I mentioned in a previous post:  import assigns the
module to a name automatically whereas open() [like __import__()] makes the
programmer assign the name)
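
A short sketch of that difference, using the stdlib json module only as
a stand-in.  The name my_json is arbitrary and chosen for illustration.

```python
import importlib

# The import statement loads the module AND binds it to the name "json";
# the bound name is derived from the module name itself.
import json

# import_module() (like __import__()) only returns the module object;
# the programmer picks the name it is assigned to.
my_json = importlib.import_module("json")

same = my_json is json
print(same)
```

With the statement form, the module name has to double as an identifier in
the source code, which is why I treat import differently from open().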

> > You think that if a module is named appropriately on one system but is not portable to another
> > system, that's fine.
> 
> No, I am not saying that.
> 
> I say that if a name gets broken when you transfer your project from one
> system to another (eg. decompressing an archive creates filenames with
> mojibake), you should fix your transfer procedure (eg.
> use another archive format, use a script to fix filenames, or anything
> else), but don't try to handle invalid filenames.
> 
So here's a revised summary:

A module being able to be imported by the module author is of primary
importance.  Portability of modules relies upon third party tool support.
Lacking that support, the modules may not be portable.

> > Setting system locale to ASCII for use in system-wide scripts
> 
> This is stupid :-) Yes, on such a system, you cannot open *any* non-ASCII
> file with Python 3 (except if you work, as in Python 2, on bytes
> filenames).
> 
> Python cannot do anything to improve Unicode support on such a system:
> only the administrator can do something about that.
> 
Python supports open() with a bytes argument for this reason.  import does
not support such a thing (and I think it would be more wrong for import to
do so).
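
For reference, a minimal sketch of the bytes interface on the open() side.
It assumes a filesystem that accepts the UTF-8 bytes for "café.txt" as
a filename; the file name and contents are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build the whole path from bytes, so no filename decoding is needed.
    directory = os.fsencode(tmp)
    path = os.path.join(directory, "café.txt".encode("utf-8"))

    with open(path, "w") as f:
        f.write("hello")

    # A bytes argument to listdir() yields bytes filenames back.
    listing = os.listdir(directory)

    with open(path) as f:
        content = f.read()

print(listing, content)
```

Nothing comparable exists for the import statement, which only ever sees the
unicode module name.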

> I know that you can give me many examples of systems where Unicode
> doesn't work because the system is not correctly configured. But my
> opinion is that we should support non-ASCII names because there are
> somewhere "some" systems where Unicode is fully functional :-)
> 
Comments like these make me think that you aren't understanding me which
just makes me frustrated with you.  OTOH, if you could acknowledge the
points that I'm making and simply disagree with the relative merits of them
then we could simply agree to disagree.

-Toshio