[Python-Dev] Import and unicode: part two

Thu Jan 27 03:43:11 CET 2011

On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
> There's one further case that I am worried about that has no real
> "transfer".  Since people here seem to think that unicode module names are
> the future (for instance, the comments about redefining the C locale to
> include utf-8 and the comments about archiving tools needing to support
> encoding bits), there are eventually going to be unicode modules that become
> dependencies of other modules and programs.  These will need to be installed
> on systems.  Linux distributions that ship these will need to choose
> a filesystem encoding for the filenames of these.  Likely the sensible thing
> for them to do is to use utf-8 since all the ones I can think of default to
> utf-8.  But, as Stephen and Victor have pointed out, users change their
> locale settings to things that aren't utf-8 and save their modules using
> filenames in that encoding.  When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.
>
> This is not a good position to put users of these systems in.

The way this case should work is that programs that install files 
(installation is a form of transfer) should transform the file names 
from the encoding used in the transfer medium to the encoding of the 
filesystem on which they are installed.

Python3 should access the files, transforming the names from the 
encoding of the filesystem on which they are installed to Unicode for 
use by the program.
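
As a rough sketch of that second step: the standard library's 
os.fsdecode() and os.fsencode() convert between the bytes stored on disk 
and the str seen by the program, using the encoding reported by 
sys.getfilesystemencoding().  The helper names below are only 
illustrative, not anything the import machinery itself defines.

```python
import os
import sys

# The encoding Python assumes for file names on this system.
fs_encoding = sys.getfilesystemencoding()

def name_to_bytes(name):
    """Characters -> bytes, as the name would be handed to the OS."""
    return os.fsencode(name)

def bytes_to_name(raw):
    """Bytes from the filesystem -> characters for use by the program."""
    return os.fsdecode(raw)

# An ASCII-only name round-trips under any ASCII-compatible encoding.
raw = name_to_bytes("module.py")
name = bytes_to_name(raw)
```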

I think Python3 is trying to do its part, and Victor is trying to make 
that more robust on more platforms, specifically Windows.

The programs that install files (which may include programs that install 
Python files; I don't know) may or may not be doing their part, but 
clearly there are cases where they do not.

Systems that have different encodings for names on the same or different 
file systems need to have a way to obtain the encoding for the file 
names, so they can be properly decoded.  If they don't have such a way, 
they are broken.

=====
The rest of this is an attempt to describe the problem of Linux and 
other systems which use byte strings instead of character strings as 
file names.  That is no problem as long as programs allow byte strings 
as file names.  Python3 does not for the import statement, so the 
problem is relevant to the ongoing discussion here.
=====

Since file names are defined to be byte strings, there is no way to 
obtain the encoding for file names.  So they cannot always be decoded, 
and are sometimes decoded incorrectly, because no one knows which 
encoding was used to create them, _if any_.

Hence, Linux programs that use character strings as file names 
internally and expect them to match the byte strings in the file system 
are promoting a fiction: that there is a transformation (encoding) from 
character strings to byte strings that will match.

When using ASCII character strings, they can be transformed to bytes 
using a simple transformation: identity... but that isn't necessarily 
correct, if the files were created using EBCDIC (unlikely on Linux 
systems, but not impossible, since Linux files are byte strings).

When using non-ASCII character strings, the fiction promoted is even 
bigger, and the transformation even harder.  Any 8-bit character 
encoding can pretend that identity is the correct transformation, but 
the result is mojibake if it isn't.  Unicode and other multi-byte 
encodings have an even harder job, because there can be 8-bit sequences 
that are not legal for some transformations, but are legal for others.  
This is when the fiction is exposed!
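
A small illustration of both failure modes, using one made-up name 
("caf\u00e9" is "cafe" with an acute accent on the e): decoding with the 
wrong 8-bit encoding silently produces mojibake, while decoding with the 
wrong multi-byte encoding can fail outright.

```python
name = "caf\u00e9"

utf8_bytes = name.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = name.encode("latin-1")  # b'caf\xe9'

# latin-1 accepts every byte, so the wrong guess "works" but gives
# two junk characters in place of the single accented one.
mojibake = utf8_bytes.decode("latin-1")

# UTF-8 has illegal byte sequences, so the wrong guess is detected.
try:
    latin1_bytes.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
```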

As the recent description of glib points out, when the file names are 
read as bytes, and shown to the user for selection, possibly using some 
mojibake-generating transformation to characters, the user has a 
fighting chance to pick the right file, less chance if the 
transformation is lossy ('?' substitutions, etc.) and/or the names are 
redundant in their lossless characters.

However, when the specification of the name is in characters (such as 
for Python import, or file names specified as character constants in any 
application system that provides/permits such), and there are large 
numbers of transformations that could be used to convert characters to 
bytes, the problem is harder, and error-prone... programs that want to 
promote the fiction of using characters for filenames must work harder.  
It seems that Python on Linux is such a program.

One technique is to have conventions agreed on by applications and users 
to limit the number of encodings used on a particular system to one 
(optimal) or a few.  The latter requires understanding that files 
created in one encoding may not be accessible by systems that use a 
different one... until they are renamed.  Subsets of applications and 
users can then happily share files with others of their encoding, and 
with the subset of files that can be decoded successfully by their 
encoding even though it is not the correct one (often ASCII names, or a 
few mojibake characters learned for cross-subset usage).  When multiple 
encodings are used without such conventions, chaos results.

Another technique that would be amusing is to use Base64 (as Oleg 
suggested), URL-encoding, or some other mapping that transforms 
non-ASCII names to ASCII character sequences, and then use the identity 
mapping to obtain bytes.  Python could then ship such files to any 
system, as long as it always included that mapping as one of the 
encodings it would try when looking for files.  This would probably be 
the most powerful solution, but would only need to be applied to those 
systems that do not use characters for filenames.  It could, in fact, be 
applied on any system that uses a subset of characters for filenames, 
and hence transcends the need for Unicode support in a file system to 
use Unicode names in Python3 import statements.  It would likely be 
problematic for use with 3rd-party libraries, however.
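
A minimal sketch of the URL-encoding variant, assuming the name is first 
encoded as UTF-8 and then percent-escaped (which is what 
urllib.parse.quote does by default).  The two function names and the 
".py" convention here are hypothetical; nothing like this exists in the 
import system.

```python
from urllib.parse import quote, unquote

SUFFIX = ".py"

def to_ascii_filename(module_name):
    """Map a Unicode module name to an ASCII-only file name."""
    # safe="" so that "/" and other separators are escaped as well.
    return quote(module_name, safe="") + SUFFIX

def from_ascii_filename(filename):
    """Recover the Unicode module name from the ASCII file name."""
    stem = filename[:-len(SUFFIX)]
    return unquote(stem)

ascii_name = to_ascii_filename("caf\u00e9")
on_disk = ascii_name.encode("ascii")   # identity mapping to bytes
round_trip = from_ascii_filename(ascii_name)
```

URL-encoding is used here rather than Base64 only because the standard 
Base64 alphabet contains "/", which cannot appear in a file name.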

Another technique would be to try each possible encoding in turn, in 
some defined order, searching the filesystem for each resulting byte 
string as a file name; this could match files that shouldn't have been 
matched.  To limit that search, such programs could allow configuration 
of a smaller ordered list of encodings to be tried, and a specific one 
to be used for the creation of new files; this opens up the possibility 
of not trying the "right" encoding for some rogue file name.
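
Sketched in code, assuming the directory contents are already available 
as byte strings (for example from os.listdir() called with a bytes 
path).  match_name and the encoding list are made up for illustration.

```python
def match_name(present, wanted, encodings):
    """Try each encoding in order; return (bytes_name, encoding) for
    the first encoded form of `wanted` found in `present`, else None."""
    for enc in encodings:
        try:
            candidate = wanted.encode(enc)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the name at all
        if candidate in present:
            return candidate, enc
    return None

# Byte names as they might be found in a directory.
present = {b"caf\xe9.py", b"other.py"}
configured = ["utf-8", "latin-1"]

found = match_name(present, "caf\u00e9.py", configured)
missing = match_name(present, "caf\u00e9.py", ["utf-8"])
```

The second call shows the risk mentioned above: with latin-1 left out of 
the configured list, the file is there but is never found.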

This would be an issue, and would need an implementation, for Linux 
systems, but would not be needed on systems such as MacOS or Windows, 
each of which defines a particular encoding.  When mounting filesystems 
that use byte string file names on systems with a defined encoding, it 
should be the responsibility of the mounting system to do such 
transformations, and possibly have such configurations, and possibly 
have mappings or renaming facilities, and possibly prohibit access to 
files whose names cannot be transformed.  (Of course, one can always 
punt by configuring latin-1 or other encodings that can match any byte 
string, but that produces mojibake, and then there is no surety that 
particular files will appear to have the name that programs expect.)

Of course, Victor's patch is addressing Windows issues, and Windows has 
defined encodings; it is just a matter of using the proper APIs to see 
them.  The patch should be accepted.

It sounds like the current situation on Linux is that Python can access 
the subset of files whose names match the locale encoding under which it 
is run.  
It sounds like it would be inappropriate for Python to begin shipping 
files with non-ASCII names as part of its Linux distribution, unless 
facilities are created or tools used to remap non-ASCII names to the 
local locale encoding.  Locales that are not ASCII supersets (in 
character repertoire, not encoding) could not be supported.  Locales 
that do not support all the characters used in files shipped with Python 
could not be supported.  Since locales vary wildly in their available 
non-ASCII names, that limits Python either to shipping ASCII names only, 
or restricting the locales that are supported to those that support the 
characters used.
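
To make the constraint concrete, here is a small check of which names in 
a list cannot be represented in a given encoding.  The file names are 
invented examples, not files Python ships.

```python
def unsupported_names(names, encoding):
    """Return the names that cannot be encoded in `encoding`."""
    bad = []
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            bad.append(name)
    return bad

# One ASCII name, one Latin-1 name, one name in Japanese characters.
shipped = ["abc.py", "caf\u00e9.py", "\u65e5\u672c.py"]

bad_ascii = unsupported_names(shipped, "ascii")
bad_latin1 = unsupported_names(shipped, "latin-1")
bad_utf8 = unsupported_names(shipped, "utf-8")
```

Only the ASCII name survives under every encoding tried here, which is 
the "ASCII names only" option described above.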

I suppose that Victor's patch would point out most or all the places 
where such transformations would have to be implemented, if it is 
important to support systems having byte string file names whose users 
cannot agree to use a single encoding for transforming to/from characters.