[Python-Dev] Import and unicode: part two

Thu Jan 20 03:51:05 CET 2011

On Wednesday, 19 January 2011 at 18:07 -0800, Toshio Kuratomi wrote:
> Saying that multiple encodings on a single system is a misconfiguration
> every time it comes up does not make it true.

Yes, each filesystem can have its own encoding. For example, this is
supported by Linux. Python doesn't support such a configuration, but this
limitation is wider than the import machinery. If you consider it
important enough, please open an issue.

> To the existing list I'd add getting a package from pypi --
> neither tar nor zip files contain encoding information about the filenames.

ZIP contains a flag to indicate the filename encoding: cp437 or UTF-8.
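
To see the flag from Python, here is a small sketch (not part of my
patch) using the zipfile module. UTF8_FLAG and filename_encodings() are
names I made up for the example; 0x800 is bit 11 of the ZIP "general
purpose bit flag", which marks a UTF-8 filename.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def filename_encodings(names):
    """Write empty members to an in-memory ZIP, then report which
    filename encoding each member's flag announces."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return {
            info.filename: "UTF-8" if info.flag_bits & UTF8_FLAG else "cp437"
            for info in zf.infolist()
        }


result = filename_encodings(["plain.py", "café.py"])
```

The zipfile module sets the flag only when the name is not pure ASCII,
so only the second name should be reported as UTF-8.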

TAR has an extension called "PAX" which stores filenames as UTF-8. But
yes, most tarballs store filenames as raw byte strings.
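
A similar sketch for PAX, using the tarfile module (pax_roundtrip() is a
hypothetical helper, the rest is the standard library): a non-ASCII name
is stored in an extended "path" header and read back as text.

```python
import io
import tarfile


def pax_roundtrip(name):
    """Store an empty member called `name` in an in-memory PAX archive
    and return the name and extended headers read back from it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        tf.addfile(tarfile.TarInfo(name))  # size 0, so no file object needed
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tf:
        member = tf.getmembers()[0]
        return member.name, member.pax_headers


name, headers = pax_roundtrip("café.py")
```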

Anyway, if you would like to share your code on PyPI, you should not use
non-ASCII module names (or any other non-ASCII name/identifier :-)).

Python 3 supports non-ASCII identifiers (PEP 3131), but the developer is
responsible for deciding whether to use them, depending on the
audience. For a lesson at school, it is nice to write examples in the
students' native language, instead of using "raw" English with ASCII
identifiers and filenames. In a school, you can use the same
configuration (encoding) on all computers.

> > > * Specify an encoding per platform and stick to that.
> > 
> > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> > programs will use it.
> > 
> (...) This prevents getting a mixture of encodings of modules (...)

If you have an issue with encodings, you have to fix it when you create
a module (on disk), not when you load a module (it is too late).

> (...) I mean something at the python code level::
> 
>    import café encoded_as('latin1')

Import a module using its byte name? You mean that the café filename was
not encoded to the Python filesystem encoding, but to another (wrong)
encoding, when the module was created. As I wrote before, you should
fix your filename, instead of using an (ugly) workaround in Python.
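
Fixing the filename could look like the sketch below. Both functions
are hypothetical helpers written for this mail, and the "wrong" encoding
is something you have to know (or guess) yourself; nothing can detect it
reliably.

```python
import os
import sys


def fixed_name(raw, wrong_encoding, fs_encoding=None):
    """Return the bytes that the on-disk name `raw` should have in
    fs_encoding, assuming it was created using wrong_encoding."""
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    try:
        raw.decode(fs_encoding)  # already valid: nothing to fix
        return raw
    except UnicodeDecodeError:
        return raw.decode(wrong_encoding).encode(fs_encoding)


def rename_in_directory(directory, wrong_encoding):
    """Rename every misencoded entry of `directory` (not recursive)."""
    dirb = os.fsencode(directory)
    for raw in os.listdir(dirb):  # bytes in, bytes out
        new = fixed_name(raw, wrong_encoding)
        if new != raw:
            os.rename(os.path.join(dirb, raw), os.path.join(dirb, new))
```

For example, fixed_name(b"caf\xe9.py", "latin-1", "utf-8") gives the
UTF-8 name b"caf\xc3\xa9.py", and a name that is already valid UTF-8 is
returned unchanged.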

> I haven't looked at your patch so
> perhaps you have an ingenous method of translating from the unicode
> representation of the module in the import statement to the bytes in
> arbitrary encodings on the filesystem that I haven't thought of.

On Windows, my patch tries to avoid any conversion: it uses Unicode
everywhere.

On other OSes, it uses the Python filesystem encoding to encode a module
name (as is done for any other operation on the filesystem with a
Unicode filename).
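
In Python terms, the idea is roughly the following. This is only an
illustration, not the code of the patch, and module_name_to_bytes() is a
name invented for the example; on a real non-Windows system,
os.fsencode() does the same job with the actual settings.

```python
import sys


def module_name_to_bytes(name, encoding=None):
    """Encode a module name the way a Unicode filename is encoded on
    non-Windows systems: filesystem encoding + surrogateescape."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return name.encode(encoding, "surrogateescape")


utf8_name = module_name_to_bytes("café", "utf-8")
latin1_name = module_name_to_bytes("café", "latin-1")
```

The same module name gives different bytes depending on the encoding,
which is why a file created with one encoding is not found when Python
runs with another one.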

--

Python 3 supports bytes filenames so that it can read/copy/delete
undecodable filenames: filenames stored in an encoding different from
the system encoding, or broken filenames. It is also possible to access
these files using PEP 383 (with surrogate characters). This is useful
when running Python on an old system.
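
A short sketch of what PEP 383 does, assuming a UTF-8 filesystem
encoding and a file whose name was created as Latin-1 (the function
names are mine): the undecodable byte 0xE9 becomes the lone surrogate
U+DCE9, and encoding restores the original bytes.

```python
def decode_filename(raw, encoding="utf-8"):
    """Decode a bytes filename; undecodable bytes become surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_filename(name, encoding="utf-8"):
    """Reverse operation: surrogates are turned back into raw bytes."""
    return name.encode(encoding, "surrogateescape")


raw = b"caf\xe9.py"  # Latin-1 bytes, invalid as UTF-8
name = decode_filename(raw)
roundtrip = encode_filename(name)
```

So the file stays accessible, but its name is "caf\udce9.py", not
"café.py": it cannot be imported as café.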

> If you don't, however, then really - ASCII-only seems like the sanest 
> of the three solutions I can think of.

But a (Python 3) module is not supposed to have a broken filename. If
that is the case, you had better fix its name, instead of trying to fix
the problem later (in Python).

With UTF-8 filesystem encoding (e.g. on Mac OS X, and most Linux setups),
it is already possible to use non-ASCII module names.
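
A quick way to check it on your own system, assuming a UTF-8 filesystem
encoding and a recent Python 3 with importlib (import_cafe() is a
throwaway helper for this mail; with an ASCII filesystem encoding,
creating the file fails with UnicodeEncodeError):

```python
import importlib
import os
import sys
import tempfile


def import_cafe():
    """Create café.py in a temporary directory, import it and return
    the value it defines."""
    name = "café"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name + ".py")
        with open(path, "w", encoding="utf-8") as f:
            f.write("VALUE = 42\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(name)
            return module.VALUE
        finally:
            # Leave sys.path and sys.modules as we found them.
            sys.path.remove(tmp)
            sys.modules.pop(name, None)


value = import_cafe()
```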

Victor