[Python-Dev] My work on Python3 and non-ascii paths is done

Tue Oct 19 10:29:20 CEST 2010

Victor Stinner wrote:
> Hi,
> 
> Seven months after my first commit related to this issue, the full test suite
> of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a
> non-ASCII source directory. It means that Python 3.2 now correctly processes
> filenames in all modules, build scripts and other utilities, with any locale
> encoding.
> 
> 
> General changes:
> 
>  * Encode/decode filenames with the locale encoding, instead of utf-8,
>    until the filesystem encoding is set
>  * mbcs encoding (Windows filesystem encoding) is now strict by default,
>    whereas it ignored unencodable characters and replaced undecodable bytes
>    in Python 3.1. The old behaviour is still available with the right error
>    handler: 'ignore' to encode, 'replace' to decode.
>  * tarfile uses utf-8 encoding on Windows (instead of mbcs), and the
>    surrogateescape error handler on all OSes
>  * sys.getfilesystemencoding() cannot be None anymore
>  * Don't accept bytearray as filenames anymore
> 
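As an illustration of the error handlers named in the list above, here is a minimal sketch in plain Python 3. The byte string and variable names are made up for the example, and the ASCII codec stands in for a strict filesystem codec, since mbcs only exists on Windows.

```python
# Hypothetical filename: ISO-8859-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN...
name = raw.decode("utf-8", "surrogateescape")

# ...and the same handler restores the original bytes when encoding.
restored = name.encode("utf-8", "surrogateescape")

# A strict codec raises instead of silently dropping characters.
try:
    "caf\u00e9.txt".encode("ascii")  # the default error handler is 'strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The old lossy behaviour has to be requested explicitly.
lossy = "caf\u00e9.txt".encode("ascii", "ignore")
```

The round trip through surrogateescape is lossless, while 'ignore' quietly drops the character it cannot encode.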
> 
> Changes of the Python API:
> 
>  * Add os.environb: bytes version of os.environ, os.getenvb() function
>    and os.supports_bytes_environ constant
>  * Add os.fsencode() and os.fsdecode() functions
>  * Remove sys.setfilesystemencoding() function
> 
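A short sketch of how the new os functions listed above might be used. The filename bytes and variable names are invented for illustration, and the lossless round trip shown here is only something I would rely on for POSIX systems, where the surrogateescape handler is used.

```python
import os

# Hypothetical filename as raw bytes; 0xFF is not valid UTF-8.
raw_name = b"data\xff.bin"

# Decode to str with the filesystem encoding, then encode back to bytes.
text_name = os.fsdecode(raw_name)
bytes_again = os.fsencode(text_name)

# os.environb and os.getenvb() only exist where the native environment
# is bytes, which os.supports_bytes_environ reports.
if os.supports_bytes_environ:
    home_bytes = os.environb.get(b"HOME")
    home_via_func = os.getenvb(b"HOME")
else:
    home_bytes = home_via_func = None
```

Checking os.supports_bytes_environ first keeps the code portable to platforms without a bytes environment.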
> 
> Changes of the C API:
> 
>  * Add PyUnicode_EncodeFSDefault() function
>  * Add PyUnicode_FSDecoder() ParseTuple converter
>  * Add PySys_FormatStdout(), PySys_FormatStderr() and PyErr_WarnFormat()
>    functions
>  * Add PyUnicode_AsWideCharString() function: it doesn't need a buffer size.
>  * Add Py_UNICODE_strrchr(), Py_UNICODE_strcat(), PyUnicode_AsUnicodeCopy()
>    and Py_UNICODE_strncmp() functions
>  * PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() use the
>    surrogateescape error handler
>  * File utilities: add _Py_wchar2char() (reverse of Py_char2wchar()),
>    _Py_stat() and _Py_fopen() functions; move all file utilities to
>    Python/fileutils.c
>  * The format string of PyUnicode_FromFormat() and PyErr_Format() must now
>    be pure ASCII: an error is raised on a non-ASCII character
>  * PyUnicode_FSConverter() doesn't accept bytearray anymore
> 
> 
> Bugfixes:
> 
>  * Fix modules: tarfile, pickle, pickletools, ctypes, subprocess, bz2, ssl,
>    profile, xmlrpclib, platform, libpython (gdb plugin), sqlite,
>    distutils.log, locale, _warnings, zipimport, imp
>  * Fix functions: os.exec*(), os.system(), ctypes.dlopen(), os.getenv(),
>    os.get_exec_path()
>  * Fix tests: test_gdb, test_httpservers, test_cmd_line, test_size,
>    test_generic_path, test_subprocess, test_doctest, test_cmd_line_script
>  * Fix utf-8 encoder to support error handlers producing a unicode string
>    (e.g. 'backslashreplace')
>  * Fix conversion from unicode to a wide character string if Py_UNICODE 
>    and wchar_t have different sizes: UTF-16 => UTF-32 or UTF-32 => UTF-16
>  * Fix Python command line parser if the command line contains surrogates
>  * Avoid _PyUnicode_AsString() because it returns NULL if the string contains
>    surrogates, or catch the error
>  * Fix regrtest.py to support surrogate characters in the current working
>    directory and in the tracebacks
> 
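To make the utf-8 encoder item above concrete, here is a small sketch of what encoding a string with a lone surrogate looks like in current Python 3. The sample string is made up, and this shows the observable behaviour only, not the internal fix.

```python
# A lone surrogate, as surrogateescape produces for the undecodable byte 0xE9.
text = "caf\udce9"

# The strict utf-8 encoder refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'backslashreplace' returns a str replacement, which the encoder then encodes.
escaped = text.encode("utf-8", "backslashreplace")
```

Here the handler's replacement is itself a unicode string (the six characters \udce9), which is the case the fix is about.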
> 
> I also wrote some tests and documentation.
> 
> The most difficult part was to debug Python initialization (Py_InitializeEx
> and calculate_path) and the import machinery (import.c, zipimport.c), because
> gdb sometimes crashes (for various reasons) and because the import
> machinery is fragile and difficult to understand.
> 
> Special thanks to Marc-Andre Lemburg, Martin v. Löwis, Antoine Pitrou and
> Amaury Forgeot d'Arc for their help, useful advice and code reviews!

Many thanks to you for opening up this can of worms and fighting
through all the issues !

> --- Bonus: short story of PYTHONFSENCODING ---
> 
> In the middle of August, I created the PYTHONFSENCODING environment variable, 
> as suggested by Marc-Andre Lemburg. Because of this variable and because 
> Python used utf-8 until the filesystem encoding was known, I had to write ugly
> and fragile "redecode" functions to redecode all filenames of all objects 
> (sys.path, sys.meta_path, sys.executable, sys.modules, all code objects, 
> etc.).
> 
> Then I found 4 issues related to PYTHONFSENCODING: inconsistencies between the
> filesystem encoding and the locale encoding. It was not easy to decide how to
> fix these issues, but in the end, we chose to drop the PYTHONFSENCODING
> variable, use the locale encoding as the filesystem encoding, and always use
> utf-8 as the filesystem encoding on Mac OS X.

Time will tell whether we can manage without some logic to tell
Python what to use as file system encoding without having
to rely on the locale settings. Anything is better than having
Python stop with a fatal error, though :-)

-- 
Marc-Andre Lemburg
eGenix.com
