[issue3080] Full unicode import system

Wed Jan 19 02:22:13 CET 2011

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

Here is a work-in-progress patch: issue3080-3.patch. The patch is HUGE and written for Python 3.3.

$ diffstat issue3080-3.patch 
 Doc/c-api/module.rst   |   24 
 Include/import.h       |   73 +
 Include/moduleobject.h |    2 
 Include/pycapsule.h    |    4 
 Modules/zipimport.c    |  272 +++---
 Objects/moduleobject.c |   52 -
 PC/import_nt.c         |   84 +-
 Python/dynload_aix.c   |    2 
 Python/dynload_dl.c    |    2 
 Python/dynload_hpux.c  |    2 
 Python/dynload_next.c  |    4 
 Python/dynload_os2.c   |    2 
 Python/dynload_shlib.c |    2 
 Python/dynload_win.c   |    2 
 Python/import.c        | 1910 +++++++++++++++++++++++++++----------------------
 Python/importdl.c      |   79 +-
 Python/importdl.h      |    2 
 issue3080.py           |   29 
 18 files changed, 1484 insertions(+), 1063 deletions(-)

As expected, most of the work is done in import.c.

The patch decodes the module name earlier and encodes it later. It tries to manipulate PyUnicodeObject objects instead of char* buffers, so the string length is directly available.
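
To illustrate the "decode early, encode late" idea, here is a minimal sketch in Python rather than C. It is not code from the patch: the helper name module_filename() and the ".py" suffix handling are hypothetical, and the surrogateescape error handler is an assumption about how undecodable bytes would be carried through.

```python
# Illustrative sketch only; module_filename() is a hypothetical helper,
# not a function from issue3080-3.patch.

def module_filename(raw_name: bytes, encoding: str) -> bytes:
    # Decode once, at the point where the bytes enter the import code.
    # surrogateescape keeps undecodable bytes as lone surrogates.
    name = raw_name.decode(encoding, "surrogateescape")

    # All manipulation happens on the text object, whose length is
    # known without scanning for a terminating NUL byte.
    filename = name.replace(".", "/") + ".py"

    # Encode once, just before the value is handed to the OS.
    return filename.encode(encoding, "surrogateescape")


latin1_result = module_filename(b"pkg.caf\xe9", "iso-8859-1")
ascii_result = module_filename(b"caf\xe9", "ascii")
```

With the ASCII codec the byte 0xE9 cannot be decoded, but it survives the round trip because it is stored as a lone surrogate and restored on encoding.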

Split the huge and very complex find_module() function into 3 functions (find_module, find_module_filename and find_module2) and document them. Drop OS/2 support in find_module(): it could be kept, but it was easier for me to drop it, and the OS/2 maintainer wrote that Python 3 is far from being compatible with OS/2.

The patch creates some functions: PyModule_GetNameObject(), PyImport_ExecCodeModuleUnicode(), PyImport_AddModuleUnicode(), PyImport_ImportFrozenModuleUnicode(), PyModule_NewUnicode(), ...

Use the "U" format to parse a module name, and "%R" to format a module name (to escape surrogate characters and add quotes, instead of "... '%.200s' ...").
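
The effect of formatting with the repr can be sketched in Python, where "%r" plays roughly the role that "%R" plays in PyUnicode_FromFormat(). This is an illustration, not code from the patch; describe_missing() and the message text are hypothetical.

```python
# Illustrative sketch only; describe_missing() is a hypothetical helper.

def describe_missing(name: str) -> str:
    # repr() adds the quotes and escapes lone surrogates.
    return "No module named %r" % name


# A name containing a byte that could not be decoded (kept as a surrogate).
undecodable = b"caf\xe9".decode("ascii", "surrogateescape")

message = describe_missing(undecodable)

# The repr-based message is plain ASCII here, so it can always be printed.
printable = message.encode("ascii")

# Inserting the raw string instead leaves the lone surrogate in the
# message, which then cannot be encoded to UTF-8.
raw_message = "No module named '%s'" % undecodable
try:
    raw_message.encode("utf-8")
    raw_is_encodable = True
except UnicodeEncodeError:
    raw_is_encodable = False
```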

PyWin_FindRegisteredModule() is now private. The fqname argument of _PyImport_GetDynLoadFunc() is removed because it wasn't used.

Replace open_exclusive() by fopen(name, "wb") on Windows: is it correct?
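
For context on that question, the following Python sketch contrasts an exclusive create with a plain "wb" open. It assumes that open_exclusive() relies on O_EXCL semantics, as its name suggests; the helper names and the file name are hypothetical and this is not code from the patch.

```python
import os
import tempfile

# Illustrative sketch only; both helpers are hypothetical.

def write_exclusive(path: str, data: bytes) -> bool:
    """Create path only if it does not exist yet; return False otherwise."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        # Another writer got there first: leave its file alone.
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True


def write_truncating(path: str, data: bytes) -> None:
    """Equivalent of fopen(name, "wb"): silently replaces existing content."""
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "module.pyc")

    first = write_exclusive(target, b"first")
    second = write_exclusive(target, b"second")
    with open(target, "rb") as f:
        after_exclusive = f.read()

    write_truncating(target, b"third")
    with open(target, "rb") as f:
        after_truncating = f.read()
```

The second exclusive write is refused and the first content is kept, whereas the "wb" open overwrites whatever is there. Whether that difference matters depends on whether two processes can write the same file at once.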

TODO:

 - rename xxxobj => xxx to keep the original names and have a shorter patch (e.g. I renamed name to nameobj during the transition to detect bugs)
 - catch encoding errors in case_ok()
 - don't encode in case_ok() if case_ok() does nothing (e.g. on Linux)
 - find a better name for find_module2()

The patch contains a tiny script, issue3080.py, to test the patch using an ISO-8859-1 locale.

I will open a thread on the mailing list (python-dev) to decide if this patch is needed or not. If we agree that this issue should be fixed, I will split the patch into smaller parts and start a review process.

----------
keywords: +patch
Added file: http://bugs.python.org/file20448/issue3080-3.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3080>
_______________________________________