[Python-Dev] Unicode strings as filenames

Neil Hodgson nhodgson@bigpond.net.au
Sun, 6 Jan 2002 09:37:39 +1100


   I explored the possibility of detecting Unicode arguments to open and
calling _wfopen on Windows NT. This led to trying to store Unicode strings in
the f_name and f_mode fields of the file object, which started to escalate in
complexity and makes Mark's mbcs choice more understandable.

   Another approach is to use utf-8 as the Py_FileSystemDefaultEncoding and
then convert to and from wide characters in each file system access function.
The core file open function from fileobject.c, changed to work with utf-8, is
at the end of this message; the important lines are in the #ifdef MS_WIN32
section. Along with that change goes a change of Py_FileSystemDefaultEncoding
to "utf-8" rather than "mbcs".
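
   To illustrate the convert-then-call idea outside the interpreter, here is
a minimal sketch. It assumes the Win32 MultiByteToWideChar call instead of the
PyUnicode API used in the patch, and fopen_utf8 is a hypothetical name, not
something in the Python sources. On other platforms it just calls fopen.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not in the Python sources): open a file whose
   name and mode are given as UTF-8 encoded C strings. */
static FILE *
fopen_utf8(const char *name, const char *mode)
{
    FILE *fp = NULL;

    assert(name != NULL);
    assert(mode != NULL);
#ifdef _WIN32
    {
        wchar_t wname[MAX_PATH];
        wchar_t wmode[16];
        /* A source length of -1 converts up to and including the
           terminating NUL; a return of 0 means the conversion failed. */
        if (MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_PATH) > 0 &&
            MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) > 0)
            fp = _wfopen(wname, wmode);
    }
#endif
    /* Fall back to the narrow call if the wide path was not taken
       or did not succeed. */
    if (fp == NULL)
        fp = fopen(name, mode);
    return fp;
}
```

   A caller would pass the utf-8 bytes straight through, for example
fopen_utf8("caf\xc3\xa9.txt", "r"), and check the result for NULL as with
fopen.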

   This change works for me on Windows 2000 and allows access to all files
no matter what the current code page is set to. On Windows 9x (not yet
tested), the _wfopen call should fail, causing a fallback to fopen. Possibly
the OS should be detected instead and _wfopen not attempted on 9x. On 9x,
mbcs may be a better choice of encoding although it may also be possible to
ask the file system to find the wide character file name and return the
mangled short name that can then be used by fopen.

   The best approach to me seems to be to make Py_FileSystemDefaultEncoding
settable by the user, at least allowing the choice between 'utf-8' and
'mbcs' with a default of 'utf-8' on NT and 'mbcs' on 9x.
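
   The selection rule could look something like the sketch below. The
function name, the NULL-means-default convention and the on_nt flag are all
assumptions made for illustration; how the platform is detected and how the
user supplies a choice are left open.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pick the file system encoding name.
   requested is the user's choice, or NULL for the platform default.
   on_nt is nonzero on the NT family and zero on 9x.
   Returns NULL if the requested name is not one of the two choices. */
static const char *
choose_fs_encoding(const char *requested, int on_nt)
{
    if (requested != NULL) {
        if (strcmp(requested, "utf-8") == 0)
            return "utf-8";
        if (strcmp(requested, "mbcs") == 0)
            return "mbcs";
        return NULL;    /* unsupported choice */
    }
    /* Defaults proposed above: utf-8 on NT, mbcs on 9x. */
    return on_nt ? "utf-8" : "mbcs";
}
```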

   This approach can be extended to other file system calls: for example,
os.listdir and glob.glob could, upon detecting a utf-8 default encoding, use
wide character system calls and convert the results to utf-8.
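
   The return direction, from wide character names back to utf-8, might be
sketched as below. Inside Python the Unicode API would do this, and on Windows
WideCharToMultiByte could too; the hand-written encoder here is only a
portable stand-in, and utf16_to_utf8 is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: encode n UTF-16 code units as a NUL-terminated
   UTF-8 string in out, which has room for cap bytes.
   Returns the number of bytes written, not counting the NUL, or -1 if
   the buffer is too small or the input has an unpaired surrogate. */
static int
utf16_to_utf8(const unsigned short *in, size_t n, char *out, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned long c = in[i++];
        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= n || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return -1;
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i++] - 0xDC00);
        }
        else if (c >= 0xDC00 && c <= 0xDFFF)
            return -1;  /* stray low surrogate */

        /* Each branch keeps one byte free for the final NUL. */
        if (c < 0x80) {
            if (o + 1 >= cap) return -1;
            out[o++] = (char)c;
        }
        else if (c < 0x800) {
            if (o + 2 >= cap) return -1;
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000) {
            if (o + 3 >= cap) return -1;
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
        else {
            if (o + 4 >= cap) return -1;
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (o >= cap)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

   A listdir-style loop would call this once per name returned by the wide
character directory functions before building the Python string.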

   Please criticise any stylistic or correctness issues in the code as it is
my first modification to the Python sources.

   Neil

static PyObject *
open_the_file(PyFileObject *f, char *name, char *mode)
{
 assert(f != NULL);
 assert(PyFile_Check(f));
 assert(name != NULL);
 assert(mode != NULL);
 assert(f->f_fp == NULL);

 /* rexec.py can't stop a user from getting the file() constructor --
    all they have to do is get *any* file object f, and then do
    type(f).  Here we prevent them from doing damage with it. */
 if (PyEval_GetRestricted()) {
  PyErr_SetString(PyExc_IOError,
   "file() constructor not accessible in restricted mode");
  return NULL;
 }
 errno = 0;
#ifdef HAVE_FOPENRF
 if (*mode == '*') {
  FILE *fopenRF();
  f->f_fp = fopenRF(name, mode+1);
 }
 else
#endif
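 {
#ifdef MS_WIN32
  /* Decode while the GIL is still held: the Unicode API must not
     be called between Py_BEGIN/END_ALLOW_THREADS. */
  PyObject *wname = NULL;
  PyObject *wmode = NULL;
  if (strcmp(Py_FileSystemDefaultEncoding, "utf-8") == 0) {
   wname = PyUnicode_DecodeUTF8(name, strlen(name), "strict");
   wmode = PyUnicode_DecodeUTF8(mode, strlen(mode), "strict");
   if (wname == NULL || wmode == NULL)
    PyErr_Clear(); /* not valid utf-8: fall back to fopen */
  }
  Py_BEGIN_ALLOW_THREADS
  if (wname && wmode) {
   f->f_fp = _wfopen(PyUnicode_AS_UNICODE(wname),
       PyUnicode_AS_UNICODE(wmode));
  }
  if (NULL == f->f_fp) {
   f->f_fp = fopen(name, mode);
  }
  Py_END_ALLOW_THREADS
  Py_XDECREF(wname);
  Py_XDECREF(wmode);
#else
  Py_BEGIN_ALLOW_THREADS
  f->f_fp = fopen(name, mode);
  Py_END_ALLOW_THREADS
#endif
 }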
 if (f->f_fp == NULL) {
#ifdef NO_FOPEN_ERRNO
  /* Metrowerks only, which does not always set errno */
  if (errno == 0) {
   PyObject *v;
   v = Py_BuildValue("(is)", 0, "Cannot open file");
   if (v != NULL) {
    PyErr_SetObject(PyExc_IOError, v);
    Py_DECREF(v);
   }
   return NULL;
  }
#endif
  if (errno == EINVAL)
   PyErr_Format(PyExc_IOError, "invalid argument: %s",
         mode);
  else
   PyErr_SetFromErrnoWithFilename(PyExc_IOError, name);
  f = NULL;
 }
 return (PyObject *)f;
}