Files with Japanese filenames under Windows

Mon Jan 7 09:37:09 EST 2002

Hello,

We found a number of problems concerning the handling of Japanese filenames
on the Windows platform.  Can you please tell me where to find more information
on these issues, or what I can do myself to fix some of them?

1. In Python 2.2, you can pass the names of files or directories to
   open() or os.listdir() as unicode objects instead of strings.
   Unfortunately, this works only if the unicode object you specify is
   representable in the codepage determined by the system's "regional
   options".  Once you change the "regional options" from Japanese to
   English, unicode filenames that were previously accessible become
   inaccessible.

   As far as I understand, this is because Python internally converts
   the unicode filename to the system codepage (determined by the
   "regional options").  On Western (US, European) systems, the system
   codepage is some 8-bit codepage that cannot represent the Japanese
   name, so the name gets corrupted and the file cannot be accessed.

   The real problem is that files with Japanese filenames (which may
   reside on some remote server) cannot be accessed from Python on a
   Windows computer with a non-Japanese "regional options" setting.

   As far as I know, to solve this either the corresponding locale
   setting has to be changed (which, to my knowledge, the Python APIs
   for locale do not allow), or the special
   Windoze-Wide-String-Unicode-What-So-Ever APIs have to be used to
   access (open or listdir) the Unicode filename.

   Because I don't know how to use locale, and I can't find any
   documentation that explains locale in a way I can understand (any
   recommendation?), I am thinking of implementing the second option,
   either as an extension for my own purposes or maybe as a patch to
   Python itself.  Maybe somebody else has solved this already?
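
   To make the conversion problem concrete, here is a minimal sketch.
   It assumes a Python whose codec set includes cp932 and cp1252
   (stock 2.2 does not ship cp932), and representable_in is a
   hypothetical helper name, not an existing Python API.  It only
   checks whether a unicode name survives conversion to a given
   codepage; it does not touch the file system.

```python
def representable_in(name, codepage):
    """Return True if the unicode name can be encoded in the codepage."""
    try:
        name.encode(codepage)
    except UnicodeError:
        return False
    return True

# The directory name used in the open() calls of item 2.
name = u'\u30c6\u30b9\u30c8\u7528\u30d5\u30a9\u30eb\u30c0'

print(representable_in(name, 'cp932'))    # Japanese codepage
print(representable_in(name, 'cp1252'))   # Western European codepage

# A lossy conversion turns every character into "?", which is not a
# legal filename character on Windows.
print(name.encode('cp1252', 'replace'))
```

   The first check succeeds and the second fails.  The lossy result is
   only roughly comparable to what a best-effort conversion to the
   system codepage would produce.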

2. When trying to access a file with a unicode name that gets corrupted
   during the internal conversion, Python 2.2 produces a misleading
   and incorrect error message (I tried to report this for the beta,
   but somehow my report got lost):

>>> open(u'\u30c6\u30b9\u30c8\u7528\u30d5\u30a9\u30eb\u30c0\\test1en.doc')
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in ?
    open(u'\u30c6\u30b9\u30c8\u7528\u30d5\u30a9\u30eb\u30c0\\test1en.doc')
IOError: invalid argument: r
>>> open(u'\u30c6\u30b9\u30c8\u7528\u30d5\u30a9\u30eb\u30c0\\test1en.doc', 'rb')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in ?
    open(u'\u30c6\u30b9\u30c8\u7528\u30d5\u30a9\u30eb\u30c0\\test1en.doc', 'rb')
IOError: invalid argument: rb

   (The problem is with the filename and has nothing to do with the
   optional second argument to open(), which is what the message
   reports.)

3. If I os.listdir() a directory, I get a list of plain strings; on a
   Japanese system they are encoded in sjis, the standard encoding
   there.  If I want to use these names for anything other than
   immediately opening the files on the same system, a unicode
   representation of the name would be better.  To get this, I have to
   query for the filename encoding, but I don't know how to do this in
   a portable way (one that also works under Unix).
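
   One possible approach, as a minimal sketch only: Python 2.3 and
   later provide sys.getfilesystemencoding(), which reports the
   encoding Python uses for filenames on the current platform; it does
   not exist in 2.2.  The code below is written for Python 3, where
   os.fsencode() is available and os.listdir() returns byte strings
   when given a byte-string path.  listdir_unicode is a hypothetical
   helper name.

```python
import os
import sys

def listdir_unicode(path):
    """List a directory as byte strings and decode each name explicitly."""
    encoding = sys.getfilesystemencoding()
    names = []
    for raw in os.listdir(os.fsencode(path)):
        # 'replace' keeps the listing usable even if a name is not valid
        # in the reported encoding; the affected characters are lost.
        names.append(raw.decode(encoding, 'replace'))
    return names

if __name__ == '__main__':
    for entry in listdir_unicode('.'):
        print(repr(entry))
```

   Whether the reported encoding matches the bytes actually stored in
   a directory depends on the platform and on how the files were
   created, so this is a starting point rather than a guarantee.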

4. os.path.split() does not always give the correct result on a
   Japanese system, because in sjis, the standard encoding there, the
   "\" (byte 0x5C) may appear as the second byte of a multibyte
   character.  First converting to unicode, then using os.path.split()
   gives the right result, but again I have to get the system codepage
   for filenames to do this.
   Path arithmetic that involves os.listdir, os.path.split and
   os.path.join may fail miserably on a Japanese system if some
   troublesome filenames are used (and, of course, in reality these
   names are used - I will be glad to provide examples, if anyone
   needs them...).
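
   A small illustration of the splitting problem, as a sketch under
   stated assumptions: it uses ntpath (the Windows flavour of os.path,
   importable on any platform), a Python that ships the cp932 codec,
   and a made-up path.  The character u'\u8868' encodes in cp932 to the
   bytes 0x95 0x5C, and 0x5C is also the byte for the "\" separator.

```python
import ntpath  # Windows path rules, regardless of the host platform

# Made-up example path: directory "dir", file name starting with U+8868.
name = u'dir\\\u8868.txt'
raw = name.encode('cp932')

# Splitting the encoded bytes treats the trail byte 0x5C as a separator
# and cuts the two-byte character in half.
print(ntpath.split(raw))

# Decoding first (the codepage is assumed to be known here) keeps the
# character intact.
head, tail = ntpath.split(raw.decode('cp932'))
print(head == u'dir', tail == u'\u8868.txt')
```

   The second split returns the intended directory and file name; the
   first one returns a directory ending in a stray lead byte and a file
   name of just ".txt".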

Thanks in advance for any help.
- Guenter