[Python-Dev] Unicode Imports

"Martin v. Löwis" martin at v.loewis.de
Sat Sep 9 21:16:45 CEST 2006


David Hopwood wrote:
> On Windows, file system pathnames can contain arbitrary Unicode characters
> (well, almost). Despite the existence of "ANSI" filesystem APIs, and
> regardless of what 'sys.getfilesystemencoding()' returns, the underlying
> file system encoding for NTFS and FAT filesystems is UTF-16LE.
> 
> Thus, either:
>  - the fact that sys.getfilesystemencoding() returns a non-Unicode encoding
>    on Windows is a bug, or
>  - any program that relies on sys.getfilesystemencoding() being able to
>    encode arbitrary Windows pathnames has a bug.
> 
> We need to decide which of these is the case.

There is a third option:
- the operating system has a bug

It is actually this option that rules out the other two.
sys.getfilesystemencoding() returns "mbcs" on Windows, which means
CP_ACP. The file system encoding is an encoding that converts a
file name into a byte string. Unfortunately, on Windows, there are
file names which cannot be converted into a byte string in a standard
manner. This is an operating system bug (or mis-design; they should
have chosen UTF-8 as the byte encoding of file names, instead of
making it depend on the system locale, but they of course did so
for backwards compatibility with Windows 3.1 and 9x).
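
To make the problem concrete, here is a minimal sketch written for a
current Python, where str is Unicode. It assumes cp1252 as a stand-in
for CP_ACP (the real value depends on the system locale), and the file
name and the helper can_encode are made up for illustration. It only
shows that such a name has no byte representation in the code page,
while UTF-8 can represent it.

```python
# Assumption: cp1252 stands in for CP_ACP; the actual code page depends on
# the system locale. The file name below is hypothetical.
ANSI_CODE_PAGE = "cp1252"


def can_encode(name, encoding):
    """Return True if `name` can be converted to bytes with `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


greek_name = "\u03b1\u03b2\u03b3.txt"  # three Greek letters plus ".txt"

print(can_encode(greek_name, ANSI_CODE_PAGE))  # cp1252 has no Greek letters
print(can_encode(greek_name, "utf-8"))         # UTF-8 can represent this name
```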

As a side note: every encoding in Python is a Unicode encoding;
so there aren't any "non-Unicode encodings".
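
A small sketch of what that means in practice, again written for a
current Python: each codec below maps the same Unicode string to a
different byte string and back. The sample string and the choice of
codecs are arbitrary.

```python
# Every codec converts between Unicode strings and bytes; they differ only
# in which characters they cover and which bytes they produce.
text = "L\u00f6wis"

for codec in ("latin-1", "cp1252", "utf-8", "utf-16-le"):
    data = text.encode(codec)
    # Decoding with the same codec gives back the original Unicode string.
    assert data.decode(codec) == text
    print(codec, data)
```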

Programs that rely on sys.getfilesystemencoding() being able to
represent arbitrary file names on Windows might have a bug;
programs that rely on sys.getfilesystemencoding() being able
to encode all elements of sys.path do not (at least not for
Python 2.5 and earlier).
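
One way to check that assumption for a given installation is sketched
below. The helper name unencodable_entries is hypothetical, and the
code targets a current Python rather than 2.5; it simply reports which
path entries cannot be encoded with a given encoding, defaulting to
sys.getfilesystemencoding().

```python
import sys


def unencodable_entries(paths, encoding=None):
    """Return the entries of `paths` that `encoding` cannot represent.

    Hypothetical helper: `encoding` defaults to the file system encoding.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    bad = []
    for entry in paths:
        if not isinstance(entry, str):
            continue  # only text entries are checked in this sketch
        try:
            entry.encode(encoding)
        except UnicodeEncodeError:
            bad.append(entry)
    return bad


# Lists any sys.path entries the file system encoding cannot encode.
print(unencodable_entries(sys.path))
```

Passing an explicit encoding such as "cp1252" lets you see what would
happen under a particular code page without changing the system locale.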

Regards,
Martin


