[Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

Tue Oct 25 05:23:54 CEST 2011

+1 from me!

Mark

On 25/10/2011 9:57 AM, Victor Stinner wrote:
> Hi,
>
> I propose to raise Unicode errors if a filename cannot be decoded on Windows,
> instead of creating bogus filenames with question marks. Because this change
> is incompatible with Python 3.2, even if such filenames are unusable and I
> consider the problem a (Python?) bug, I would like your opinion on such a
> change before working on a patch.
>
> --
>
> Windows has worked internally on Unicode strings since Windows 95 (or
> something like that), but also provides an "ANSI" API using the ANSI code page
> and byte strings for backward compatibility. It was already proposed to drop
> the bytes API completely from our nt (os) module, but that may break Python
> backward compatibility (and it is difficult to list the Python programs that
> use the bytes API to access the file system).
>
> The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte
> (encode) functions in the default mode (flags=0): MultiByteToWideChar()
> replaces undecodable bytes with '?' and WideCharToMultiByte() ignores
> unencodable characters (!!!). This behaviour produces invalid filenames (see
> for example issue #13247) and *the user is unable to detect codec errors*.
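
A rough illustration of why the lossy behaviour described above is a problem.
This is only a sketch under assumptions: cp1252 stands in for the ANSI code
page (the real "mbcs" codec exists only on Windows), Python's "replace" and
"ignore" error handlers stand in for the flags=0 behaviour, and the helper
names are made up for the example. Note that Python's "replace" handler
substitutes U+FFFD on decoding where Windows substitutes '?'.

```python
# Assumption: cp1252 plays the role of the ANSI code page in this sketch.
ANSI_CODE_PAGE = "cp1252"


def lossy_decode(raw: bytes) -> str:
    """Decode, silently substituting bytes that cannot be decoded."""
    return raw.decode(ANSI_CODE_PAGE, errors="replace")


def lossy_encode(name: str) -> bytes:
    """Encode, silently dropping characters that cannot be encoded."""
    return name.encode(ANSI_CODE_PAGE, errors="ignore")


# The unencodable character disappears and no error is reported.
print(lossy_encode("Ł.txt"))  # b'.txt'

# Two different Unicode names now map to the same byte name.
print(lossy_encode("Ł.txt") == lossy_encode(".txt"))  # True

# Byte 0x81 is undefined in Python's cp1252 table; it is replaced silently.
print(ascii(lossy_decode(b"\x81.txt")))
```

In both directions the caller gets a name back without any sign that
information was lost, so the name may not refer to any existing file.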
>
> In Python 3.2, I changed the MBCS codec to make it strict: it raises a
> UnicodeEncodeError if a character cannot be encoded to the ANSI code page
> (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot
> be decoded from the ANSI code page (e.g. b'\xff' from cp932).
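
For comparison, a small sketch of what strict mode looks like from Python
code. Again cp1252 is only a portable stand-in for the ANSI code page, since
"mbcs" is available on Windows only, and the function names are hypothetical.
The decode example uses byte 0x81, which is undefined in Python's own cp1252
table, rather than the cp932 case from the message, because the portable
Python codecs do not necessarily match the Windows code pages byte for byte.

```python
# Assumption: cp1252 plays the role of the ANSI code page in this sketch.
ANSI_CODE_PAGE = "cp1252"


def strict_encode(name: str) -> bytes:
    """Encode a filename, raising UnicodeEncodeError on any loss."""
    return name.encode(ANSI_CODE_PAGE, errors="strict")


def strict_decode(raw: bytes) -> str:
    """Decode a filename, raising UnicodeDecodeError on invalid bytes."""
    return raw.decode(ANSI_CODE_PAGE, errors="strict")


for value, convert in (("Ł.txt", strict_encode), (b"\x81.txt", strict_decode)):
    try:
        convert(value)
    except UnicodeError as exc:
        # The error carries the codec name and the offending position,
        # so the caller can report it instead of using a bogus filename.
        print(type(exc).__name__, exc.encoding, exc.start)
```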
>
> I propose to reuse our MBCS codec in strict mode (error handler="strict")
> together with the Windows native (wide character) API, so that encode/decode
> errors are noticed immediately. It should simplify the source code: replace 2
> versions of a function by 1 version + optional code to decode arguments and/or
> encode the result.
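
The "1 version + optional decode/encode" shape could look roughly like the
following at the Python level. This is a sketch and not the proposed patch:
`listdir_strict` is a hypothetical name, cp1252 is an assumed stand-in for the
ANSI code page, and the real change would live in the C implementation of the
nt module.

```python
import os

# Assumption: cp1252 plays the role of the ANSI code page in this sketch.
ANSI_CODE_PAGE = "cp1252"


def listdir_strict(path):
    """Hypothetical wrapper: a single text code path serves str and bytes."""
    is_bytes = isinstance(path, bytes)
    if is_bytes:
        # Decode the argument strictly: raises UnicodeDecodeError on bad bytes.
        path = path.decode(ANSI_CODE_PAGE, "strict")

    names = os.listdir(path)  # the only call into the (text) API

    if is_bytes:
        # Encode the result strictly: raises UnicodeEncodeError instead of
        # returning a mangled name.
        names = [name.encode(ANSI_CODE_PAGE, "strict") for name in names]
    return names
```

A bytes caller either gets names that round-trip exactly or gets an exception
it can handle, and the str caller is unaffected.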
>
> --
>
> Read also the previous thread:
>
> [Python-Dev] Byte filenames in the posix module on Windows
> Wed Jun 8 00:23:20 CEST 2011
> http://mail.python.org/pipermail/python-dev/2011-June/111831.html
>
> --
>
> FYI I patched the Python MBCS codec again: it now correctly handles the ignore
> and replace modes (for both encoding and decoding), and it now also supports
> any error handler.
>
> --
>
> We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF).
> But the situation is the opposite of the situation on UNIX: on Windows, the
> problem is more on encoding (text->bytes) than on decoding (bytes->text). On
> UNIX, problems occur when the system is misconfigured (e.g. wrong locale
> encoding). On Windows, problems occur when your application uses the old
> (ANSI) API, whereas your filesystem is fully Unicode compliant and you created
> Unicode filenames with a program using the new (Windows) API.
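
To make the asymmetry concrete, here is a short sketch of the PEP 383
"surrogateescape" error handler. The encodings are chosen only for
illustration (utf-8 for the UNIX-style case, cp1252 as an assumed stand-in for
an ANSI code page), and `can_encode` is a made-up helper.

```python
# Decoding direction (the UNIX-style problem): an undecodable byte is kept
# as a lone surrogate and can be restored exactly when encoding again.
raw = b"caf\xff.txt"
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'caf\udcff.txt'
round_trip = name.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)  # True


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if text encodes with surrogateescape."""
    try:
        text.encode(encoding, errors="surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# Encoding direction (the Windows-style problem): a real character that the
# code page lacks is not an escaped byte, so surrogateescape cannot help.
print(can_encode("Ł.txt", "cp1252"))  # False
```

The handler only maps U+DC80-U+DCFF back to bytes on encoding, so it does
nothing for genuine characters that are missing from the target code page.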
>
> Only a few programs are fully Unicode compliant. A lot of programs fail if a
> filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial
> and Visual Studio).
>
> Victor