[Python-ideas] Fix default encodings on Windows

Mon Aug 15 21:39:35 EDT 2016

On 15Aug2016 1819, eryk sun wrote:
> On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower at python.org> wrote:
>>
>> (Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes
>> paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate
>> pairs. But that would prevent basic manipulation which seems to be a higher
>> priority.)
>
> The CRT manually decodes and encodes using the private functions
> __acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use
> either the ANSI or OEM codepage, depending on the value returned by
> WinAPI AreFileApisANSI. CPython could follow suit. Doing its own
> encoding and decoding would enable using filesystem functions that
> will never get an [A]NSI version (e.g. GetFileInformationByHandleEx),
> while still retaining backward compatibility.
>
> Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning
> when lpUsedDefaultChar is true. Filesystem decoding could use
> MB_ERR_INVALID_CHARS and raise a warning and retry without this flag
> for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This
> could be implemented with a new "warning" handler for
> PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new
> 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose
> between CP_ACP and CP_OEMCP.
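
For readers following along, the strict-then-retry decoding proposed above can be sketched roughly as follows. This is only an illustration: it uses Python's portable "cp932" codec as a stand-in for a DBCS ANSI code page rather than calling MultiByteToWideChar with MB_ERR_INVALID_CHARS, and the function name decode_fs_bytes is hypothetical, not an existing or proposed CPython API.

```python
import warnings


def decode_fs_bytes(raw: bytes, encoding: str = "cp932") -> str:
    """Hypothetical helper: decode strictly, warn and retry leniently on failure.

    "cp932" is an assumed stand-in for whatever CP_ACP or CP_OEMCP would be.
    """
    try:
        # Strict pass, analogous in spirit to MB_ERR_INVALID_CHARS.
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError as exc:
        # The warning depends on the value of the data, not on its type.
        warnings.warn(
            f"invalid {encoding} sequence in path: {exc.reason}",
            UnicodeWarning,
            stacklevel=2,
        )
        # Lenient retry: undecodable bytes become U+FFFD, so data is lost.
        return raw.decode(encoding, "replace")
```

Note that the lenient retry cannot round-trip: once a byte has been replaced, the original path can no longer be recovered from the returned string.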

None of that makes it less complicated or more reliable. Warnings based 
on values are bad (they should be based on types) and using the *W APIs 
exclusively is the right way to go. The question then is whether we 
allow file system functions to return bytes, and if so, which encoding 
to use. This then directly informs what the functions accept, for the 
purposes of round-tripping.

*Any* encoding that may silently lose data is a problem, which basically 
leaves utf-16 as the only option. However, as that causes other 
problems, maybe we can accept the tradeoff of returning utf-8 and 
failing when a path contains invalid surrogate pairs (which is extremely 
rare by comparison to characters outside of CP_ACP)?
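
To make the tradeoff concrete, here is a small sketch using only Python's built-in codecs. The file names are made up, and "cp1252" with the "replace" handler is merely an assumed stand-in for a CP_ACP conversion (the real Windows conversion may apply best-fit mapping instead of "?").

```python
# A name containing a lone (unpaired) surrogate, which NTFS permits.
odd_name = "bad\ud800name.txt"

# UTF-16-LE with the "surrogatepass" handler round-trips it unchanged.
raw16 = odd_name.encode("utf-16-le", "surrogatepass")
assert raw16.decode("utf-16-le", "surrogatepass") == odd_name

# Strict UTF-8 refuses the same name, so the failure is loud, not silent.
try:
    odd_name.encode("utf-8")
    utf8_failed = False
except UnicodeEncodeError:
    utf8_failed = True

# A legacy code page silently loses characters it cannot represent.
cyrillic_name = "файл.txt"
lossy = cyrillic_name.encode("cp1252", "replace")  # → b'????.txt'

# UTF-8 handles the far more common case (valid text outside CP_ACP).
assert cyrillic_name.encode("utf-8").decode("utf-8") == cyrillic_name
```

The point of the comparison is that UTF-8 fails visibly on the rare invalid-surrogate case, while an ANSI code page corrupts the much more common non-ANSI case without any error.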

If utf-8 is unacceptable, we're back to the current situation and should 
be removing the support for bytes that was deprecated three versions ago.

Cheers,
Steve