[Python-ideas] Fix default encodings on Windows

Chris Angelico rosuav at gmail.com
Thu Aug 18 12:00:43 EDT 2016


On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower at python.org> wrote:
> On 18Aug2016 0829, Chris Angelico wrote:
>>
>> The second call to glob doesn't have any Unicode characters at all,
>> the way I see it - it's all bytes. Am I completely misunderstanding
>> this?
>
>
> You're not the only one - I think this has been the most common
> misunderstanding.
>
> On Windows, the paths as stored in the filesystem are actually all text -
> more precisely, utf-16-le encoded bytes, represented as strings of 16-bit
> characters.
>
> Conversion to an 8-bit character representation exists only for
> compatibility with code written for other platforms (either Linux, or much
> older versions of Windows). The operating system has one way to do the
> conversion to bytes, which Python currently uses, but since we control that
> transformation I'm proposing an alternative conversion that favours
> reliability over compatibility (compatibility with Windows 3.1, that is; it
> shouldn't affect code that properly handles multibyte encodings, which
> should include anything developed for Linux in the last decade or two).
>
> Does that help? I tried to keep the explanation short and focused :)

Ah, I think I see what you mean. There's a slight ambiguity in the
word "missing" here.

1) The Unicode character in the result lacks some of the information
it should have.

2) The Unicode character in the file name is information that has now been lost.

My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, e.g.:

"The Unicode character in the second call to glob is now lost information."

Is that a correct interpretation?
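
For what it's worth, here is a minimal sketch of the second reading. It
is not how Windows or glob actually performs the conversion: cp1252
with errors="replace" is only my stand-in for a legacy 8-bit code page,
UTF-8 is only a stand-in for an encoding that can represent every
character, and the file name is made up.

```python
# Hypothetical file name with a character that cp1252 cannot represent.
name = "file_\u4e2d.txt"  # U+4E2D, a CJK ideograph

# Assumed stand-in for a legacy 8-bit conversion: the unencodable
# character is replaced with "?", so the original cannot be recovered.
legacy_bytes = name.encode("cp1252", errors="replace")
legacy_roundtrip = legacy_bytes.decode("cp1252")

# Assumed stand-in for a conversion that covers all of Unicode:
# the bytes decode back to exactly the same name.
utf8_bytes = name.encode("utf-8")
utf8_roundtrip = utf8_bytes.decode("utf-8")

print(legacy_bytes)
print(legacy_roundtrip == name)
print(utf8_roundtrip == name)
```

In the first case the character is gone from the bytes themselves, so
no later decoding step can bring it back; that is the sense of "lost"
I mean.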

ChrisA

