[Python-ideas] Fix default encodings on Windows

eryk sun eryksun at gmail.com
Thu Aug 18 12:39:45 EDT 2016


On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower <steve.dower at python.org> wrote:
> On 18Aug2016 0900, Chris Angelico wrote:
>>
>> On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower at python.org> wrote:
>>>
>>> On 18Aug2016 0829, Chris Angelico wrote:
>>>>
>>>>
>>>> The second call to glob doesn't have any Unicode characters at all,
>>>> the way I see it - it's all bytes. Am I completely misunderstanding
>>>> this?
>>>
>>>
>>>
>>> You're not the only one - I think this has been the most common
>>> misunderstanding.
>>>
>>> On Windows, the paths as stored in the filesystem are actually all text -
>>> more precisely, utf-16-le encoded bytes, represented as 16-bit character
>>> strings.
>>>
>>> Converting to an 8-bit character representation only exists for
>>> compatibility with code written for other platforms (either Linux, or
>>> much older versions of Windows). The operating system has one way to
>>> do the conversion to bytes, which Python currently uses, but since we
>>> control that transformation I'm proposing an alternative conversion
>>> that is more reliable than compatible (with Windows 3.1... shouldn't
>>> affect compatibility with code that properly handles multibyte
>>> encodings, which should include anything developed for Linux in the
>>> last decade or two).
>>>
>>> Does that help? I tried to keep the explanation short and focused :)
>>
>>
>> Ah, I think I see what you mean. There's a slight ambiguity in the
>> word "missing" here.
>>
>> 1) The Unicode character in the result lacks some of the information
>> it should have
>>
>> 2) The Unicode character in the file name is information that has now been
>> lost.
>>
>> My reading was the first, but AIUI you actually meant the second. If
>> so, I'd be inclined to reword it very slightly, eg:
>>
>> "The Unicode character in the second call to glob is now lost
>> information."
>>
>> Is that a correct interpretation?
>
>
> I think so, though I find the wording a little awkward (and on rereading, my
> original wording was pretty bad). How about:
>
> "The second call to glob has replaced the Unicode character with '?', which
> means the actual filename cannot be recovered and the path is no longer
> valid."

They're all just characters in the context of Unicode, so I think it's
clearest to use the character code, e.g.:

    The second call to glob has replaced the U+AB00 character with '?',
    which means ...
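
To make the lossy replacement concrete, here is a rough sketch. It is
not the glob call from the proposal: the filename is made up, and
cp1252 is only my stand-in for whatever ANSI code page a given Windows
system happens to use. It just shows that a code page lacking U+AB00
collapses it to '?', while UTF-8 round-trips it.

```python
# Hypothetical filename containing U+AB00; not taken from the proposal.
name = "\uab00.txt"

# cp1252 is an assumed stand-in for a Windows ANSI code page that has
# no mapping for U+AB00. errors="replace" substitutes '?' for it.
lossy = name.encode("cp1252", errors="replace")
print(lossy)                      # b'?.txt'

# Decoding the bytes cannot bring the original character back.
roundtrip = lossy.decode("cp1252")
print(roundtrip == name)          # False

# UTF-8 can represent every code point, so nothing is lost.
utf8 = name.encode("utf-8")
print(utf8)                       # b'\xea\xac\x80.txt'
print(utf8.decode("utf-8") == name)  # True
```

Once the '?' is in the bytes path there is no way to tell which
character it replaced, which is why the path is no longer valid.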

