[Python-ideas] Fix default encodings on Windows

Wed Aug 10 20:41:01 EDT 2016

On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower <steve.dower at python.org> wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
>> I'd rather a single consistent default encoding.
>
> I'm proposing to make that single consistent default encoding utf-8. It
> sounds like we're in agreement?

Yes, we are. I was disagreeing with Random's suggestion that mbcs
would also serve. Defaulting to UTF-8 everywhere is (a) consistent on
all systems, regardless of settings; and (b) consistent with
bytes.decode() and str.encode(), both of which default to UTF-8.
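
For reference, a minimal check of point (b). The sample string is just an illustration; nothing here depends on platform settings:

```python
# With no encoding argument, both methods use UTF-8 regardless of
# locale or platform.
text = "na\u00efve caf\u00e9 \u2603"

data = text.encode()          # same as text.encode("utf-8")
round_trip = data.decode()    # same as data.decode("utf-8")

print(data == text.encode("utf-8"))
print(round_trip == text)
```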

>> -0.5. Is there any precedent for this kind of data-based detection
>> being the default? An explicit "utf-sig" could do a full detection,
>> but even then it's not perfect - how do you distinguish UTF-32LE from
>> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
>> assume UTF-16", or do you say "files starting U+0000 are rare, so
>> we'll assume UTF-32"?
>
>
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact BOM
> (which IIRC decodes as a no-op spacing character, though I have a feeling
> some version of Unicode redefined it exclusively for being the marker) or we
> use utf-8.
>
> But the main reason for detecting the BOM is that currently opening files
> with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with
> changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)
>
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for the
> main change.

AIUI the utf-8-sig encoding is happy to decode something that doesn't
have a signature, right? If so, then yes, I would definitely support
that mild mismatch in defaults. Chew up that UTF-8 aBOMination and
just use UTF-8 as is.
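
A quick sketch of the behaviour I'm relying on, using only the stdlib codecs (the "hello" payload is just sample data):

```python
import codecs

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

# utf-8-sig strips the signature if present, and is equally happy
# when there isn't one.
sig_with = with_bom.decode("utf-8-sig")
sig_without = without_bom.decode("utf-8-sig")

# Plain utf-8 leaves the BOM in the text as a leading U+FEFF.
plain_with = with_bom.decode("utf-8")

# On the writing side, plain utf-8 emits no BOM; utf-8-sig adds one.
written_plain = "hello".encode("utf-8")
written_sig = "hello".encode("utf-8-sig")
```

So reading as utf-8-sig and writing as utf-8 accepts both kinds of input and never produces a BOM on output.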

I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
that common compared to UTF-8), so I wouldn't stress too much about
that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
doing, but it could easily be retrofitted later (neither byte sequence
is valid UTF-8, so such files would fail to decode as UTF-8 anyway).
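
If that extension were ever wanted, it might look something like the sketch below. sniff_encoding() and decodes_as_utf8() are hypothetical helpers, not anything in the stdlib, and checking UTF-32 first is just one way of settling the ambiguity I raised above:

```python
import codecs

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def sniff_encoding(head: bytes) -> str:
    """Hypothetical helper: choose a codec from an exact BOM match,
    falling back to utf-8-sig when there is none."""
    # UTF-32 has to be checked first, because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). This sketch
    # assumes UTF-32 in the ambiguous case; the other choice is
    # equally defensible.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8-sig"

# The ambiguous case: a UTF-16LE BOM followed by U+0000 is
# byte-for-byte identical to a UTF-32LE BOM.
ambiguous = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")
```

The "utf-16" and "utf-32" codecs consume the BOM themselves and use it to pick the byte order, so the caller only needs the codec name.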

>>> * force the console encoding to UTF-8 on initialize and revert on
>>> finalize
>>
>>
>> -0 for Python itself; +1 for Python's interactive interpreter.
>> Programs that mess with console settings get annoying when they crash
>> out and don't revert properly. Unless there is *no way* that you could
>> externally kill the process without also bringing the terminal down,
>> there's the distinct possibility of messing everything up.
>
>
> The main problem here is that if the console is not forced to UTF-8 then it
> won't render any of the characters correctly.

Ehh, that's annoying. Is there a way to guarantee, at the process
level, that the console will be returned to "normal state" when Python
exits? If not, there's the risk that people run a Python program and
then the *next* program gets into trouble.

But if that happens only on abnormal termination ("I killed Python
from Task Manager, and it left stuff messed up so I had to close the
console"), it's probably an acceptable risk. And the benefit sounds
well worthwhile. Revising my recommendation to +0.9.
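
To illustrate which cases are covered and which aren't, here's a rough sketch of the force-and-revert pattern. It is not the proposed implementation: forced_setting() and the fake_console dict are hypothetical stand-ins, with 437 and 65001 (the UTF-8 code page) used as sample values.

```python
import contextlib

@contextlib.contextmanager
def forced_setting(get, set_, value):
    """Set something to `value` and put the old value back afterwards.

    `get` and `set_` are hypothetical callables standing in for
    whatever reads and writes the real console state.
    """
    old = get()
    set_(value)
    try:
        yield old
    finally:
        # Runs on normal exit and on exceptions, but NOT if the
        # process is killed outright (e.g. from Task Manager).
        set_(old)

# Stand-in for the console: a dict rather than a real code page.
fake_console = {"cp": 437}

try:
    with forced_setting(lambda: fake_console["cp"],
                        lambda cp: fake_console.__setitem__("cp", cp),
                        65001):
        inside = fake_console["cp"]
        raise RuntimeError("simulated crash")
except RuntimeError:
    pass

after = fake_console["cp"]
```

On a real Windows console, I'd assume the getter and setter would wrap GetConsoleOutputCP and SetConsoleOutputCP. Either way, the finally clause only helps when Python gets to unwind; a hard kill skips it, which is the residual risk I'm accepting.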

ChrisA