[Tutor] Changing the interpreter prompt symbol from ">>>" to ???

Tue Mar 15 15:51:57 EDT 2016

> From: eryksun at gmail.com
> Date: Sun, 13 Mar 2016 13:58:46 -0500
> Subject: Re: [Tutor] Changing the interpreter prompt symbol from ">>>" to ???
> To: tutor at python.org
> CC: sjeik_appie at hotmail.com
> 
> On Sun, Mar 13, 2016 at 3:14 AM, Albert-Jan Roskam
> <sjeik_appie at hotmail.com> wrote:
> > I thought that utf-8 (cp65001) is by definition (or by design?) impossible
> > for console output in windows? Aren't there "w" (wide) versions of functions
> > that do accept utf-8?
> 
> The wide-character API works with the native Windows character
> encoding, UTF-16. Except the console is a bit 'special'. A surrogate
> pair (e.g. a non-BMP emoji) appears as 2 box characters, but you can
> copy it from the console to a rich text application, and it renders
> normally. 

That is very useful to know.
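
For illustration, a minimal Python 3 sketch of why a non-BMP character occupies two console cells: in UTF-16 it is stored as two 16-bit code units (a surrogate pair). The emoji U+1F600 is only an example character chosen here, not one from the original message.

```python
# A non-BMP character (example: U+1F600) needs two UTF-16 code units.
emoji = "\U0001F600"

# Encode without a BOM so every 2 bytes is exactly one code unit.
utf16 = emoji.encode("utf-16-le")

# Split the byte string into 16-bit little-endian code units.
code_units = [
    int.from_bytes(utf16[i:i + 2], "little")
    for i in range(0, len(utf16), 2)
]

print(len(emoji))                        # one code point in Python 3
print([hex(u) for u in code_units])      # high surrogate, low surrogate
```

The console appears to treat each of those two code units as a separate cell, which would explain the two boxes.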

> The console also doesn't support variable-width fonts for
> mixing narrow and wide (East Asian) glyphs on the same screen. If that
> matters, there's a program called ConEmu that hides the console and
> proxies its screen and input buffers to drive an improved interface
> that has flexible font support, ANSI/VT100 terminal emulation, and
> tabs. If you pair that with win_unicode_console, it's almost as good
> as a Linux terminal, but the number of hoops you have to go through to
> make it all work is too complicated.

So Windows uses the following (Western locales):
console: cp437 (OEM codepage)
"bytes": cp1252 (ANSI codepage)
unicode: utf-16-le (is 'mbcs' equivalent to utf-16-*?)

Sheesh, so much room for errors. Why not everything utf-8, like in Linux?
Is cmd.exe so unpopular that Microsoft does not replace it with something better?
I understand that this silly OEM codepage is a historical anomaly, but am I correct
in saying that the use of codepages (with stupid differences such as latin-1 vs cp1252 as a bonus)
is designed to hamper cross-platform compatibility (and force people to stick with Windows)?
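
To make the differences concrete, here is a small Python 3 sketch that encodes the same characters with the codecs listed above. The sample characters are my own choice. As far as I know, Python's 'mbcs' codec refers to the ANSI codepage rather than to UTF-16, and it exists only on Windows, so it is left out to keep the sketch runnable everywhere.

```python
# The same character yields different bytes in each encoding.
sample = "é"
encoded = {
    name: sample.encode(name)
    for name in ("cp437", "cp1252", "utf-16-le", "utf-8")
}
for name, raw in encoded.items():
    print(f"{name:10} {raw!r}")

# latin-1 vs cp1252: they disagree in the 0x80-0x9F range.
euro_cp1252 = "€".encode("cp1252")        # the euro sign exists in cp1252
try:
    "€".encode("latin-1")                 # but not in latin-1
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

# The same byte decoded as latin-1 is a C1 control character instead.
as_latin1 = euro_cp1252.decode("latin-1")
print(euro_cp1252, euro_in_latin1, repr(as_latin1))
```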

> Some people try to use UTF-8 (codepage 65001) in the ANSI API --
> ReadConsoleA/ReadFile and WriteConsoleA/WriteFile. But the console's
> UTF-8 support is dysfunctional. It's not designed to handle it.
> 
> In Windows 7, WriteFile calls WriteConsoleA, which decodes the buffer
> to UTF-16 using the current codepage and returns the number of UTF-16
> 'characters' written instead of the number of bytes. This confuses
> buffered writers. Say it writes a 20-byte UTF-8 string with 2 bytes
> per character. WriteFile returns that it successfully wrote 10
> characters, so the buffered writer tries to write the last 10 bytes
> again. This leads to a trail of garbage text written after every
> write.

Strange. I would have thought it writes the first 10 bytes (5 characters) and that the remaining 10 bytes end up in oblivion.
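
A toy simulation in Python 3 helps me see why it is garbage rather than lost text. This is only a model of the behaviour described above, not the real Windows API: `buggy_console_write` and `buffered_write_all` are hypothetical names, and decoding invalid bytes with errors="replace" is an assumption about how the console handles a slice that starts mid-character.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2


def buggy_console_write(data, screen):
    """Model of the described bug: display ALL the bytes, but report
    the number of UTF-16 code units instead of the number of bytes."""
    text = data.decode("utf-8", errors="replace")
    screen.append(text)
    return utf16_units(text)


def buffered_write_all(data):
    """A typical buffered writer: retry whatever it believes is unwritten."""
    screen = []
    while data:
        written = buggy_console_write(data, screen)
        data = data[written:]   # wrong offset: code units treated as bytes
    return "".join(screen)


payload = ("é" * 10).encode("utf-8")    # 20 bytes, 2 bytes per character
output = buffered_write_all(payload)
print(len(payload), repr(output))
```

In this model the first call already displays all 10 characters, so nothing is lost. The writer then resends bytes it believes were not written, and some of those slices start in the middle of a character, hence the trailing garbage.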

> When a program reads from the console using ReadFile or ReadConsoleA,
> the console's input buffer has to be encoded to the target codepage.
> It assumes that an ANSI character is 1 byte, so if you try to read N
> bytes, it tries to encode N characters. This fails for non-ASCII
> UTF-8, which has 2 to 4 bytes per character. However, it won't
> decrease the number of characters to fit in the N byte buffer. In the
> API the argument is named "nNumberOfCharsToRead", and they're sticking
> to that literally. The result is that 0 bytes are read, which is
> interpreted as EOF. So the REPL will quit, and input() will raise
> EOFError.
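
The read side can be modelled the same way. The sketch below is a toy model of the description above in Python 3, not of the actual console code: `simulated_read` is a hypothetical function, and it takes the "N bytes means N characters" rule literally.

```python
def simulated_read(pending_input, nbytes):
    """Toy model: a request for nbytes is treated as a request for
    nbytes characters. If their UTF-8 encoding does not fit in the
    nbytes buffer, nothing is returned at all."""
    chars = pending_input[:nbytes]
    encoded = chars.encode("utf-8")
    if len(encoded) > nbytes:
        return b""              # 0 bytes read, which a caller takes as EOF
    return encoded


ascii_result = simulated_read("hello", 4)     # 4 characters fit in 4 bytes
utf8_result = simulated_read("héllo", 4)      # "héll" needs 5 bytes

for result in (ascii_result, utf8_result):
    print(repr(result), "-> EOF" if result == b"" else "-> ok")
```

A caller that treats an empty read as end-of-file would stop reading at the second result, which matches the REPL quitting and input() raising EOFError.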