[Tutor] Changing the interpreter prompt symbol from ">>>" to ???

eryk sun eryksun at gmail.com
Tue Mar 15 20:15:03 EDT 2016


On Tue, Mar 15, 2016 at 2:51 PM, Albert-Jan Roskam
<sjeik_appie at hotmail.com> wrote:
>
> So windows uses the following (Western locales):
> console: cp437 (OEM codepage)
> "bytes": cp1252 (ANSI codepage)

The console defaults to the OEM codepage, but you can separately
switch the input and output to different codepages. This is an
exception to the rule, as otherwise the system codepage used in the
[A]NSI API is fixed when the system boots. Changing it requires
modifying the system locale and rebooting.
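
If you want to see these values on your own machine, here is a minimal
sketch using ctypes. The helper name query_codepages is mine, purely
for illustration. It returns None when not run on Windows, and the two
console values come back as 0 if the process has no console attached.

```python
import ctypes
import sys


def query_codepages():
    """Return the system and console codepages, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    return {
        "ansi": kernel32.GetACP(),                   # fixed at boot
        "oem": kernel32.GetOEMCP(),                  # console default
        "console_input": kernel32.GetConsoleCP(),    # 0 if no console
        "console_output": kernel32.GetConsoleOutputCP(),
    }


print(query_codepages())
```

On a Western-locale system I'd expect the ANSI value to be 1252 and the
OEM value to be 437 or 850, but check what your own system reports.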

> unicode: utf-16-le (is 'mbcs' equivalent to utf-16-*?)

The native Unicode encoding of Windows is UTF-16LE. This is what gets
used in the kernel, device drivers, and filesystems. UTF-16 was
created to accommodate the early adopters of 16-bit Unicode, such as
Windows. When you call an ANSI API, such as CreateFileA, the bytes
argument(s) get decoded to UTF-16, and then it calls the corresponding
wide-character function, such as CreateFileW (or maybe a common
internal function, but that's an implementation detail). ANSI is a
legacy API, and it's moving towards deprecation and obsolescence. New
WinAPI functions are often created with only wide-character support.
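
As a rough illustration of that decoding step, the sketch below uses
Python's codecs to mimic what happens to a bytes argument. It assumes
cp1252 is the ANSI codepage; it is not what the API literally runs.

```python
# "café" as an ANSI (cp1252) byte string, as passed to an *A function.
ansi_bytes = b"caf\xe9"

# The ANSI layer decodes the bytes using the system ANSI codepage...
text = ansi_bytes.decode("cp1252")

# ...and the wide-character function receives UTF-16LE code units.
wide_bytes = text.encode("utf-16-le")
print(wide_bytes)

# The same text in the OEM codepage (cp437) uses a different byte, so
# decoding OEM bytes with the ANSI codepage gives the wrong character.
oem_bytes = text.encode("cp437")
print(oem_bytes, ascii(oem_bytes.decode("cp1252")))
```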

MBCS (multibyte character set) refers to encodings that can be used in
the system locale for the [A]NSI API. While technically UTF-8 and
UTF-16 are multibyte encodings, they're not allowed in the legacy ANSI
API. That said, because the console is just plain weird, it allows
setting its input and output codepages to UTF-8, even though the
result is often buggy.

> Sheesh, so much room for errors. Why not everything utf-8, like in linux?

NT was developed before UTF-8 was released, so you're asking the NT
team to invent a time machine. Plus there's nothing really that
horrible about UTF-16. In its favor, it uses only 2 bytes per
character for all characters in the BMP, whereas UTF-8 uses 3 bytes
per character for 61440 out of 63488 characters in the BMP (not
including the surrogate block, U+D800-U+DFFF). In UTF-8's favor, it
encodes ASCII (i.e. ordinals less than 128) in a single byte per
character.
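
You can check those counts yourself. This is a small sketch that
tallies the UTF-8 length of every BMP character, skipping the
surrogate block; the function name is just for illustration.

```python
from collections import Counter


def bmp_utf8_lengths():
    """Count BMP characters by their encoded length in UTF-8."""
    counts = Counter()
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable characters
        counts[len(chr(cp).encode("utf-8"))] += 1
    return counts


counts = bmp_utf8_lengths()
print(dict(counts), sum(counts.values()))
```

Every one of these characters is exactly 2 bytes in UTF-16LE.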

> Is cmd.exe that unpopular that Microsoft does not replace it with something
> better?

cmd.exe is a shell, like powershell.exe or bash.exe, and a console
client application just like python.exe. cmd.exe doesn't host the
console, nor does it have anything to do with the console subsystem
other than being a client of it. When you run python.exe from cmd.exe,
all cmd does is wait for python to exit.

When you run a program that's flagged as a console application, the
system either attaches an inherited console if one exists, or opens
and attaches a new console. The console window is hosted by an
instance of conhost.exe, which also implements the application's
command-line editing, input history buffer (e.g. F7), and input
aliases, separately for each attached executable. This part of the
console API is accessible from the command line via doskey.exe.

cmd.exe is a Unicode application, so codepages aren't generally of
much concern to it, except that it defaults to encoding to the console
codepage when the output of its built-in commands such as "dir" and
"set" is piped to another program. You can force it to use UTF-16 in
this case by running cmd /U. This is just for cmd's internal commands. What
external commands write to a pipe is up to them. For example, Python 3
defaults to using the ANSI codepage when stdio is a pipe. You can
override this via the environment variable PYTHONIOENCODING.
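
To see the override in action, here is a sketch that starts a child
Python with stdout attached to a pipe and asks what encoding it chose.
The helper name child_stdout_encoding is made up for this example, and
the default you get with no override depends on your platform and
Python version.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Report sys.stdout.encoding of a child whose stdout is a pipe."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from the default
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


print("default: ", child_stdout_encoding())
print("override:", child_stdout_encoding("utf-8"))
```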

As to replacing the console, I doubt that will happen. Microsoft has
little incentive to invest in improving/replacing the console and
command-line applications. Windows administration has shifted to
PowerShell scripting and cmdlets.

> am I correct in saying that the use of codepages (with stupid differences
> such as latin-1 vs cp1252 as a bonus) are designed to hamper cross-
> platform compatibility (and force people to stick with windows)?

The difference between codepage 1252 and Latin-1 is historical.
Windows development circa 1990 was following a draft ANSI standard for
character encodings, which later became the ISO 8859-x encodings.
Windows codepages ended up deviating from the ISO standard. Plus the
'ANSI' API also supports MBCS codepages for East-Asian languages. I
can't speak to any nefarious business plans to hamper cross-platform
compatibility. But it seems to me that this is a matter of rushing to
get a product to market without having time to wait for a standards
committee, plus a good measure of arrogance that comes from having
market dominance.
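
For what it's worth, the two encodings only disagree in a small range.
This sketch compares Python's cp1252 and latin-1 codecs byte by byte.
Bytes that Python's cp1252 codec leaves unassigned are counted as
differences.

```python
def differing_bytes():
    """Return byte values that cp1252 and latin-1 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # unassigned in Python's cp1252 codec
        if cp1252 != latin1:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print([hex(value) for value in diffs])
```

Latin-1 maps these bytes to C1 control characters, whereas cp1252 uses
most of them for printable characters such as the euro sign at 0x80.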

> Strange. I would have thought it writes the first 10 bytes (5 characters)
> and that the remaining 10 bytes end up in oblivion.

Maybe a few more details will help to clarify the matter. In Windows
7, WriteFile on a console handle basically calls WriteConsoleA, which
makes a local procedure call (LPC) to SrvWriteConsole in conhost.exe.
This call is flagged as to whether the buffer is ANSI or Unicode
(UTF-16). If it's ANSI, the console first decodes the buffer using
MultiByteToWideChar according to the output screen's codepage. Then it
copies the decoded buffer to the output screen buffer and returns to
the caller how many UTF-16 *code units* it wrote. Maybe that's fine
for WriteConsoleA, but returning that number to a WriteFile caller (a
bytes API) is nonsense. If I write a 20-byte buffer, I need to know
that all 20 bytes were written, not that 10 UTF-16 code units were
written. This causes problems with buffered writers such as Python 3's
BufferedWriter class.
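
Here is a rough simulation of the problem that runs on any platform.
MiscountingConsole is a hypothetical stand-in that I made up for this
example; it is not the console's real code. It decodes what it is
given as UTF-8 and returns the number of characters rather than the
number of bytes, and the sketch then shows how io.BufferedWriter
reacts to the short count.

```python
import io


class MiscountingConsole(io.RawIOBase):
    """Hypothetical raw stream that reports characters, not bytes."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # what would have appeared on screen

    def writable(self):
        return True

    def write(self, data):
        text = bytes(data).decode("utf-8", errors="replace")
        self.chunks.append(text)
        # The bug being simulated: a character count is returned where
        # the caller expects a byte count. (For BMP text, the count of
        # characters equals the count of UTF-16 code units.)
        return len(text)


def simulate(text):
    raw = MiscountingConsole()
    writer = io.BufferedWriter(raw)
    writer.write(text.encode("utf-8"))
    writer.flush()  # retries whatever it thinks is still unwritten
    return raw.chunks


# Ten Greek letters: 10 characters, 20 bytes in UTF-8.
text = "".join(chr(0x3B1 + i) for i in range(10))
for chunk in simulate(text):
    print(ascii(chunk))
```

The first chunk is the complete text. Everything after it is the
buffered writer resending the tail of the buffer because it believes
only part of it was written.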

