Unicode failure

Sat Dec 5 19:19:19 EST 2015

On Sat, Dec 5, 2015 at 4:03 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 12/5/2015 2:44 PM, Random832 wrote:
>> As someone else pointed out, I meant that as a list of codepages
>> which support all Unicode codepoints, not a list of codepoints
>> not supported by Tk's UCS-2.  Sorry, I assumed everyone knew
>> offhand that 65001 was UTF-8
>
> So Microsoft claims, but it is not terribly useful.

Using codepage 65001 is how one encodes/decodes UTF-8 using the
Windows API, i.e. WideCharToMultiByte and MultiByteToWideChar.

If you're just referring to the console, then I agree for the most
part. The console, even in Windows 10, still has two major flaws when
using UTF-8. The biggest problem is that non-ASCII input gets read as
EOF (i.e. 0 bytes read) because of a bug in how conhost.exe (the
process that hosts the console) converts its internal input buffer.
Instead of dynamically determining how many characters to encode based
on the current codepage, it assumes an N byte user buffer is a request
for N characters, which obviously fails with non-ASCII UTF-8. What's
worse is that it doesn't fail the call. It returns to the client that
it successfully read 0 bytes. This causes Python's REPL to quit and
input() to raise EOFError.
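
To make the size mismatch concrete, here is a small sketch in plain Python. It is only the arithmetic, not conhost's actual code, and utf8_sizes is a name I made up for the illustration. It compares the number of characters in a string with the number of bytes its UTF-8 encoding needs:

```python
def utf8_sizes(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

# Escapes keep the source ASCII-only: "na\u00efve" and three CJK characters.
for sample in ("abc", "na\u00efve", "\u65e5\u672c\u8a9e"):
    chars, nbytes = utf8_sizes(sample)
    # A buffer sized for `chars` bytes is too small whenever nbytes > chars.
    print(ascii(sample), chars, nbytes, nbytes > chars)
```

For pure ASCII the two counts agree, which is why the assumption only breaks down once non-ASCII input is involved.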

The second problem that still exists in Windows 10 is that the console
doesn't save state across writes, so a 2-4 byte UTF-8 code sequence
that gets split across two writes due to buffering gets displayed in
the console as 2-4 replacement characters (i.e. U+FFFD). Most POSIX
terminals don't suffer from this problem because they natively use
8-bit strings, whereas Windows transcodes to UTF-16.
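
The effect of not keeping state can be reproduced with Python's own codecs. This is a sketch of the general idea, not of how the console is implemented. The helper names decode_stateless and decode_stateful are mine, and I'm assuming the console's substitution behaves roughly like errors="replace":

```python
import codecs

# U+20AC EURO SIGN is three bytes in UTF-8: e2 82 ac.
data = "\u20ac".encode("utf-8")
# Simulate a buffered writer that splits the sequence across two writes.
chunks = [data[:1], data[1:]]

def decode_stateless(chunks):
    """Decode each chunk on its own, forgetting any partial sequence."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)

def decode_stateful(chunks):
    """Decode with an incremental decoder that carries state across chunks."""
    dec = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = "".join(dec.decode(c) for c in chunks)
    return out + dec.decode(b"", final=True)  # flush any pending bytes

print(ascii(decode_stateless(chunks)))  # only replacement characters
print(ascii(decode_stateful(chunks)))   # the original character
```

The stateless version sees a truncated lead byte followed by stray continuation bytes, so every piece is replaced. The incremental decoder buffers the lead byte until the rest arrives.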

Prior to Windows 8, there's another annoying bug. WriteFile and
WriteConsoleA return the number of wchar_t elements written instead of
the number of bytes written. So a buffered writer will write
successively smaller slices of the output buffer until the two numbers
agree. You end up with a (potentially long) trail of garbage at the
end of every write that contains non-ASCII characters.
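
Here is a rough simulation of that retry loop. It doesn't call WriteFile. buggy_write and write_all are hypothetical stand-ins, and the "console" is just a list that collects what each call would display:

```python
def buggy_write(data, console):
    """Display all of data, but report UTF-16 code units instead of bytes."""
    text = data.decode("utf-8", errors="replace")
    console.append(text)
    return len(text.encode("utf-16-le")) // 2

def write_all(data, console, write=buggy_write):
    """Typical buffered-writer loop: retry with whatever was not 'written'."""
    while data:
        n = write(data, console)
        data = data[n:]  # the writer believes only n bytes went out

console = []
# "h\u00e9llo" is 5 characters but 6 bytes in UTF-8.
write_all("h\u00e9llo".encode("utf-8"), console)
print([ascii(s) for s in console])
```

The first call displays the whole string but reports 5, so the writer sends the "remaining" byte again and a stray "o" is appended. The more non-ASCII characters in the buffer, the longer the trail.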

Since Windows doesn't allow UTF-8 as the system codepage (i.e. the
[A]NSI API), it's probably only by accident that UTF-8 works in the
console at all. Unicode works best (though not perfectly) via the
console's wide-character API. The win-unicode-console package provides
this functionality for Python 2 and 3.

> Currently, on my Win 10 system, 'chcp 65001' results in
> sys.stdout.encoding = 'cp65001', and
>
> for cp in 1200, 1201, 12000, 12001, 65000, 65001, 54936:
>     print(chr(cp))
> running without the usual exception.  But of the above numbers
> mis-interpreted as codepoints, only 1200 and 1201 print anything other than
> a box with ?, whereas IDLE printed 3 other chars for 3 other assigned
> codepoints. If I change the console font to Lucida Console, which I use in
> IDLE, even chr(1200) gives a box.

65000 and 65001 aren't characters. Code points 12000, 12001 and 54936
are East-Asian characters:

    >>> from unicodedata import name, east_asian_width
    >>> for n in (12000, 12001, 54936):
    ...     c = chr(n)
    ...     print(n, east_asian_width(c), name(c))
    ...
    12000 W CJK RADICAL C-SIMPLIFIED EAT
    12001 W CJK RADICAL HEAD
    54936 W HANGUL SYLLABLE HOELS

The console window can't mix narrow glyphs with wide glyphs. Its font
rendering still has mostly the same limitations that it had when it
debuted in Windows NT 3.1 (1993). To display wide CJK glyphs in the
console, set the system locale to an East-Asian region and restart
Windows (what a piece of... cake). The console also stores only one
16-bit wchar_t code per character cell, so a UTF-16 surrogate pair
representing a non-BMP character (e.g. one of the popular emoji
characters) displays as two rectangle glyphs. However, at least the
code values are preserved when copied from the console to a window
that displays UTF-16 text properly.
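
To see why a non-BMP character needs two cells, you can list its UTF-16 code units from Python. This is just an illustration of the encoding, and utf16_units is a helper name I made up:

```python
def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of integers."""
    data = ch.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# U+20AC is in the BMP; U+1F600 (an emoji) is not.
for ch in ("\u20ac", "\U0001F600"):
    print(ascii(ch), [hex(u) for u in utf16_units(ch)])
```

A BMP character yields a single code unit, while the emoji yields a high and a low surrogate, each of which lands in its own character cell.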

Alternatively, use ConEmu [1] to hide the original console and display
its contents in a window that handles text more flexibly. It also
hacks the console API via DLL injection to work around bugs and
provide Xterm emulation.

[1]: http://conemu.github.io