Text-mode apps (Was :Who are the "spacists"?)

Chris Angelico rosuav at gmail.com
Sun Mar 26 14:57:31 EDT 2017


On Mon, Mar 27, 2017 at 5:37 AM, eryk sun <eryksun at gmail.com> wrote:
> On Sun, Mar 26, 2017 at 5:58 PM, Chris Angelico <rosuav at gmail.com> wrote:
>>> The Windows console can render any character in the BMP, but it
>>> requires configuring font linking for fallback fonts. It's Windows, so
>>> of course the supported UTF format is UTF-16. The console's UTF-8
>>> support (codepage 65001) is too buggy to even consider using it.
>>
>> Is it actually UTF-16, or is it UCS-2?
>
> Pedantically speaking it's UCS-2. Console buffers aren't necessarily
> valid UTF-16, i.e. they can have lone surrogate codes or invalid
> surrogate pairs. The way a surrogate code gets rendered depends on the
> font. It could be an empty box, a box containing a question mark, or
> simply empty space. That applies even if it's a valid UTF-16 surrogate
> pair, so the console can't display non-BMP characters such as emojis.
> They can be copied to the clipboard and displayed in another program.

Exactly. So it doesn't support the entire Unicode range, only the
BMP. That restricts its usability for anything other than simple text.
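
For anyone who wants to see what "non-BMP" means in terms of code
units, here's a quick sketch in Python 3. The helper names are my own
invention for illustration, not anything from the Windows API; the
snippet only shows how one astral character becomes two 16-bit units.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    'surrogatepass' is used so that lone surrogates in the input are
    passed through instead of raising UnicodeEncodeError.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_bmp_only(text):
    """True if every code point fits in a single 16-bit code unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


snake = "\U0001F40D"  # a non-BMP character (an emoji)
print(is_bmp_only(snake))                          # False
print([hex(u) for u in utf16_code_units(snake)])   # ['0xd83d', '0xdc0d']
```

A console that renders each 16-bit cell independently sees two
surrogates there, not one character.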

> Windows file systems are also UCS-2. For the most part it's not an
> issue since the source of text and filenames will be valid UTF-16.

I'm actually not sure about that one. Poking around on both Stack
Overflow and MSDN suggests that NTFS does actually use UTF-16, which
implies that lone surrogates should be errors, but I haven't verified
this. In any case, the on-disk encoding is relatively immaterial; it's
the file system *API* encoding that matters, and that means the
CreateFileW function and its friends:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/gg269344(v=exchg.10).aspx

My reading of this is that:

a) The API is defined in terms of the WCHAR type, a 16-bit code unit.
b) Limits are described in terms of "characters" (e.g. a max of 32767
for a path that starts "\\?\")
c) ???
d) Profit??
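
To show why (a) and (b) don't obviously agree, here's a small Python 3
sketch. The function name and the example path are made up for
illustration, and I'm NOT asserting which of the two numbers the
documented limit actually counts; the point is only that they differ
as soon as a non-BMP character turns up.

```python
def wchar_length(path):
    """Number of 16-bit code units needed to hold path as WCHARs.

    Hypothetical helper: encodes to UTF-16-LE and counts 2-byte units.
    'surrogatepass' lets lone surrogates through rather than raising.
    """
    return len(path.encode("utf-16-le", "surrogatepass")) // 2


# Made-up example path containing one non-BMP character.
path = "\\\\?\\C:\\tmp\\\U0001F40D.txt"

print(len(path))           # 16 code points
print(wchar_length(path))  # 17 WCHAR code units
```

So "32767 characters" is ambiguous unless you know whether a
"character" is a code point or a WCHAR.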

I *think* it's the naive (and very common) hybrid of UCS-2 and UTF-16
that says "surrogates are allowed anywhere, and you're allowed to
interpret pairs of them as UTF-16". But that's not a standard
encoding. In actual UCS-2, surrogates are entirely disallowed; in
UTF-16, they *must* be correctly paired.
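
Spelled out as code, the three sets of rules look something like this.
It's a sketch in Python 3 with function names of my own choosing, and
it encodes the definitions as I've described them above rather than
anything lifted from a spec.

```python
def is_surrogate(unit):
    return 0xD800 <= unit <= 0xDFFF


def is_valid_ucs2(units):
    """UCS-2 as described above: no surrogate code units at all."""
    return not any(is_surrogate(u) for u in units)


def is_valid_utf16(units):
    """UTF-16: every high surrogate must be followed by a low one,
    and a low surrogate may not appear on its own."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:          # high (lead) surrogate
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2                          # consume the whole pair
        elif 0xDC00 <= u <= 0xDFFF:        # unpaired low (trail) surrogate
            return False
        else:
            i += 1
    return True


def is_valid_hybrid(units):
    """The 'anything goes' hybrid: any sequence of 16-bit units."""
    return all(0 <= u <= 0xFFFF for u in units)


paired = [0xD83D, 0xDC0D]   # a correctly paired surrogate pair
lone = [0x41, 0xD83D]       # 'A' followed by a lone high surrogate

for name, units in (("paired", paired), ("lone", lone)):
    print(name, is_valid_ucs2(units), is_valid_utf16(units),
          is_valid_hybrid(units))
# paired False True True
# lone False False True
```

The hybrid accepts both, which is exactly why it isn't either of the
standard encodings.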

ChrisA


