[issue45120] Windows cp encodings "UNDEFINED" entries update

Eryk Sun report at bugs.python.org
Fri Sep 17 18:34:22 EDT 2021


Eryk Sun <eryksun at gmail.com> added the comment:

Rafael, I was discussing code_page_decode() and code_page_encode() both as an alternative for compatibility with other programs and also to explore how MultiByteToWideChar() and WideCharToMultiByte() work -- particularly to explain best-fit mappings, which do not roundtrip. MultiByteToWideChar() does not exhibit "best fit" behavior. I don't even know what that would mean in the context of decoding. 

With the exception of one change to code page 1255, the definitions that you're looking to add are just for the C1 controls and private use area codes, which are not meaningful. Windows uses these arbitrary definitions to be able to roundtrip between the system ANSI and Unicode APIs.

Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page codec. For example:

    >>> _winapi.GetACP()
    1252

    >>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
    b'\x81\x8d\x8f\x90\x9d'

Best-fit encode "α" in code page 1252 [1]:

    >>> 'α'.encode('ansi', 'replace')
    b'a'

In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the only change that I think is really worthwhile because the unicode.org data has it wrong. You can get the proper character name for the comment using the unicodedata module:

    >>> print(unicodedata.name('\u05ba'))
    HEBREW POINT HOLAM HASER FOR VAV

I'm +0 in favor of leaving the mappings undefined where Windows completes legacy single-byte code pages by using C1 control codes and private use area codes. It would have been fine if Python's code-page encodings had always been based on the "WindowsBestFit" tables, but only the decoding MBTABLE, since it's reasonable. 

Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would rather that the 'replace' handler for code_page_encode() used the replacement character (U+FFFD) or system default character. But the world is not ideal; the system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python 2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:

    >>> os.listdir(u'.')
    [u'\u03b1']

    >>> os.listdir('.')
    ['a']

---

[1] https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45120>
_______________________________________


More information about the Python-bugs-list mailing list