Python 3.7+ cannot print unicode characters when output is redirected to file - is this a bug?

Eryk Sun eryksun at gmail.com
Sun Nov 13 17:39:55 EST 2022


On 11/13/22, Jessica Smith <12jessicasmith34 at gmail.com> wrote:
> Consider the following code ran in Powershell or cmd.exe:
>
> $ python -c "print('└')"
>>
> $ python -c "print('└')" > test_file.txt
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in
> encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2514' in
> position 0: character maps to <undefined>

If your applications and existing data files are compatible with using
UTF-8, then in Windows 10+ you can modify the administrative regional
settings in the control panel to force using UTF-8. In this case,
GetACP() and GetOEMCP() will return CP_UTF8 (65001), and the reserved
code page constants CP_ACP (0),  CP_OEMCP (1), CP_MACCP (2), and
CP_THREAD_ACP (3) will use CP_UTF8.

You can override this on a per-application basis via the
ActiveCodePage setting in the manifest:

https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activecodepage

In Windows 10, this setting only supports "UTF-8". In Windows 11, it
also supports "legacy" to allow old applications to run on a system
that's configured to use UTF-8.  Setting an explicit locale is also
supported in Windows 11, such as "en-US", with fallback to UTF-8 if
the given locale has no legacy code page.

Note that setting the system to use UTF-8 also affects the host
process for console sessions (i.e. conhost.exe or openconsole.exe),
since it defaults to using the OEM code page (UTF-8 in this case).
Unfortunately, a legacy read from the console host does not support
reading non-ASCII text as UTF-8. For example:

    >>> os.read(0, 6)
    SPĀM
    b'SP\x00M\r\n'

This is a trivial bug in the console host, which stems from the fact
that UTF-8 is a multibyte encoding (1-4 bytes per code), but for some
reason the console team at Microsoft still hasn't fixed it. You can
use chcp.com to set the console's input and output code pages to
something other than UTF-8 if you have to read non-ASCII input in a
legacy console app. By default, this problem doesn't affect Python's
sys.stdin, which internally uses wide-character ReadConsoleW() with
the system's native text encoding, UTF-16LE.


More information about the Python-list mailing list