[issue18713] Enable surrogateescape on stdin and stdout when appropriate

STINNER Victor report at bugs.python.org
Wed Aug 21 12:38:53 CEST 2013


STINNER Victor added the comment:

Currently, Python 3 fails miserabily when it gets a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.

It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé

It fails miserabily when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG=  # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
1-2: ordinal not in range(128)

$ echo "héhé"|LANG= python3 -c 'import sys;
sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py",
line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
1: ordinal not in range(128)

The ls output is not the expected "héhé" string, but it is an issue
with the console output, not the ls program. ls does just write raw
bytes to stdout:

$ ls|hexdump -C
00000000  68 c3 a9 68 c3 a9 2e 74  78 74 0a                 |h..h...txt.|
0000000b

("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')

I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
workaround the issue by forcing the surrogateescape error handler on
stdout:

$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os;
print(", ".join(os.listdir()))'
héhé.txt

Something similar can be done in Python. For example,
test.support.regrtest reopens sys.stdout to set the error handle to
"backslashreplace". Extract of the replace_stdout() function:

    sys.stdout = open(stdout.fileno(), 'w',
        encoding=sys.stdout.encoding,
        errors="backslashreplace",
        closefd=False,
        newline='\n')

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18713>
_______________________________________


More information about the Python-bugs-list mailing list