[issue44774] incorrect sys.stdout.encoding within a io.StringIO buffer
Steven D'Aprano
report at bugs.python.org
Fri Jul 30 07:56:44 EDT 2021
Steven D'Aprano <steve+python at pearwood.info> added the comment:
> I expect sys.stdout to have utf-8 encoding inside the redirect because
> the buffer accepts unicode code points (not bytes)
And the buffer stores unicode code points, not bytes, so why would there
be an encoding?
Just to get this out of the way, in case you are thinking along these
lines: Python strings are not arrays of UTF-8 bytes, the way Go strings
conventionally are. Python strings are arrays of abstract code points.
The specific details will vary from interpreter to interpreter, and from
version to version, but current versions of CPython use a flexible
in-memory representation where the width of each code point (1, 2 or 4
bytes) depends on the string. This is not UTF-8: the code points are
stored as Latin-1, UCS-2, or UTF-32, depending on the widest code point
in the string.
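A rough way to observe this, assuming CPython 3.3 or later (where the
flexible representation applies), is to compare the sizes of two strings
that differ only in length, so that the fixed object header cancels out.
The helper name bytes_per_code_point is made up for this sketch, and
sys.getsizeof reports implementation details that other interpreters
need not share.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate how many bytes CPython uses per code point for `ch`.

    Hypothetical helper: it compares two strings made only of `ch` whose
    lengths differ by `n`, so the fixed object overhead cancels out.
    The result is a CPython implementation detail, not a language rule.
    """
    short = ch * 1
    long = ch * (n + 1)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n


# ASCII, Latin-1, Basic Multilingual Plane, and astral-plane samples.
for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
    width = bytes_per_code_point(ch)
    print(f"U+{ord(ch):04X}: {width} byte(s) per code point")
```

On CPython this should report widths of 1, 1, 2 and 4 bytes. None of
them is a UTF-8 width: in UTF-8, U+00E9 takes 2 bytes and U+20AC takes 3.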
> For some reason, the encoding of a StringIO object is None
Because StringIO objects store strings, not bytes. There is no encoding
involved. The inputs are strings, and the storage is strings.
> which is inconsistent with its semantics: it should be 'utf-8'.
It is completely consistent: the encoding should be None, because there
is no encoding.
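As a small illustration of the difference, the sketch below compares a
StringIO with a TextIOWrapper layered over a BytesIO. The variable names
are arbitrary. Only the second object ever turns text into bytes, so
only it has a meaningful encoding.

```python
import io

# Text in, text stored: no bytes are produced at any point.
text_buffer = io.StringIO()
text_buffer.write("caf\xe9 \u20ac")
print(type(text_buffer.getvalue()).__name__, text_buffer.encoding)

# Text in, bytes stored: the wrapper must encode, so it has an encoding.
byte_buffer = io.BytesIO()
wrapper = io.TextIOWrapper(byte_buffer, encoding="utf-8")
wrapper.write("caf\xe9 \u20ac")
wrapper.flush()  # push the encoded bytes through to the BytesIO
print(wrapper.encoding, byte_buffer.getvalue())
```

The StringIO reports None for its encoding, while the wrapper reports
utf-8 and the BytesIO underneath holds the UTF-8 encoded bytes.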
> I expect the 'encoding' attribute of sys.stdout to have the same value
> inside and outside this redirect.
Why? If you redirect to an actual file using, let's say Mac-Roman
encoding, or ASCII, or UTF-32, or any one of dozens of other encodings,
you should expect the encoding inside the block to reflect the actual
encoding used inside the block.
The encoding outside the block is the encoding used by the original
stdout; the encoding inside the block is the encoding used by the
replacement stdout. Why would you expect them to always be the same?
>>> import sys
>>> from contextlib import redirect_stdout
>>> print("outside:", sys.stdout.encoding)
outside: utf-8
>>> with open("/tmp/junk.txt", "w", encoding="ascii") as f:
...     with redirect_stdout(f):
...         print("inside:", sys.stdout.encoding)
...
>>> with open("/tmp/junk.txt", encoding="ascii") as f:
...     print(f.read())
...
inside: ascii
> It so happens that sys.stdout is an io.StringIO() object inside the
> redirect. The getvalue() method on this object returns a string (not
> a bytes), i.e. a sequence of unicode points.
Exactly. And that is why there is no encoding involved. It is purely a
sequence of Unicode code points, not bytes, and at no point was a
Unicode string encoded to bytes to go to the filesystem.
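To make that concrete, here is a minimal sketch of the redirect under
discussion. The sample text and variable names are my own. The captured
value is a str, and an encoding only enters the picture if you choose to
call encode() on it afterwards, at which point you pick the codec.

```python
import io
import sys
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    print("na\xefve")
    encoding_inside = sys.stdout.encoding  # the StringIO's attribute

captured = buffer.getvalue()  # a str, never bytes
print(repr(captured), encoding_inside)

# Encoding is a separate, explicit step, and the result depends on
# which codec the caller chooses.
as_utf8 = captured.encode("utf-8")
as_latin1 = captured.encode("latin-1")
print(as_utf8, as_latin1)
```

The same captured string yields different bytes under UTF-8 and
Latin-1, which is why no single encoding can be said to belong to the
StringIO itself.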
> StringIO inherits from TextIOBase, which has an 'encoding' attribute.
And StringIO has an encoding attribute because of inheritance, and it is
set to None because there is no actual encoding codec used.
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue44774>
_______________________________________