[issue44774] incorrect sys.stdout.encoding within a io.StringIO buffer

Steven D'Aprano report at bugs.python.org
Fri Jul 30 07:56:44 EDT 2021


Steven D'Aprano <steve+python at pearwood.info> added the comment:

> I expect sys.stdout to have utf-8 encoding inside the redirect because 
> the buffer accepts unicode code points (not bytes)

And the buffer stores unicode code points, not bytes, so why would there 
be an encoding?

Just to get this out of the way, in case you are thinking along these 
lines: unlike Go strings, Python strings are not arrays of UTF-8 bytes. 
Python strings are arrays of abstract code points.

The specific details will vary from interpreter to interpreter, and from 
version to version, but current versions of CPython use a flexible 
in-memory representation where the width of each code point (1, 2 or 4 
bytes) depends on the string. This is not UTF-8: the code points are 
stored as Latin-1, UCS-2, or UTF-32, depending on the widest code point 
in the string.
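
If you want to see that flexible representation for yourself, here is a 
rough sketch. It assumes CPython 3.3 or later, and the helper name 
bytes_per_code_point is made up for this example. It compares the sizes 
of two strings built from the same character so that the fixed object 
overhead cancels out.

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Estimate the per-code-point storage width for strings made of ch.

    Hypothetical helper for illustration; relies on CPython internals.
    """
    # Both strings use the same internal layout, so subtracting their
    # sizes cancels the fixed header and leaves only the character data.
    short, long = ch * 1000, ch * 2000
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


# One sample each from ASCII, Latin-1, the rest of the BMP, and an astral plane.
for ch in ("a", "\u00e9", "\u20ac", "\U0001f600"):
    width = bytes_per_code_point(ch)
    print(f"U+{ord(ch):04X}: {width} byte(s) per code point")
```

Other interpreters are free to store strings differently, so treat the 
numbers as a CPython implementation detail, not a language guarantee.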

> For some reason, the encoding of a StringIO object is None

Because StringIO objects store strings, not bytes. There is no encoding 
involved. The inputs are strings, and the storage is strings.

> which is inconsistent with its semantics: it should be 'utf-8'.

It is completely consistent: the encoding should be None, because there 
is no encoding.
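
A small sketch of what "no encoding" means in practice, using only the 
standard io module: strings go in, strings come out, bytes are refused, 
and a codec only appears when you explicitly ask for bytes.

```python
import io

buf = io.StringIO()
buf.write("caf\u00e9 \u20ac")    # a str goes in
text = buf.getvalue()             # a str comes out; nothing was decoded

print(type(text).__name__, buf.encoding)

# StringIO refuses bytes outright: it has no codec to decode them with.
try:
    buf.write(b"bytes")
except TypeError as exc:
    rejected = True
    print("rejected:", exc)
else:
    rejected = False

# An encoding only matters once you explicitly convert to bytes, and the
# same text gives different byte lengths under different encodings.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
print(len(text), len(as_utf8), len(as_utf16))
```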

> I expect the 'encoding' attribute of sys.stdout to have the same value 
> inside and outside this redirect.

Why? If you redirect to an actual file using, let's say Mac-Roman 
encoding, or ASCII, or UTF-32, or any one of dozens of other encodings, 
you should expect the encoding inside the block to reflect the actual 
encoding used inside the block.

The encoding outside the block is the encoding used by the original 
stdout; the encoding inside the block is the encoding used by the 
replacement stdout. Why would you expect them to always be the same?

>>> import sys
>>> from contextlib import redirect_stdout
>>> print("outside:", sys.stdout.encoding)
outside: utf-8
>>> with open("/tmp/junk.txt", "w", encoding="ascii") as f:
...     with redirect_stdout(f):
...         print("inside:", sys.stdout.encoding)
... 
>>> with open("/tmp/junk.txt", encoding="ascii") as f:
...     print(f.read())
... 
inside: ascii

> It so happens that sys.stdout is an io.StringIO() object inside the 
> redirect.  The getvalue() method on this object returns a string (not 
> a bytes), i.e. a sequence of unicode points.

Exactly. And that is why there is no encoding involved. It is purely a 
sequence of Unicode code points, not bytes, and at no point was a 
Unicode string encoded to bytes to go to the filesystem.
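
To make the contrast concrete, here is a sketch that redirects stdout 
twice: once to a StringIO, and once to a TextIOWrapper sitting on top of 
a BytesIO. The utf-16-le encoding is an arbitrary choice for the example. 
Only the second target ever produces bytes, so only the second one has an 
encoding to report.

```python
import io
import sys
from contextlib import redirect_stdout

# Text-only target: the string is stored as-is and never encoded.
text_target = io.StringIO()
with redirect_stdout(text_target):
    print("na\u00efve")
    inside_text = sys.stdout.encoding

# Byte-backed target: TextIOWrapper encodes on the way down to BytesIO.
# newline="\n" turns off newline translation so the bytes are predictable.
raw = io.BytesIO()
byte_target = io.TextIOWrapper(raw, encoding="utf-16-le", newline="\n")
with redirect_stdout(byte_target):
    print("na\u00efve")
    inside_bytes = sys.stdout.encoding
byte_target.flush()  # push any buffered text through the codec

print("StringIO:", inside_text, repr(text_target.getvalue()))
print("TextIOWrapper:", inside_bytes, raw.getvalue())
```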

> StringIO inherits from TextIOBase, which has an 'encoding' attribute.  

And StringIO has an encoding attribute because of inheritance, and it is 
set to None because no codec is actually used.
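
A quick way to check both halves of that statement. The describe function 
is a hypothetical helper written for this example, not part of the io 
module.

```python
import io


def describe(stream: io.TextIOBase) -> str:
    """Hypothetical helper: say whether a text stream uses a codec."""
    enc = stream.encoding
    return "text only, no codec" if enc is None else f"encodes as {enc}"


text_only = io.StringIO()
byte_backed = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Both are TextIOBase instances, so both have the attribute ...
print(isinstance(text_only, io.TextIOBase), hasattr(text_only, "encoding"))
print(isinstance(byte_backed, io.TextIOBase), hasattr(byte_backed, "encoding"))

# ... but only the byte-backed stream has a codec to name.
print(describe(text_only))
print(describe(byte_backed))
```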

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue44774>
_______________________________________

