[Python-ideas] changing sys.stdout encoding

Rurpy rurpy at yahoo.com
Wed Jun 6 08:17:18 CEST 2012


On 06/05/2012 05:34 PM, Victor Stinner wrote:
> 2012/6/5 Rurpy <rurpy at yahoo.com>:
>> In my first foray into Python3 I've encountered this problem:
>> I work in a multi-language environment.  I've written a number
>> of tools, mostly command-line, that generate output on stdout.
>> Because these tools and their output are used by various people
>> in varying environments, the tools all have an --encoding option
>> to provide output that meets the needs and preferences of the
>> output's ultimate consumers.
> 
> What happens if the specified encoding is different than the encoding
> of the console? Mojibake?

When output is directed to the console, yes.  Would one
expect something else?

> If the output is used as in the input of another program, does the
> other program use the same encoding?

Yes of course (when not misused).  That's why they have 
--encoding options.  (Obviously details vary depending on 
requirements of the various tools.)

> In my experience, using an encoding different than the locale encoding
> for input/output (stdout, environment variables, command line
> arguments, etc.) causes various issues. So I'm curious of your use
> cases.

I gave the use case in my original post:

  + I work in a multi-language environment.  I've written a number 
  + of tools, mostly command-line, that generate output on stdout.
  + Because these tools and their output are used by various people
  + in varying environments, the tools all have an --encoding option
  + to provide output that meets the needs and preferences of the
  + output's ultimate consumers. 

They are often used like:
  ./extractor.py --encoding=euc-jp dataset >somefile
  <send somefile to some user who uses euc-jp data> 

And of course some tools require something similar for 
stdin encodings.
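
As a rough illustration of the option handling (not the actual
extractor.py, whose code isn't shown here), a tool might parse and
validate --encoding along these lines.  The parse_args() helper and
the "dataset" argument name are hypothetical:

```python
import argparse
import codecs


def parse_args(argv):
    """Parse a hypothetical command line that takes an --encoding option."""
    parser = argparse.ArgumentParser(description="Write records to stdout.")
    parser.add_argument("dataset", help="input dataset (illustrative name)")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding to use for output on stdout")
    opts = parser.parse_args(argv)
    # Reject unknown encodings up front instead of failing on first write.
    try:
        codecs.lookup(opts.encoding)
    except LookupError:
        parser.error("unknown encoding: %s" % opts.encoding)
    return opts


opts = parse_args(["--encoding=euc-jp", "dataset"])
```

Checking the name with codecs.lookup() means a typo in --encoding is
reported as a usage error rather than as a traceback later on.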

>> In converting them to Python3, I found the best (if not very
>> pleasant) way to do this in Python3 was to put something like
>> this near the top of each tool[*1]:
>>
>>  import codecs
>>  sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
> 
> In Python 3, you should use io.TextIOWrapper instead of
> codecs.StreamWriter. It's more efficient and has fewer bugs.

Thanks, I'll do that.
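
A minimal sketch of what that replacement might look like, assuming
a small helper (rewrap() is a made-up name) that flushes the old
stream and wraps its underlying binary buffer again.  It is shown on
an in-memory stream so that it does not disturb the real stdout:

```python
import io


def rewrap(stream, encoding):
    """Return a new text stream over stream's binary buffer (hypothetical helper)."""
    stream.flush()  # push out anything already written in the old encoding
    return io.TextIOWrapper(stream.buffer, encoding=encoding,
                            errors=stream.errors,
                            line_buffering=stream.line_buffering)


# Stand-in for sys.stdout: a text wrapper over an in-memory byte buffer.
raw = io.BytesIO()
old = io.TextIOWrapper(raw, encoding="ascii")

# Keep a reference to the old wrapper: if it is garbage collected it
# closes the shared buffer.  For the real stdout, sys.__stdout__ holds
# that reference.
new = rewrap(old, "euc-jp")
new.write("日本語")
new.flush()

# In a real tool this would be roughly:
#     sys.stdout = rewrap(sys.stdout, opts.encoding)
```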

But surely this is a strong argument for encapsulating 
the ability to change (or reinitialize) the std* encodings.

I did a fair amount of searching on the internet (many orders
of magnitude more time than it would have taken to look up
sys.stdout.set_encoding() in the documentation) and *still*
ended up with a suboptimal solution.

>> What I want to be able to put there instead is:
>>
>>  sys.stdout.set_encoding (opts.encoding)
> 
> I don't think that your use case merits a new method on
> io.TextIOWrapper: replacing sys.stdout does work and should be used
> instead. TextIOWrapper is generic and your use case is specific to
> sys.std* streams.
> 
> It would be surprising to change the encoding of an arbitrary file
> after it is opened. At least, I don't see the use case.

I gave a couple that I encountered in the past, in my response 
to Steven Turnbull.  However, now I am more concerned with
just resetting the encoding at the beginning of the program.

> For example, tokenize.open() opens a Python source code file with the
> right encoding. It starts by reading the file in binary mode to detect
> the encoding, and then use TextIOWrapper to get a text file without
> having to reopen the file. It would be possible to start with a text
> file and then change the encoding, but it would be less elegant.

That's a rather different use case than mine, yes?

>>  sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
> 
> You should also flush sys.stdout (and maybe also sys.stdout.buffer)
> before replacing it.
> 
>> It requires the import of the codecs module in programs that
>> otherwise don't need it [*2], and the reading of the codecs docs (not
>> a shining example of clarity themselves) to understand it.
> 
> It's maybe difficult to change the encoding of sys.stdout at runtime
> because it is NOT a good idea :-)

Why would that be?  My tools already do that, they meet 
their usability requirements and I have noticed no ill
effects.  The code (except for the piece I am complaining
about) is about as simple and obvious as it is possible to 
get.  Am I missing something?
 
>> Needing to change the encoding of a sys.std* stream is not an
>> uncommon need and a user should not have to go through the
>> codecs dance above to do so IMO.
> 
> Replacing sys.std* works but has issues: output written before the
> replacement is encoded to a different encoding for example. The best
> way is to change your locale encoding (using LC_ALL, LC_CTYPE or LANG
> environment variable on UNIX), or simply to set PYTHONIOENCODING
> environment variable.

Those solutions are not only NOT the best solution (IMO) -- 
they are completely unacceptable.

If I had to build my programs as shell scripts that manipulate 
environment variables before calling my Python program, I would 
dump Python for some other language. 
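
For reference, here is a sketch of what the PYTHONIOENCODING route
amounts to when driven from Python instead of a shell script.  The
child command is only a stand-in that reports the encoding it was
given; a real tool would be launched in its place:

```python
import codecs
import os
import subprocess
import sys

# Copy the current environment and request euc-jp for the child's std* streams.
env = dict(os.environ, PYTHONIOENCODING="euc-jp")

# Stand-in child program: report the encoding Python chose for its stdout.
child = "import sys; sys.stdout.write(sys.stdout.encoding)"
result = subprocess.run([sys.executable, "-c", child],
                        env=env, stdout=subprocess.PIPE, check=True)

reported = result.stdout.decode("ascii")
# Compare normalized codec names, since the spelling of the name may differ.
same = codecs.lookup(reported).name == codecs.lookup("euc-jp").name
```

The encoding has to be fixed before the interpreter starts, which is
why this approach needs a wrapper or a restart of the program.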
 
>> [*1] There are other ways to change stdout's encoding but they
>>  all have problems AFAICT.  PYTHONIOENCODING can't easily be
>>  changed dynamically within program.
> 
> Ah? Detect if PYTHONIOENCODING is present (or if sys.stdout.encoding
> is the requested encoding), if not: restart the program with
> PYTHONIOENCODING=encoding.

For what I need to do (print() to sys.stdout with a different
encoding than what Python guessed I'd want), your proposal seems
absurdly convoluted to me.

sys.stdout is set to encoding A.  I want it to write using
encoding B.  The obvious, simplest, most desirable solution
(barring technical difficulties) is to just change the encoding.
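
Python 3.7 and later, which postdate this message, provide
io.TextIOWrapper.reconfigure(), which does essentially this.  A small
sketch on an in-memory stream standing in for sys.stdout:

```python
import io

raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="ascii")  # stream set to "encoding A"

out.reconfigure(encoding="euc-jp")             # switch to "encoding B" in place
out.write("日本語")
out.flush()

# On the real stream this would be roughly:
#     sys.stdout.reconfigure(encoding=opts.encoding)
```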

>>  Overloading print() is obscure
>>  because it requires reader to notice print was overloaded.
> 
> Why not writing the output into a file, instead of stdout?

Because the interface for these tools already exists and
the users of the tools are happy with them the way they are.

And even if that weren't the case, it is not the role of a 
general purpose programming language to say a standard convention 
such as file redirection should be relegated to second-class 
status simply because the programmer needs a different output 
encoding than the language designers thought he would.



