[Python-Dev] Unicode input issues

Guido van Rossum guido@python.org
Mon, 10 Apr 2000 11:38:58 -0400


> > Finally, I believe we need a way to discover the encoding used by
> > stdin or stdout.  I have to admit I know very little about the file
> > wrappers that Marc wrote -- is it easy to get the encoding out of
> > them? 
> 
> I'm not sure what you mean: the name of the input encoding ?
> Currently, only the names of the encoding and decoding functions
> are available to be queried.

Whatever is helpful for a module or program that wants to know what
kind of encoding is used.
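
As a rough illustration of the kind of query meant here, the sketch below
uses facilities from Python versions released well after this thread: text
streams expose an `encoding` attribute, and the object returned by
codecs.EncodedFile carries `data_encoding` and `file_encoding` attributes.
The helper name `describe_encoding` is hypothetical, and nothing here
reflects what the wrappers offered at the time of writing.

```python
import codecs
import io


def describe_encoding(stream):
    """Return whatever encoding names the stream advertises (hypothetical helper).

    Only attributes are read; any that are missing are reported as None.
    """
    return {
        "encoding": getattr(stream, "encoding", None),
        "data_encoding": getattr(stream, "data_encoding", None),
        "file_encoding": getattr(stream, "file_encoding", None),
    }


# A text stream created with an explicit encoding.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

# A recoding wrapper: callers see UTF-8, the underlying file holds Latin-1.
recoded_stream = codecs.EncodedFile(io.BytesIO(), "utf-8", "latin-1")

print(describe_encoding(text_stream))
print(describe_encoding(recoded_stream))
```

A program could apply the same attribute lookup to sys.stdin or sys.stdout
and fall back to a default when the answer is None.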

> > IDLE should probably emulate this, as its encoding is clearly
> > UTF-8 (at least when using Tcl 8.1 or newer).
> 
> It should be possible to redirect sys.stdin/stdout using
> the codecs.EncodedFile wrapper. Some tests show that raw_input()
> doesn't seem to use the redirected sys.stdin though...
> 
> >>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
> >>> s = raw_input()
> äöü
> >>> s
> '\344\366\374'
> >>> s = sys.stdin.read()
> äöü
> >>> s
> '\303\244\303\266\303\274\012'

This deserves more looking into.  The code for raw_input() in
bltinmodule.c certainly *tries* to use sys.stdin.  (I think that
because your EncodedFile object is not a real stdio file object, it
will take the second branch, near the end of the function; this calls
PyFile_GetLine() which attempts to call readline().)

Aha!  It actually seems that your read() and readline() are
inconsistent!

I don't know your API well enough to know which string is "correct"
(\344\366\374 or \303\244\303\266\303\274) but when I call
sys.stdin.readline() I get the same as raw_input() returns:

  >>> from codecs import *
  >>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
  >>> s = raw_input()
  äöü
  >>> s
  '\344\366\374'
  >>> s = sys.stdin.read()
  äöü
  >>> 
  >>> s
  '\303\244\303\266\303\274\012'
  >>> unicode(s)
  u'\344\366\374\012'
  >>> s = sys.stdin.readline()
  äöü
  >>> s
  '\344\366\374\012'
  >>>

Didn't you say that your wrapper only wraps read()?  Maybe you need to
revise that decision!
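
The consistency being asked for can be written as a small check. This is
only a sketch under stated assumptions: it runs against the codecs module
of a present-day Python rather than the wrapper discussed in this thread,
it feeds the wrapper from an in-memory buffer instead of a terminal, and
the helper name `recoded` is made up for the example.

```python
import codecs
import io

# The Latin-1 bytes for the line typed in the session above
# ('\344\366\374\012' in octal notation).
LATIN1_INPUT = b"\xe4\xf6\xfc\n"


def recoded(method_name):
    """Call one read method on a fresh Latin-1 -> UTF-8 wrapper (hypothetical helper)."""
    wrapper = codecs.EncodedFile(io.BytesIO(LATIN1_INPUT), "utf-8", "latin-1")
    return getattr(wrapper, method_name)()


via_read = recoded("read")
via_readline = recoded("readline")

# A consistent wrapper hands back the same recoded bytes on both paths.
print(via_read == via_readline)
```

If readline() bypassed the recoding, as the session above suggests, the two
values would differ in the same way '\344\366\374\012' differs from
'\303\244\303\266\303\274\012'.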

(Note that PyShell doesn't even define read() -- it only defines
readline().)
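
To illustrate why readline() is the method that matters for line input,
here is a minimal sketch in which sys.stdin is replaced by an object that
defines readline() and nothing else. It assumes a present-day Python, where
input() plays the role raw_input() has in this thread; the class
`ReadlineOnlyStdin` and the helper `read_one_line` are hypothetical names,
not PyShell code.

```python
import sys


class ReadlineOnlyStdin:
    """Hypothetical stand-in for sys.stdin that offers readline() only."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.calls = 0  # how many times readline() was invoked

    def readline(self):
        self.calls += 1
        # An empty string signals end of input.
        return self._lines.pop(0) if self._lines else ""


def read_one_line(fake_stdin):
    """Run input() against fake_stdin, then restore the real sys.stdin."""
    saved = sys.stdin
    sys.stdin = fake_stdin
    try:
        return input()
    finally:
        sys.stdin = saved


fake = ReadlineOnlyStdin(["hello\n"])
line = read_one_line(fake)
print(repr(line), fake.calls)
```

Because the replacement object is not a real console file, the line is
fetched through its readline() method, so a wrapper that recodes only in
read() would leave this path untouched.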

--Guido van Rossum (home page: http://www.python.org/~guido/)