[Python-Dev] Unicode input issues
Guido van Rossum
guido@python.org
Mon, 10 Apr 2000 11:38:58 -0400
> > Finally, I believe we need a way to discover the encoding used by
> > stdin or stdout. I have to admit I know very little about the file
> > wrappers that Marc wrote -- is it easy to get the encoding out of
> > them?
>
> I'm not sure what you mean: the name of the input encoding ?
> Currently, only the names of the encoding and decoding functions
> are available to be queried.
Whatever is helpful for a module or program that wants to know what
kind of encoding is used.
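To make the request concrete, here is a minimal sketch of such a query. It assumes a file-like object advertises its encoding through an attribute named encoding, as text streams do in later Python versions (this is Python 3 code). The helper name stream_encoding is hypothetical and not part of Marc's wrappers.

```python
import io


def stream_encoding(stream):
    """Return the encoding a stream advertises, or None if it has none.

    Hypothetical helper: it only assumes an optional 'encoding' attribute.
    """
    return getattr(stream, "encoding", None)


# A text wrapper knows its encoding; a raw byte stream does not.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
byte_stream = io.BytesIO()

print(stream_encoding(text_stream))
print(stream_encoding(byte_stream))
```

A module could then call stream_encoding(sys.stdin) and fall back to a default when it gets None.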
> > IDLE should probably emulate this, as its encoding is clearly
> > UTF-8 (at least when using Tcl 8.1 or newer).
>
> It should be possible to redirect sys.stdin/stdout using
> the codecs.EncodedFile wrapper. Some tests show that raw_input()
> doesn't seem to use the redirected sys.stdin though...
>
> >>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
> >>> s = raw_input()
> äöü
> >>> s
> '\344\366\374'
> >>> s = sys.stdin.read()
> äöü
> >>> s
> '\303\244\303\266\303\274\012'
This deserves a closer look. The code for raw_input() in
bltinmodule.c certainly *tries* to use sys.stdin. (I think that
because your EncodedFile object is not a real stdio file object, it
will take the second branch, near the end of the function; this calls
PyFile_GetLine() which attempts to call readline().)
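A rough Python sketch of that second branch, as I understand it: read one line through the object's readline(), strip the trailing newline, and raise EOFError on an empty result. The names raw_input_fallback and ReadlineOnly are hypothetical. This is written for Python 3 and only approximates what the C code does.

```python
def raw_input_fallback(stream):
    """Hypothetical stand-in for the non-stdio branch of raw_input()."""
    line = stream.readline()
    if not line:
        # An empty string from readline() means end of file.
        raise EOFError("EOF when reading a line")
    # Drop the trailing newline, as raw_input() does.
    return line[:-1] if line.endswith("\n") else line


class ReadlineOnly:
    """File-like object that defines readline() but no read()."""

    def __init__(self, lines):
        self._lines = list(lines)

    def readline(self):
        return self._lines.pop(0) if self._lines else ""


print(raw_input_fallback(ReadlineOnly(["hello\n"])))  # hello
```

The point is that only readline() is ever called on this path, so whatever read() does to the data has no effect on what raw_input() returns.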
Aha! It actually seems that your read() and readline() are
inconsistent!
I don't know your API well enough to know which string is "correct"
(\344\366\374 or \303\244\303\266\303\274) but when I call
sys.stdin.readline() I get the same as raw_input() returns:
>>> import sys
>>> from codecs import *
>>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
>>> s = raw_input()
äöü
>>> s
'\344\366\374'
>>> s = sys.stdin.read()
äöü
>>>
>>> s
'\303\244\303\266\303\274\012'
>>> unicode(s)
u'\344\366\374\012'
>>> s = sys.stdin.readline()
äöü
>>> s
'\344\366\374\012'
>>>
Didn't you say that your wrapper only wraps read()? Maybe you need to
revise that decision!
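The effect can be reproduced with a small stand-in, assuming a wrapper that recodes data in read() and passes every other method straight through to the underlying file. The class ReadOnlyRecoder is hypothetical and is not the codecs implementation. The sketch uses Python 3 byte strings, with latin-1 as the file encoding and utf-8 as the data encoding, matching the session above.

```python
import io


class ReadOnlyRecoder:
    """Hypothetical wrapper that recodes read() only."""

    def __init__(self, stream, data_encoding, file_encoding):
        self.stream = stream
        self.data_encoding = data_encoding
        self.file_encoding = file_encoding

    def read(self, size=-1):
        # Decode from the file's encoding, re-encode for the caller.
        raw = self.stream.read(size)
        return raw.decode(self.file_encoding).encode(self.data_encoding)

    def __getattr__(self, name):
        # Everything else, including readline(), is passed through untouched.
        return getattr(self.stream, name)


latin1_input = "äöü\n".encode("latin-1")

via_readline = ReadOnlyRecoder(io.BytesIO(latin1_input), "utf-8", "latin-1").readline()
via_read = ReadOnlyRecoder(io.BytesIO(latin1_input), "utf-8", "latin-1").read()

print(via_readline)  # b'\xe4\xf6\xfc\n'
print(via_read)      # b'\xc3\xa4\xc3\xb6\xc3\xbc\n'
```

The two results are the same byte sequences as the octal strings in the session: readline() hands back the raw Latin-1 bytes, while read() returns them recoded as UTF-8.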
(Note that PyShell doesn't even define read() -- it only defines
readline().)
--Guido van Rossum (home page: http://www.python.org/~guido/)