[Python-Dev] Unicode literals in Python 2.7
Adam Bartoš
drekin at gmail.com
Fri May 1 11:43:09 CEST 2015
On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:
> Adam Bartoš writes:
>
> > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
> > sys.std* streams are created with utf-8 encoding (which doesn't
> > help on Windows since they still don't use ReadConsoleW and
> > WriteConsoleW to communicate with the terminal) and after changing
> > the sys.std* streams to the fixed ones and setting readline hook,
> > it still doesn't work,
>
> I don't see why you would expect it to work: either your code is
> bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
> matter, or you're feeding already decoded text *as UTF-8* to your
> module which evidently expects something else (UTF-16LE?).
>
I'll describe my picture of the situation, which might be terribly wrong.
On Linux, in a typical situation, we have a UTF-8 terminal,
PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input
from a user the tokenizer calls PyOS_Readline, which calls GNU readline.
The user is prompted with >>> , can use autocompletion and so on while typing,
and enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as
char* or something), which is UTF-8 encoded input from the user. The
tokenizer, parser, and evaluator process the input and the result is
u'\u03b1', which is printed as an answer.
In my case I install custom sys.std* objects and a custom readline hook.
Again, the tokenizer calls PyOS_Readline, which calls my readline hook,
which calls sys.stdin.readline(), which returns the Unicode string the user
entered (it was actually decoded from UTF-16-LE bytes). My readline hook
encodes this string to UTF-8 and returns it. So the situation is the same.
The tokenizer gets b"u'\xce\xb1'" as before, but now it results in
u'\xce\xb1'.
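A minimal sketch of the transcoding step described above, to check that both paths really hand the tokenizer the same bytes. The name hook_output is hypothetical and only stands in for the decode/encode part of the custom readline hook; it does not touch the console or PyOS_Readline.

```python
def hook_output(console_bytes):
    """Hypothetical stand-in for the hook's transcoding step."""
    # What sys.stdin.readline() would return: text decoded from UTF-16-LE.
    text = console_bytes.decode('utf-16-le')
    # What the hook hands back to PyOS_Readline: the same text as UTF-8.
    return text.encode('utf-8')


typed = u"u'\u03b1'"                       # the user types u'<GREEK SMALL ALPHA>'
from_console = typed.encode('utf-16-le')   # bytes as a UTF-16-LE console delivers them
from_readline = typed.encode('utf-8')      # bytes GNU readline returns on a UTF-8 terminal

# Both paths give the tokenizer identical bytes.
assert hook_output(from_console) == from_readline
assert from_readline == b"u'\xce\xb1'"
```

So the bytes are identical in both cases; whatever makes the result differ must be in how those bytes are interpreted afterwards, not in the hook's output.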
Why is the result different? I thought that in the first case
PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I
thought that PYTHONIOENCODING=utf-8 is the thing that also sets
PyCF_SOURCE_IS_UTF8.
> > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
>
> I don't think that flag does what you think it does. AFAICT from
> looking at the source, that flag gets unconditionally set in the
> execution context for compile, eval, and exec, and it is checked in
> the parser when creating an AST node. So it looks to me like it
> asserts that the *internal* representation of the program is UTF-8
> *after* transforming the input to an internal representation (doing
> charset decoding, removing comments and line continuations, etc).
>
I thought it might do what I want because of the behaviour of eval. I
thought that the PyUnicode_AsUTF8String call in eval just encodes the
passed unicode to UTF-8, so the situation looks as follows:
eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8
not set) -> u'\xce\xb1'
But of course, this picture of mine might be wrong.
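The two outcomes above can be reproduced outside the interpreter by decoding the same source bytes in two ways. This is only a model of the observed results, assuming the "flag not set" case behaves as if each byte were one code point (Latin-1); it is not the parser's actual code path.

```python
source_bytes = b"u'\xce\xb1'"   # the bytes the tokenizer receives in both cases

# Bytes understood as UTF-8: the two bytes form one character, U+03B1.
as_utf8 = source_bytes.decode('utf-8')
# Bytes taken one per code point (Latin-1): two characters, U+00CE and U+00B1.
as_latin1 = source_bytes.decode('latin-1')

# Strip the u'...' wrapper to get the value of the literal in each reading.
value_utf8 = as_utf8[2:-1]
value_latin1 = as_latin1[2:-1]

assert value_utf8 == u'\u03b1'
assert value_latin1 == u'\xce\xb1'
```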
> > Well, the received text comes from sys.stdin and its encoding is
> > known.
>
> How? You keep asserting this. *You* know, but how are you passing
> that information to *the Python interpreter*? Guido may have a time
> machine, but nobody claims the Python interpreter is telepathic.
>
I thought that the Python interpreter knows the input comes from sys.stdin
at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject
the encoding for the tokenizer is inferred from sys.stdin.encoding. But
this is actually the case only in Python 3. So I was wrong.
> > Yes. In the latter case, eval has no idea how the bytes given are
> > encoded.
>
> Eval *never* knows how bytes are encoded, not even implicitly. That's
> one of the important reasons why Python 3 was necessary. I think you
> know that, but you don't write like you understand the implications
> for your current work, which makes it hard to communicate.
>
Yes, eval never knows how bytes are encoded. But I meant it in comparison
with the first case where a Unicode string was passed.