[Python-Dev] Unicode literals in Python 2.7

Adam Bartoš drekin at gmail.com
Fri May 1 11:43:09 CEST 2015


On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:

> Adam Bartoš writes:
>
>  > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
>  > sys.std* streams are created with utf-8 encoding (which doesn't
>  > help on Windows since they still don't use ReadConsoleW and
>  > WriteConsoleW to communicate with the terminal) and after changing
>  > the sys.std* streams to the fixed ones and setting readline hook,
>  > it still doesn't work,
>
> I don't see why you would expect it to work: either your code is
> bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
> matter, or you're feeding already decoded text *as UTF-8* to your
> module which evidently expects something else (UTF-16LE?).
>

I'll describe my picture of the situation, which might be terribly wrong.
On Linux, in a typical situation, we have a UTF-8 terminal,
PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input
from a user the tokenizer calls PyOS_Readline, which calls GNU readline.
The user is prompted >>> , during the input he can use autocompletion and
everything and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as
char* or something), which is UTF-8 encoded input from the user. The
tokenizer, parser, and evaluator process the input and the result is
u'\u03b1', which is printed as an answer.

In my case I install custom sys.std* objects and a custom readline hook.
Again, the tokenizer calls PyOS_Readline, which calls my readline hook,
which calls sys.stdin.readline(), which returns the Unicode string the user
entered (it was actually decoded from UTF-16-LE bytes). My readline hook
encodes this string to UTF-8 and returns it. So the situation is the same.
The tokenizer gets b"u'\xce\xb1'" as before, but now it results in
u'\xce\xb1'.
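
To make the hook's job concrete, here is a minimal sketch of the
decode-then-re-encode step described above. The function name
readline_hook_sketch and the simulated console stream are assumptions made
for illustration; this is not the actual hook, and it does not call
PyOS_Readline or touch the real console.

```python
import io


def readline_hook_sketch(text_stream):
    """Hypothetical stand-in for the custom readline hook.

    Reads one line of already-decoded text and returns it as UTF-8 bytes,
    which is what the tokenizer would receive.
    """
    line = text_stream.readline()   # a Unicode string
    return line.encode("utf-8")     # bytes handed back to the caller


# Simulate console input that arrives as UTF-16-LE bytes.
console_bytes = u"u'\u03b1'\n".encode("utf-16-le")
fake_stdin = io.TextIOWrapper(
    io.BytesIO(console_bytes), encoding="utf-16-le", newline="\n"
)

result = readline_hook_sketch(fake_stdin)
print(repr(result))
```

The returned bytes are the same UTF-8 sequence that GNU readline would
hand over on a UTF-8 terminal, which is why the two situations look
identical from the tokenizer's side.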

Why is the result different? I thought that in the first case
PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I
thought that PYTHONIOENCODING=utf-8 is the thing that also sets
PyCF_SOURCE_IS_UTF8.



>  > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
>
> I don't think that flag does what you think it does.  AFAICT from
> looking at the source, that flag gets unconditionally set in the
> execution context for compile, eval, and exec, and it is checked in
> the parser when creating an AST node.  So it looks to me like it
> asserts that the *internal* representation of the program is UTF-8
> *after* transforming the input to an internal representation (doing
> charset decoding, removing comments and line continuations, etc).
>

I thought it might do what I want because of the behaviour of eval. I
thought that the PyUnicode_AsUTF8String call in eval just encodes the
passed unicode to UTF-8, so the situation is as follows:
eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8
not set) -> u'\xce\xb1'
But of course, this picture of mine might be wrong.
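
The two outcomes above can be modelled without the interpreter's
internals. The sketch below is only a model of the picture described
here, not CPython's code path: decode_literal_body and its
source_is_utf8 parameter are hypothetical names, and using Latin-1 for the
"flag not set" case is an assumption chosen because it reproduces the
observed u'\xce\xb1'.

```python
def decode_literal_body(raw_bytes, source_is_utf8):
    """Hypothetical model of how the bytes inside u'...' become text.

    source_is_utf8=True  : the bytes are treated as UTF-8.
    source_is_utf8=False : each byte maps to the code point with the same
                           value (equivalent to Latin-1 decoding).
    """
    encoding = "utf-8" if source_is_utf8 else "latin-1"
    return raw_bytes.decode(encoding)


body = b"\xce\xb1"  # UTF-8 encoding of GREEK SMALL LETTER ALPHA

with_flag = decode_literal_body(body, source_is_utf8=True)
without_flag = decode_literal_body(body, source_is_utf8=False)

print(repr(with_flag), len(with_flag))        # one character
print(repr(without_flag), len(without_flag))  # two characters
```

The same two bytes give either one character or two, depending only on
which encoding the consumer assumes.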


>  > Well, the received text comes from sys.stdin and its encoding is
>  > known.
>
> How?  You keep asserting this.  *You* know, but how are you passing
> that information to *the Python interpreter*?  Guido may have a time
> machine, but nobody claims the Python interpreter is telepathic.
>

I thought that the Python interpreter knows the input comes from sys.stdin
at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject
the encoding for the tokenizer is inferred from sys.stdin.encoding. But
this is actually the case only in Python 3. So I was wrong.


>  > Yes. In the latter case, eval has no idea how the bytes given are
>  > encoded.
>
> Eval *never* knows how bytes are encoded, not even implicitly.  That's
> one of the important reasons why Python 3 was necessary.  I think you
> know that, but you don't write like you understand the implications
> for your current work, which makes it hard to communicate.
>

Yes, eval never knows how bytes are encoded. But I meant it in comparison
with the first case where a Unicode string was passed.