eval and unicode

Thu Mar 20 18:39:20 EDT 2008

On Mar 20, 2:20 pm, Laszlo Nagy <gand... at shopzeus.com> wrote:
>
> >>  >>> eval( u'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' ) == eval( '"徹底し
> >> たコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
> >> True
>
> > When you feed your unicode data into eval(), it doesn't have any
> > encoding or decoding work to do.
>
> Yes, but what about
>
> eval( 'u' + '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
>

Let's take it apart, bit by bit:

'u' - A byte string with one byte, which is 117

'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' - A byte string starting with " (34),
but then continuing in an unspecified byte sequence. I don't know what
encoding your terminal/file/whatnot is written in. Assuming it is in
UTF-8 and not UTF-16, then it would be the UTF-8 representation of the
unicode code points that follow.

Before you pass it to eval, you concatenate them. So now
you have a byte string that starts with u, then ", then bytes with
values of 128 or above.
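
To make the byte view concrete, here is a minimal sketch. It is written
in Python 3 syntax (my assumption, so that it runs on a current
interpreter), where b'...' plays the role of the Python 2 byte string
discussed here. It also assumes the text is encoded as UTF-8, and it
uses a shortened literal for brevity.

```python
# Python 3 sketch: b'u' stands in for the Python 2 byte string 'u'.
prefix = b'u'

# Assumption: the terminal/file encoding is UTF-8.
literal = '"徹底"'.encode('utf-8')

# Concatenation happens on bytes, before anything is evaluated.
source = prefix + literal

print(list(prefix))      # the single byte value of 'u'
print(source[:2])        # the two ASCII bytes: u and "
print(source[2] >= 128)  # the first byte of the encoded text is non-ASCII
```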

Now, when you call eval, you pass in that byte string.
This byte string, it is important to emphasize, is not text. It is
text encoded in some format. Here is what my interpreter does (in a
UTF-8 console):

>>> u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd
\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e
\u0432\u0430'

The first item in the sequence is \u5fb9 -- a unicode code point. It
is NOT a byte.

>>> eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f
\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b
\xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a
\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe
\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This IS a byte. This is NOT a
unicode code point. It doesn't represent anything except what you want
it to represent.

>>> eval( 'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f
\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b
\xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a
\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe
\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This is NOT a byte. This is a
unicode code point: LATIN SMALL LETTER A WITH RING ABOVE.
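
The output above is consistent with each byte having been mapped to the
code point of the same numeric value, which is what a Latin-1 decode
does. A minimal sketch of that effect, in Python 3 syntax (an
assumption; the session here is Python 2), using only the first
character of the string:

```python
import unicodedata

text = '徹'                   # one code point, U+5FB9
raw = text.encode('utf-8')    # three bytes: e5 be b9

# Assumption: treating every byte as its own code point is modeled
# here with a Latin-1 decode, which maps byte N to code point N.
mis = raw.decode('latin-1')

first_name = unicodedata.name(mis[0])

print(len(text), len(raw), len(mis))
print(first_name)
```

One real character becomes three unrelated code points, and the first
of them is the letter named above.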

>>> eval( u'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd
\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e
\u0432\u0430'

The first item in the sequence is \u5fb9, which is a unicode code point.

In the Python program file proper, if you have your encoding set up
properly, the expression

  u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"

is a perfectly valid expression. What happens is the Python
interpreter reads in that string of bytes between the quotes,
decodes them to unicode based on the encoding you already
specified, and creates a unicode object to represent that.

eval doesn't muck with encodings.

I'll try to address your points below in the context of what I just
wrote.

> The passed expression is not unicode. It is a "normal" string. A
> sequence of bytes.

Yes.

> It will be evaluated by eval, and eval should know
> how to decode the byte sequence.

You think eval is smarter than it is.

> Same way as the interpreter need to
> know the encoding of the file when it sees the u"徹底したコスト削減
> ÁÍŰŐÜÖÚÓÉ трирова" byte sequence in a python source file - before
> creating the unicode instance, it needs to be decoded (or not, depending
> on the encoding of the source).
>

Precisely. And it is. Before it is passed to eval/exec/whatever.

> String passed to eval IS python source, and it SHOULD have an encoding
> specified (well, unless it is already a unicode string, in that case
> this magic is not needed).
>

If it had an encoding specified, YOU should have decoded it and passed
in the unicode string.
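
A minimal sketch of "decode first, then eval", in Python 3 syntax (an
assumption; in the Python 2 of this thread the decoded value would be a
unicode object). The name raw_source and its contents are hypothetical
stand-ins for bytes you received from somewhere, and UTF-8 is assumed to
be the encoding you know them to be in.

```python
# Hypothetical input: the bytes of a quoted string literal, in UTF-8.
raw_source = '"ÁÍŰŐ трирова"'.encode('utf-8')

# The caller knows the encoding, so the caller decodes.
source_text = raw_source.decode('utf-8')

# eval now receives text, so there is nothing left to guess.
value = eval(source_text)

print(value == 'ÁÍŰŐ трирова')
```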

> Consider this:
>
> exec("""
> import codecs
> s = u'Ű'
> codecs.open("test.txt","w+",encoding="UTF8").write(s)
> """)
>
> Facts:
>
> - source passed to exec is a normal string, not unicode
> - the variable "s", created inside the exec() call will be a unicode
> string. However, it may be Û or something else, depending on the
> source encoding. E.g. ASCII encoding it is invalid and exec() should
> raise a SyntaxError like:
>
> SyntaxError: Non-ASCII character '\xc5' in file c:\temp\aaa\test.py on
> line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
>
> Well at least this is what I think. If I'm not right then please explain
> why.
>

If you want to know what happens, you have to try it. Here's what
happens (again, in my UTF-8 terminal):

>>> exec("""
... import codecs
... s = u'Ű'
... codecs.open("test.txt","w+",encoding="UTF8").write(s)
... """)
>>> s
u'\xc5\xb0'
>>> print s
Å°
>>> file('test.txt').read()
'\xc3\x85\xc2\xb0'
>>> print file('test.txt').read()
Å°

Note that s is a unicode string with 2 unicode code points. Note that
the file has 4 bytes, since it is that 2-code-point sequence encoded in
UTF-8, and neither code point is ASCII.
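
The arithmetic can be replayed step by step. This is a sketch in
Python 3 syntax (an assumption), with the byte-to-code-point step
modeled as a Latin-1 decode; the variable names are mine.

```python
original = 'Ű'                          # U+0170, what was typed
utf8_bytes = original.encode('utf-8')   # what the UTF-8 terminal sent

# Assumption: each byte becomes one code point (Latin-1 decode),
# matching the u'\xc5\xb0' seen in the session above.
s = utf8_bytes.decode('latin-1')

# Writing s out as UTF-8 encodes each of the two code points
# as two bytes.
file_bytes = s.encode('utf-8')

print(len(utf8_bytes), len(s), len(file_bytes))
```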

I think your problem is that you expect the magic of decoding source
code from the byte sequence into unicode to happen in exec or eval. It
doesn't. It happens in between reading the file and passing the
contents of the file to exec or eval.