to express unicode string

Michael Torrie torriem at gmail.com
Sat Jan 28 14:58:31 EST 2012


On 01/28/2012 12:21 AM, contro opinion wrote:
>>>> s='你好'

On my computer, s is a byte string containing the UTF-8 encoding of
你好.  This has nothing to do with Python itself, though, and everything
to do with the terminal and the interpreter's line editor.  In other
words, the text is already encoded as UTF-8 before Python even sees it.

So in this instance to convert s to a proper unicode string instead of a
utf-8-encoded byte string, you do:

us = s.decode('utf-8')
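
As a minimal sketch of that conversion, the UTF-8 bytes are written out
explicitly here so the example does not depend on any terminal.  The b''
prefix is added for illustration: it is optional in Python 2.6+ and
required in Python 3, where byte strings are a separate type.

```python
# UTF-8 bytes for the two characters, spelled out so no terminal is involved.
s = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decode the byte string into a unicode string.
us = s.decode('utf-8')

print(us == u'\u4f60\u597d')  # True
```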

The encoding of s probably depends on your terminal's encoding.  Mine
is UTF-8, so that's what s ends up encoded as.  This is confusing, isn't
it?  You are dealing with several things at once: 1. the terminal's
character set, 2. the Python interpreter's line editor (which is
readline on my computer), and 3. Python itself.

When a script file is run by the Python interpreter, you can declare
the file's encoding in a comment on the first or second line of the
file (see http://www.python.org/dev/peps/pep-0263/).  I think most text
editors save as UTF-8 by default, so the string s = '你好', when looked
at with a hex editor, would already be stored as UTF-8 bytes.
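
A sketch of what such a declaration looks like, following PEP 263.  It
assumes the file really is saved as UTF-8; the declaration only tells
Python how to read the bytes, it does not convert the file.

```python
# -*- coding: utf-8 -*-
# The line above must be the first or second line of the source file.

s = u'你好'

# If the declared encoding matches how the file was saved, the literal and
# the escaped form are the same two-character string.
print(s == u'\u4f60\u597d')
```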

s = '\xc4\xe3\xba\xc3'

>>>> t=u'你好'
>>>> s
> '\xc4\xe3\xba\xc3'

The result of these two lines is going to be different depending on your
terminal's encoding and the line editor.  As I said before, the byte
string assigned to s is determined not by Python in this case, but by
the editor and terminal.
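
To illustrate how the same two characters turn into different byte
strings, the sketch below encodes them two ways.  The bytes in the
quoted session match the GBK encoding of 你好 rather than UTF-8, so it
seems likely (an assumption about the original poster's terminal, not
something stated in the thread) that s.decode('gbk') is what would work
there.

```python
text = u'\u4f60\u597d'

utf8_bytes = text.encode('utf-8')   # what a UTF-8 terminal would hand over
gbk_bytes = text.encode('gbk')      # what a GBK terminal would hand over

print(repr(utf8_bytes))
print(repr(gbk_bytes))

# Decoding only round-trips when the codec matches the bytes.
print(gbk_bytes.decode('gbk') == text)

try:
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print('wrong codec: %s' % exc)
```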

>>>> t
> u'\u4f60\u597d'
>>>> t=us
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'us' is not defined

Of course.  There is no variable called 'us'.

>>>>
> how can i use us to express  u'你好'??

Provided your python file, terminal, and editor all agree on the text
encoding:
s = u'你好'

Python normally uses whatever is set in the environment, which on my
computer is en_US.UTF-8, hence utf-8.  Could be different on your computer.

or

s = u'\u4f60\u597d'
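
One way to see which encoding Python picked up from the environment is
to ask it.  This is only a sketch: terminal_encoding is a hypothetical
helper name, and the value it returns depends on the machine.

```python
import locale
import sys


def terminal_encoding():
    """Best guess at the encoding of text typed at the prompt."""
    # sys.stdin.encoding can be None when input is not a terminal (e.g. a pipe),
    # so fall back to the locale's preferred encoding.
    return getattr(sys.stdin, 'encoding', None) or locale.getpreferredencoding()


print(terminal_encoding())
```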

> can i add someting in  us  to  express   u'你好'??

That works directly on my terminal.

Unicode is definitely a challenge.  Python 3 makes it easier by making
ordinary strings unicode by default.  But you still have the challenge
of making sure your Python source file is saved in the encoding Python
expects (normally utf-8).
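
For comparison, a small sketch of the Python 3 behaviour: a plain string
literal is already unicode, and bytes only appear when you encode
explicitly.  The escaped form is used so the example does not depend on
how this text itself is encoded.

```python
s = '\u4f60\u597d'            # plain str literal, no u prefix needed

print(type(s).__name__)       # str
print(len(s))                 # 2 characters

data = s.encode('utf-8')      # explicit step to get bytes
print(len(data))              # 6 bytes
print(data.decode('utf-8') == s)
```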



