[Tutor] Decode and Encode

Wed Jan 28 12:26:50 CET 2015

On Wed, Jan 28, 2015 at 03:05:58PM +0530, Sunil Tech wrote:
> Hi All,
> 
> When i copied a text from web and pasted in the python-terminal, it
> automatically converted into unicode (i suppose)
> 
> can anyone tell me how it does?
> Eg:
> >>> p = "你好"
> >>> p
> '\xe4\xbd\xa0\xe5\xa5\xbd'

It is hard to say exactly, since we cannot be sure what p was supposed 
to be. My guess is that you are using Python 2.7, which uses 
byte-strings by default, not Unicode text-strings.

To answer your question properly, we need to know the operating 
system, which terminal you are using, and the terminal's encoding. I 
will guess a Linux system, with UTF-8 encoding in the terminal.
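
To see why the terminal's encoding matters, here is a rough sketch. It 
is written in Python 3 syntax, and the choice of GBK as the "other" 
encoding is only my assumption for illustration. The same two 
characters turn into different bytes depending on the encoding:

```python
# The two characters from the question, written as escapes so the
# example does not depend on how this message itself is encoded.
text = '\u4f60\u597d'

utf8_bytes = text.encode('utf-8')  # what a UTF-8 terminal would hand over
gbk_bytes = text.encode('gbk')     # what a GBK terminal would hand over

print(utf8_bytes)
print(gbk_bytes)
print(len(utf8_bytes), len(gbk_bytes))
```

The UTF-8 result is the six bytes you saw. A terminal using another 
encoding would give Python a different byte sequence for the same text.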

So, when you paste some Unicode text into the terminal, the terminal 
receives the UTF-8 bytes, and displays the characters:

你好

On my system, they display as boxes, but I expect that they are:

CJK UNIFIED IDEOGRAPH-4F60
CJK UNIFIED IDEOGRAPH-597D

But because this is Python 2, and you used a byte-string "..." instead 
of a Unicode string u"...", Python sees the raw UTF-8 bytes.

py> s = u'你好'  # Note this is a Unicode string u'...'
py> import unicodedata
py> for c in s:
...     print unicodedata.name(c)
...
CJK UNIFIED IDEOGRAPH-4F60
CJK UNIFIED IDEOGRAPH-597D
py> s.encode('UTF-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'

which matches your results.
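
You can also check this in the other direction. A minimal sketch, in 
Python 3 syntax and still assuming the terminal used UTF-8: take the 
bytes you saw and decode them back to text.

```python
import unicodedata

# The bytes shown in the original session.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Assumption: the bytes are UTF-8. A wrong guess here would either
# raise UnicodeDecodeError or silently produce the wrong characters.
text = raw.decode('utf-8')

names = [unicodedata.name(c) for c in text]
print(names)

# Encoding the text again gives back exactly the same bytes.
print(text.encode('utf-8') == raw)
```

Six bytes go in, two characters come out, and their names are the same 
two CJK ideographs listed above.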

Likewise for this example:

py> s = u'ªîV'  # make sure to use Unicode u'...'
py> for c in s:
...     print unicodedata.name(c)
...
FEMININE ORDINAL INDICATOR
LATIN SMALL LETTER I WITH CIRCUMFLEX
LATIN CAPITAL LETTER V
py> s.encode('utf8')
'\xc2\xaa\xc3\xaeV'

which matches yours:

> >>> o = 'ªîV'
> >>> o
> '\xc2\xaa\xc3\xaeV'

Obviously all this is confusing and error-prone. In Python 3, the 
interpreter defaults to Unicode text strings, so this issue goes away.
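
For comparison, here is a small sketch of what the same experiment 
looks like under Python 3 (the variable names are only for 
illustration). A plain "..." literal is already Unicode text, and you 
only get bytes when you ask for them:

```python
s = '\u4f60\u597d'        # plain literal: already Unicode text (str)
b = s.encode('utf-8')     # bytes appear only after an explicit encode

print(type(s).__name__, len(s))   # text: counted in characters
print(type(b).__name__, len(b))   # bytes: counted in bytes
print(ascii(s))                   # escaped view of the text
print(b.decode('utf-8') == s)     # explicit decode round-trips
```

Text and bytes are separate types in Python 3, so it is much harder to 
mix them up by accident.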

-- 
Steve