[Tutor] Python String and Unicode data types and Encode Decode Functions

Sun Dec 20 05:25:27 EST 2015

On Sun, Dec 20, 2015 at 08:17:21AM +0530, Anshu Kumar wrote:

> I know certain facts like

What version of Python are you using? My *guess* is that you are using 
Python 2.7, is that correct?

What operating system are you using? Windows, Linux, Mac OS X, Unix, 
something else?

> 1. String is nothing but a byte array so it has only 8 bits to encode
> character using ascii, so it should not be used whenever we have characters
> from other language thats why a broader type unicode is used.

If you are using Python 2, this is correct.

> 2. Python internally uses different implementation  to store strings in RAM

Correct.

> 3. print function can print both string and unicode because it has some
> kind of function overloading.

Mostly correct. In Python 2, `print` is a statement rather than a 
function, but it can print any object in Python (within reason). 
The details are probably not important.

> 4. u'' , that is u prefixed before single quotes or double quotes tells
> python interpreter that the following type is unicode and not a string.

In Python 2, this is correct.

> Now my doubts start

Unicode is sometimes hard to understand, and unfortunately Python 2 
makes it even harder rather than easier.

> *1. I tried below code and see that japanese characters can be accommodated
> in strings. I do not get how is it possible?*
> 
> >>> temo = 'いい'
> >>> temo
> '\xe3\x81\x84\xe3\x81\x84'
> >>> print temo
> いい
> >>> type(temo)
> <type 'str'>

This appears to be Python 2 code.

The fact that this works is an accident of the terminal/console you are 
using. If you tried it on another computer, using a different OS, you 
might find the results will change or possibly won't work at all.

What seems to be happening is this:

(1) You type, or paste, two Unicode characters into the terminal input 
buffer, namely `HIRAGANA LETTER I` repeated twice.

(2) The terminal is set to use UTF-8 encoding, so it puts the six bytes 
\xe3\x81\x84 \xe3\x81\x84 into the buffer, and displays HIRAGANA LETTER 
I twice.

(3) Python generates a string object containing six bytes (as above).

(4) When you print those six bytes, the terminal recognises this as 
UTF-8, and displays HIRAGANA LETTER I (twice).

This seems to work, but it is an accident. If you change the terminal 
settings, you will see different results:

# terminal using UTF-8 as the encoding
py> s = 'いい'  # looks like HIRAGANA LETTER I twice but actually six bytes
py> print s  # terminal recognises this as UTF-8
いい
py> s  # The actual six bytes.
'\xe3\x81\x84\xe3\x81\x84'

# now I change the terminal to use ISO-8859-7 (Greek) instead.
# the same six bytes now display as a Greek character plus invisible 
# control characters
py> print s  
γγ

# now I change the terminal to use Latin-1 (ISO-8859-1)
py> print s
ãã

So the results you get are dependent on the terminal's encoding. The 
fact that it happens to work on your computer is a lucky accident.
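
You can reproduce the effect without touching your terminal settings. 
The sketch below is my own illustration, not part of your session: it 
takes the same six bytes and decodes them with the three encodings 
mentioned above, then prints the resulting code points so the output 
does not depend on your terminal. The variable names are made up, and 
it is written so that it should run unchanged on Python 2.7 and 
Python 3.

```python
# The six bytes a UTF-8 terminal produces for HIRAGANA LETTER I typed twice.
raw = b'\xe3\x81\x84\xe3\x81\x84'

# Interpret the same bytes the way three differently configured
# terminals would.
as_utf8 = raw.decode('utf-8')
as_greek = raw.decode('iso-8859-7')
as_latin1 = raw.decode('iso-8859-1')

for label, text in [('utf-8', as_utf8),
                    ('iso-8859-7', as_greek),
                    ('iso-8859-1', as_latin1)]:
    # Print code points rather than the characters themselves, so that
    # the output is plain ASCII on any terminal.
    print('%-10s %s' % (label, [hex(ord(ch)) for ch in text]))
```

Only the UTF-8 reading gives two characters. The other two give six 
characters each, four of which are invisible control characters.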

Instead, you should ensure Python knows to use proper Unicode text:

py> s = u'いい'
py> print s
いい

Provided your terminal is capable of entering the HIRAGANA LETTER I 
character in the first place, then s will ALWAYS be treated as that same 
HIRAGANA LETTER I. (Although, if you change the encoding of the 
terminal, it may print differently. That's not Python's fault -- that's 
the terminal.)

In this case, instead of s being a *byte* string of three bytes, 
\xe3 \x81 \x84, s is a Unicode string of ONE character い. (Double 
everything if you enter the character twice.)

> *2. When i try to iterate over characters i do not get anything meaningful*
> 
> for character in temo:
> ...     print character
> ...

You are trying to print these six bytes one at a time, and none of them 
is a complete UTF-8 character on its own:

\xe3 \x81 \x84 \xe3 \x81 \x84

What they look like on your system could be anything -- a blank space, 
no space at all, or the "missing character" glyph.
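
To see why the loop gives nothing meaningful, here is a small sketch 
(my own names, and it should run on both Python 2.7 and Python 3). It 
tries to decode each of the six bytes on its own, counts how many fail, 
and then loops over the decoded text instead.

```python
raw = b'\xe3\x81\x84\xe3\x81\x84'  # UTF-8 bytes for HIRAGANA LETTER I, twice

# Looping over the byte string visits six separate bytes. Try to decode
# each one individually and count the failures.
undecodable = 0
for value in bytearray(raw):            # bytearray yields ints on 2.7 and 3
    single = bytes(bytearray([value]))  # a byte string of length one
    try:
        single.decode('utf-8')
    except UnicodeDecodeError:
        undecodable += 1

# Decoding first and looping afterwards visits two whole characters.
text = raw.decode('utf-8')
code_points = [hex(ord(ch)) for ch in text]

print(undecodable)
print(code_points)
```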

> *3 . When I do I get length  as 6 *
> 
> len(temo)
> 6
> 
> Why so?

Because your terminal has entered each Unicode character as three UTF-8 
bytes; you have two characters, and 2*3 is 6. Hence the byte string is 
length six.

If you use a Unicode string, the length will be two Unicode characters.
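
A quick way to check the arithmetic, as a sketch with made-up variable 
names that should behave the same on Python 2.7 and Python 3:

```python
text = u'\u3044\u3044'        # HIRAGANA LETTER I twice, as Unicode text
raw = text.encode('utf-8')    # the same text as UTF-8 bytes

# Number of UTF-8 bytes needed for each character.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]

print(len(text))       # length counted in characters
print(len(raw))        # length counted in bytes
print(bytes_per_char)
```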

Internally, in the computer's RAM, those two characters might be stored 
as any of the following bytes:

# UTF-16 Big Endian
\x30 \x44 \x30 \x44

# UTF-16 Little Endian
\x44 \x30 \x44 \x30

# UTF-32 Big Endian
\x00 \x00 \x30 \x44 \x00 \x00 \x30 \x44

# UTF-32 Little Endian
\x44 \x30 \x00 \x00 \x44 \x30 \x00 \x00

# UTF-8
\xe3 \x81 \x84 \xe3 \x81 \x84

depending on the implementation and version of Python. The internal 
representation isn't very important; the important thing is that Unicode 
strings are treated as sequences of Unicode characters, not as sequences 
of bytes.
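
If you want to check the byte sequences listed above, you can ask 
Python to encode the two characters explicitly. Note that this sketch 
only shows what each encoding produces on request; it does not inspect 
what the interpreter really keeps in memory. The names are my own, and 
it should run on Python 2.7 and Python 3.

```python
import binascii

text = u'\u3044\u3044'  # HIRAGANA LETTER I twice

layouts = {}
for codec in ('utf-16-be', 'utf-16-le', 'utf-32-be', 'utf-32-le', 'utf-8'):
    encoded = text.encode(codec)
    # hexlify returns a byte string of hex digits; turn it into text.
    layouts[codec] = binascii.hexlify(encoded).decode('ascii')
    print('%-9s %2d bytes  %s' % (codec, len(encoded), layouts[codec]))
```

Whichever encoding you pick, decoding the bytes with the same codec 
gives back the same two characters.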

> *4.  When i try to spit out each character I get below error*
> 
>  for character in temo:
> ...     print character.encode('utf-8')
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
> ordinal not in range(128)

You have two problems here. The first is that you are trying to 
encode bytes to bytes. That doesn't work. The rule is:

Unicode characters *encode* to bytes;
Bytes *decode* to Unicode characters.

The second is that Python 2 has a design flaw which, unfortunately, 
allows you to attempt the following (wrong) transformations:

# Wrong, don't do this:
Unicode characters decoded to bytes;
Bytes encoded to Unicode characters.

Python 2 will try to make sense of this, but will usually get it wrong. 
That's what is happening here: you're asking for an impossible 
transformation (encoding bytes that are already encoded), so Python 
tries to be helpful by guessing that you wanted to *decode* the bytes 
as ASCII first and then encode the result. That implicit decode is the 
step that failed.

This confusing and broken design is fixed in Python 3.
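
Here is a sketch of the hidden step, spelled out by hand. It is my own 
reconstruction of roughly what Python 2 does behind the scenes, assuming 
the default encoding is ASCII, and it should run on both Python 2.7 and 
Python 3.

```python
raw = b'\xe3'  # first byte of the UTF-8 encoding of HIRAGANA LETTER I

# Python 2 treats raw.encode('utf-8') roughly like
# raw.decode('ascii').encode('utf-8'). Performing the decode explicitly
# shows where the error comes from.
try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Python 3 byte strings have no encode method at all, so the mistake is
# caught immediately there. On Python 2 this prints True.
print(hasattr(raw, 'encode'))
```

The message is the same one you saw in your traceback, even though the 
code you wrote asked for an encode, not a decode.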

The RIGHT way to do this is to use a Unicode string, not a byte string:

py> s = u'い'  # HIRAGANA LETTER I
py> print s
い
py> b = s.encode('utf-8')
py> b
'\xe3\x81\x84'
py> b.decode('utf-8') == s
True

Hope that this helps, and feel free to ask more questions!

-- 
Steve