[Tutor] Python String and Unicode data types and Encode Decode Functions
Steven D'Aprano
steve at pearwood.info
Sun Dec 20 05:25:27 EST 2015
On Sun, Dec 20, 2015 at 08:17:21AM +0530, Anshu Kumar wrote:
> I know certain facts like
What version of Python are you using? My *guess* is that you are using
Python 2.7, is that correct?
What operating system are you using? Windows, Linux, Mac OS X, Unix,
something else?
> 1. String is nothing but a byte array so it has only 8 bits to encode
> character using ascii, so it should not be used whenever we have characters
> from other language thats why a broader type unicode is used.
If you are using Python 2, this is correct.
> 2. Python internally uses different implementation to store strings in RAM
Correct.
> 3. print function can print both string and unicode because it has some
> kind of function overloading.
Mostly correct. `print` can print any object in Python (within reason).
The details are probably not important.
> 4. u'' , that is u prefixed before single quotes or double quotes tells
> python interpreter that the following type is unicode and not a string.
In Python 2, this is correct.
> Now my doubts start
Unicode is sometimes hard to understand, and unfortunately Python 2
makes it even harder rather than easier.
> *1. I tried below code and see that japanese characters can be accommodated
> in strings. I do not get how is it possible?*
>
> >>> temo = 'いい'
> >>> temo
> '\xe3\x81\x84\xe3\x81\x84'
> >>> print temo
> いい
> >>> type(temo)
> <type 'str'>
This appears to be Python 2 code.
The fact that this works is an accident of the terminal/console you are
using. If you tried it on another computer, using a different OS, you
might find the results will change or possibly won't work at all.
What seems to be happening is this:
(1) You type, or paste, two Unicode characters into the terminal input
buffer, namely `HIRAGANA LETTER I` repeated twice.
(2) The terminal is set to use UTF-8 encoding, so it puts the six bytes
\xe3\x81\x84 \xe3\x81\x84 into the buffer, and displays HIRAGANA LETTER
I twice.
(3) Python generates a string object containing six bytes (as above).
(4) When you print those six bytes, the terminal recognises this as
UTF-8, and displays HIRAGANA LETTER I (twice).
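The four steps above can be sketched roughly as follows. This is written for Python 3 (where a plain string literal is already Unicode text), so it only imitates what the terminal does; the variable names are mine, purely for illustration.

```python
import unicodedata

# Python 3 sketch: '\u3044' is the single character HIRAGANA LETTER I.
char = '\u3044'
name = unicodedata.name(char)

# Step (2): a UTF-8 terminal turns the two typed characters into bytes.
typed = (char * 2).encode('utf-8')

# Step (4): a UTF-8 terminal turns the same bytes back into characters.
shown = typed.decode('utf-8')

print(name)                 # official Unicode name of the character
print(typed, len(typed))    # the raw bytes and how many there are
print(shown == char * 2)    # the round trip gives back the same text
```

The bytes in `typed` are the same six bytes that appear in your `temo` string.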
This seems to work, but it is an accident. If you change the terminal
settings, you will see different results:
# terminal using UTF-8 as the encoding
py> s = 'いい' # looks like HIRAGANA LETTER I twice but actually six bytes
py> print s # terminal recognises this as UTF-8
いい
py> s # The actual six bytes.
'\xe3\x81\x84\xe3\x81\x84'
# now I change the terminal to use ISO-8859-7 (Greek) instead.
# each group of three bytes now displays as one Greek character
# followed by two invisible control characters
py> print s
γγ
# now I change the terminal to use Latin-1 (ISO-8859-1)
py> print s
ãã
So the results you get are dependent on the terminal's encoding. The
fact that it happens to work on your computer is a lucky accident.
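You can imitate the terminal's behaviour without touching its settings by decoding the same six bytes with different codecs. This is a Python 3 sketch; the codec names are the standard ones that ship with Python, and the `results` dict is just my own bookkeeping.

```python
import unicodedata

# The six bytes that the UTF-8 terminal stored in the byte string.
raw = b'\xe3\x81\x84\xe3\x81\x84'

results = {}
for codec in ('utf-8', 'iso8859_7', 'latin-1'):
    text = raw.decode(codec)
    # Record how many characters we got and the name of the first one.
    results[codec] = (len(text), unicodedata.name(text[0]))
    print(codec, results[codec])
```

Only the UTF-8 interpretation gives two characters; the other two codecs treat every byte as a separate character, which is why the terminal shows a Greek or Latin letter followed by invisible control characters.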
Instead, you should ensure Python knows to use proper Unicode text:
py> s = u'いい'
py> print s
いい
Provided your terminal is capable of entering the HIRAGANA LETTER I
character in the first place, then s will ALWAYS be treated as that same
HIRAGANA LETTER I. (Although, if you change the encoding of the
terminal, it may print differently. That's not Python's fault -- that's
the terminal.)
In this case, instead of s being a *byte* string of three bytes,
\xe3 \x81 \x84, s is a Unicode string of ONE character い. (Double
everything if you enter the character twice.)
> *2. When i try to iterate over characters i do not get anything meaningful*
>
> for character in temo:
> ... print character
> ...
You are trying to print the six individual bytes
\xe3 \x81 \x84 \xe3 \x81 \x84
one at a time. None of them is a complete UTF-8 sequence on its own, so
what they look like on your system could be anything -- a blank space,
no space at all, or the "missing character" glyph.
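To see why the individual bytes are meaningless, here is a small Python 3 sketch. The helper `decodes_alone` is hypothetical, something I made up for this example. It checks whether a single byte is valid UTF-8 by itself and compares iterating over bytes with iterating over Unicode text.

```python
text = '\u3044\u3044'          # two HIRAGANA LETTER I characters
data = text.encode('utf-8')    # the six UTF-8 bytes

def decodes_alone(chunk):
    """Return True if this byte string is valid UTF-8 by itself."""
    try:
        chunk.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Slice one byte at a time (iterating bytes directly yields integers).
single_bytes = [data[i:i + 1] for i in range(len(data))]
byte_results = [decodes_alone(b) for b in single_bytes]

# Iterating over the text string yields whole characters.
chars = list(text)

print(byte_results)
print(len(chars))
```

Each of the six bytes fails on its own, while looping over the Unicode string gives you the two characters you expected.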
> *3 . When I do I get length as 6 *
>
> len(temo)
> 6
>
> Why so?
Because your terminal has entered each Unicode character as three UTF-8
bytes; you have two characters, and 2*3 is 6. Hence the byte string is
length six.
If you use a Unicode string, the length will be two Unicode characters.
Internally, in the computer's RAM, those two characters might be stored
as any of the following bytes:
# UTF-16 Big Endian
\x30 \x44 \x30 \x44
# UTF-16 Little Endian
\x44 \x30 \x44 \x30
# UTF-32 Big Endian
\x00 \x00 \x30 \x44 \x00 \x00 \x30 \x44
# UTF-32 Little Endian
\x44 \x30 \x00 \x00 \x44 \x30 \x00 \x00
# UTF-8
\xe3 \x81 \x84 \xe3 \x81 \x84
depending on the implementation and version of Python. The internal
representation isn't very important; the important thing is that Unicode
strings are treated as sequences of Unicode characters, not as sequences
of bytes.
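If you want to check the byte sequences listed above for yourself, you can ask Python to encode the two characters explicitly. This Python 3 sketch only shows what each encoding produces; it says nothing about which one your interpreter actually uses internally.

```python
text = '\u3044\u3044'   # two HIRAGANA LETTER I characters

# The length of the text is counted in characters, not bytes.
print(len(text))

encoded = {}
for codec in ('utf-16-be', 'utf-16-le', 'utf-32-be', 'utf-32-le', 'utf-8'):
    encoded[codec] = text.encode(codec)
    # Show each byte as two hex digits, as in the list above.
    print(codec, ' '.join('%02x' % byte for byte in encoded[codec]))
```

Whichever encoding is used, decoding the bytes with the same codec gives back the same two-character string.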
> *4. When i try to spit out each character I get below error*
>
> for character in temo:
> ... print character.encode('utf-8')
> ...
> Traceback (most recent call last):
> File "<stdin>", line 2, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
> ordinal not in range(128)
You have two problems here. The first is that you are trying to
encode bytes to bytes. That doesn't work. The rule is:
Unicode characters *encode* to bytes;
Bytes *decode* to Unicode characters.
Unfortunately, Python 2 has a design flaw that allows you to attempt the
following (wrong) transformations:
# Wrong, don't do this:
Unicode characters decoded to bytes;
Bytes encoded to Unicode characters.
Python 2 will try to make sense of this, but will usually get it wrong.
That's what is happening here: you're asking for an impossible
transformation (encoding a byte string), so Python tries to be helpful
by guessing that maybe you wanted to *decode* the bytes as ASCII first,
only that failed.
This confusing and broken design is fixed in Python 3.
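For comparison, here is a Python 3 sketch of the difference. It checks that the two "wrong" methods no longer exist, then spells out by hand the hidden ASCII decode that I believe Python 2 performs when you call encode on a byte string; treat that second part as my reconstruction rather than the interpreter's exact code path.

```python
data = b'\xe3\x81\x84'   # the UTF-8 bytes for HIRAGANA LETTER I

# In Python 3, bytes cannot be encoded and text cannot be decoded.
bytes_has_encode = hasattr(data, 'encode')
str_has_decode = hasattr('text', 'decode')
print(bytes_has_encode, str_has_decode)

# Spelling out the implicit step: decode as ASCII first, then encode.
try:
    data.decode('ascii').encode('utf-8')
    message = None
except UnicodeDecodeError as err:
    message = str(err)
print(message)
```

The error comes from the decode step, not from the encode you asked for, which is why the traceback complains about the 'ascii' codec.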
The RIGHT way to do this is to use a Unicode string, not a byte string:
py> s = u'い' # HIRAGANA LETTER I
py> print s
い
py> b = s.encode('utf-8')
py> b
'\xe3\x81\x84'
py> b.decode('utf-8') == s
True
Hope that this helps, and feel free to ask more questions!
--
Steve