[Tutor] Python String and Unicode data types and Encode Decode Functions

Anshu Kumar anshu.kumar726 at gmail.com
Sat Dec 19 21:47:21 EST 2015


Hi Everyone,

In my current project I am dealing a lot with the unicode type. There are
some text files which contain Unicode text to accommodate data in multiple
languages. I have to continuously parse these files, which are in XML or
YAML format, using the xml and yaml libraries. I have encountered several
errors caused by unicode values and had to encode such text to UTF-8 using
the encode('utf-8') method. Although I could resolve my issues, I still do
not really understand the unicode and str data types or the encode and
decode methods.

Here is what I understand so far:

1. A string (str) is nothing but a byte array, so it has only 8 bits per
character and encodes characters using ASCII. It should therefore not be
used for characters from other languages, which is why the broader unicode
type exists.

2. Python internally uses a different implementation to store strings in RAM.

3. print can output both str and unicode values because it has some kind
of function overloading.


4. u'', that is, a u prefix before single or double quotes, tells the
Python interpreter that the literal is of type unicode and not str.


Now my doubts start

*1. I tried the code below and see that Japanese characters can be stored
in a str. I do not understand how this is possible.*

>>> temo = 'いい'
>>> temo
'\xe3\x81\x84\xe3\x81\x84'
>>> print temo
いい
>>> type(temo)
<type 'str'>
>>>
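
A minimal sketch that reproduces the same six bytes without relying on the
terminal, assuming the terminal sends UTF-8 (which the repr above suggests).
The names text and data are only for illustration, and the snippet is
written so that it behaves the same on Python 2 and Python 3:

```python
# Illustrative names; assumes the terminal encoding is UTF-8.
text = u'\u3044\u3044'        # two code points (HIRAGANA LETTER I, twice)
data = text.encode('utf-8')   # text -> bytes; each code point becomes 3 bytes

print(len(text))              # 2
print(len(data))              # 6

# The bytes match the repr shown in the session above.
assert data == b'\xe3\x81\x84\xe3\x81\x84'
# Decoding with the same codec gives the original text back.
assert data.decode('utf-8') == text
```

If this assumption holds, the str in the session is holding the UTF-8 bytes
of the text rather than the characters themselves.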


*2. When I try to iterate over the characters I do not get anything meaningful*

>>> for character in temo:
...     print character
...

�
�

�
�
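
A minimal sketch of why the loop output might look like this, assuming the
str holds UTF-8 bytes: a single byte taken out of a three-byte sequence is
not valid UTF-8 on its own, while decoding the whole value first yields two
characters. The names data, first_byte_ok and chars are illustrative only:

```python
# Illustrative names; the bytes are the ones from the session above.
data = b'\xe3\x81\x84\xe3\x81\x84'

# One byte sliced out of a multi-byte sequence cannot be decoded by itself.
try:
    data[0:1].decode('utf-8')
    first_byte_ok = True
except UnicodeDecodeError:
    first_byte_ok = False

# Decoding everything first gives text that can be iterated per character.
chars = list(data.decode('utf-8'))

assert first_byte_ok is False
assert chars == [u'\u3044', u'\u3044']
```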

*3. When I call len() I get a length of 6*

>>> len(temo)
6

Why so?
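
A small sketch of the arithmetic, assuming UTF-8 is the encoding involved:
each of the two characters takes three bytes, so a byte count gives 6. The
names text and sizes are illustrative only:

```python
# Illustrative names; assumes UTF-8.
text = u'\u3044\u3044'

# Number of UTF-8 bytes needed for each individual character.
sizes = [len(ch.encode('utf-8')) for ch in text]

assert sizes == [3, 3]
assert sum(sizes) == len(text.encode('utf-8')) == 6
```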


*4. When I try to encode and print each character I get the error below*

>>> for character in temo:
...     print character.encode('utf-8')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
ordinal not in range(128)
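
A hedged sketch of where the 'ascii' codec in this traceback may come from.
In Python 2, calling encode() on a str first decodes it implicitly with the
default codec, which is ASCII. The snippet below performs that decode step
explicitly so that it runs on both Python 2 and Python 3; the names data,
error, text and round_trip are illustrative only:

```python
# Illustrative names; the bytes are the ones from the session above.
data = b'\xe3\x81\x84\xe3\x81\x84'

# The explicit equivalent of the implicit step: decoding as ASCII fails
# on the very first byte, because 0xe3 is outside range(128).
try:
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

assert error is not None
assert error.encoding == 'ascii'
assert error.start == 0

# Decoding with the codec the bytes were actually written in works,
# and encoding the resulting text returns the same bytes.
text = data.decode('utf-8')
round_trip = text.encode('utf-8')
assert round_trip == data
```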


I am not able to work out how unicode and str behave behind the scenes
from the facts I know. Please help me to understand this magic.

Thanks a lot in advance,
Anshu

