[Tutor] Python String and Unicode data types and Encode Decode Functions

Anshu Kumar anshu.kumar726 at gmail.com
Mon Dec 21 04:05:25 EST 2015


Hi Steven.

Thanks a lot for your reply. Now I have some clarity.

Could you please confirm whether I have understood it correctly?


1. There is an internal encoding of input characters to bytes, which depends
on the Python version and the OS.
2. For non-Unicode types, every piece of text is encoded as in point 1 and is
printed by decoding it the same way.
3. Unicode is a universal coding and will not fail to accommodate any input
character.
4. For the unicode type, each character is first encoded to Unicode bytes and
then encoded again to be stored in memory.
5. That's why we need to encode Unicode characters to bytes.
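
A minimal sketch of points 3 and 5, assuming Python 3 (where str is Unicode text and bytes is the byte string, unlike the Python 2 session quoted below). The variable names are only illustrative. It shows text being encoded to bytes, decoded back, and failing to encode with a codec that cannot represent the characters:

```python
# Python 3 sketch: str holds Unicode text, bytes holds raw bytes.
# '\u3044' is HIRAGANA LETTER I, written as an escape so the source stays ASCII.
text = '\u3044\u3044'

utf8_bytes = text.encode('utf-8')        # text -> bytes
round_trip = utf8_bytes.decode('utf-8')  # bytes -> text

# A codec with a smaller repertoire cannot represent these characters.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, round_trip == text, ascii_failed)
```

The text itself is never "stored as UTF-8" from the program's point of view; bytes only appear once an encoding is chosen explicitly.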

Thanks and Regards,
Anshu

On Sun, Dec 20, 2015 at 3:55 PM, Steven D'Aprano <steve at pearwood.info>
wrote:

> On Sun, Dec 20, 2015 at 08:17:21AM +0530, Anshu Kumar wrote:
>
> > I know certain facts like
>
> What version of Python are you using? My *guess* is that you are using
> Python 2.7, is that correct?
>
> What operating system are you using? Windows, Linux, Mac OS X, Unix,
> something else?
>
>
> > 1. A string is nothing but a byte array, so it has only 8 bits to encode
> > a character using ASCII. It should not be used whenever we have characters
> > from other languages; that's why a broader type, unicode, is used.
>
> If you are using Python 2, this is correct.
>
> > 2. Python internally uses different implementation to store strings in RAM
>
> Correct.
>
>
> > 3. print function can print both string and unicode because it has some
> > kind of function overloading.
>
> Mostly correct. `print` can print any object in Python (within reason).
> The details are probably not important.
>
>
> > 4. u'' , that is u prefixed before single quotes or double quotes tells
> > python interpreter that the following type is unicode and not a string.
>
> In Python 2, this is correct.
>
>
> > Now my doubts start
>
> Unicode is sometimes hard to understand, and unfortunately Python 2
> makes it even harder rather than easier.
>
>
> > *1. I tried the code below and see that Japanese characters can be
> > accommodated in strings. I do not understand how this is possible.*
> >
> > >>> temo = 'いい'
> > >>> temo
> > '\xe3\x81\x84\xe3\x81\x84'
> > >>> print temo
> > いい
> > >>> type(temo)
> > <type 'str'>
>
> This appears to be Python 2 code.
>
> The fact that this works is an accident of the terminal/console you are
> using. If you tried it on another computer, using a different OS, you
> might find the results will change or possibly won't work at all.
>
> What seems to be happening is this:
>
> (1) You type, or paste, two Unicode characters into the terminal input
> buffer, namely `HIRAGANA LETTER I` repeated twice.
>
> (2) The terminal is set to use UTF-8 encoding, so it puts the six bytes
> \xe3\x81\x84 \xe3\x81\x84 into the buffer, and displays HIRAGANA LETTER
> I twice.
>
> (3) Python generates a string object containing six bytes (as above).
>
> (4) When you print those six bytes, the terminal recognises this as
> UTF-8, and displays HIRAGANA LETTER I (twice).
>
>
> This seems to work, but it is an accident. If you change the terminal
> settings, you will see different results:
>
>
> # terminal using UTF-8 as the encoding
> py> s = 'いい'  # looks like HIRAGANA LETTER I twice but actually six bytes
> py> print s  # terminal recognises this as UTF-8
> いい
> py> s  # The actual six bytes.
> '\xe3\x81\x84\xe3\x81\x84'
>
> # now I change the terminal to use ISO-8859-7 (Greek) instead.
> # the same six bytes now display as a Greek character plus invisible
> # control characters
> py> print s
> γγ
>
>
> # now I change the terminal to use Latin-1 (ISO-8859-1)
> py> print s
> ãã
>
>
> So the results you get are dependent on the terminal's encoding. The
> fact that it happens to work on your computer is a lucky accident.
>
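
The terminal experiment above can be approximated without touching any terminal settings. This is a sketch assuming Python 3, where the decoding step is explicit; the codec names are the standard Python ones for the three encodings mentioned:

```python
# Python 3 sketch: the same six bytes decoded under three different encodings.
raw = b'\xe3\x81\x84\xe3\x81\x84'

decoded = {name: raw.decode(name)
           for name in ('utf-8', 'iso-8859-7', 'latin-1')}

for name, text in decoded.items():
    # ascii() escapes non-ASCII characters, so this output does not itself
    # depend on the encoding of the terminal it is printed to.
    print(name, len(text), ascii(text))
```

Only the UTF-8 interpretation yields two characters; the single-byte encodings each produce six, one per byte.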
>
> Instead, you should ensure Python knows to use proper Unicode text:
>
> py> s = u'いい'
> py> print s
> いい
>
>
> Provided your terminal is capable of entering the HIRAGANA LETTER I
> character in the first place, then s will ALWAYS be treated as that same
> HIRAGANA LETTER I. (Although, if you change the encoding of the
> terminal, it may print differently. That's not Python's fault -- that's
> the terminal.)
>
> In this case, instead of s being a *byte* string of three bytes,
> \xe3 \x81 \x84, s is a Unicode string of ONE character い. (Double
> everything if you enter the character twice.)
>
>
> > *2. When I try to iterate over the characters I do not get anything
> > meaningful*
> >
> > for character in temo:
> > ...     print character
> > ...
>
> You are trying to print the six bytes one at a time, and no single one
> of them is a valid UTF-8 sequence by itself:
>
> \xe3 \x81 \x84 \xe3 \x81 \x84
>
> What they look like on your system could be anything -- a blank space,
> no space at all, or the "missing character" glyph.
>
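
A sketch of the difference between iterating over bytes and iterating over text, assuming Python 3 (where iterating over a bytes object yields integers rather than one-character byte strings as in Python 2). The names are illustrative only:

```python
# Python 3 sketch: iterating over bytes versus iterating over text.
raw = b'\xe3\x81\x84\xe3\x81\x84'
text = raw.decode('utf-8')

# Iterating over bytes yields one integer per byte.
byte_values = [hex(b) for b in raw]

# Iterating over text yields one character per step.
code_points = ['U+%04X' % ord(ch) for ch in text]

# A single byte taken out of a multi-byte sequence cannot be decoded alone.
try:
    raw[:1].decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(byte_values, code_points, lone_byte_ok)
```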
>
> > *3. When I call len() I get a length of 6*
> >
> > len(temo)
> > 6
> >
> > Why so?
>
> Because your terminal has entered each Unicode character as three UTF-8
> bytes; you have two characters, and 2*3 is 6. Hence the byte string is
> length six.
>
> If you use a Unicode string, the length will be two Unicode characters.
>
> Internally, in the computer's RAM, those two characters might be stored
> as any of the following bytes:
>
> # UTF-16 Big Endian
> \x30 \x44 \x30 \x44
>
> # UTF-16 Little Endian
> \x44 \x30 \x44 \x30
>
> # UTF-32 Big Endian
> \x00 \x00 \x30 \x44 \x00 \x00 \x30 \x44
>
> # UTF-32 Little Endian
> \x44 \x30 \x00 \x00 \x44 \x30 \x00 \x00
>
> # UTF-8
> \xe3 \x81 \x84 \xe3 \x81 \x84
>
> depending on the implementation and version of Python. The internal
> representation isn't very important, the important thing is that Unicode
> strings are treated as sequences of Unicode characters, not as sequences
> of bytes.
>
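
The byte sequences listed above can be reproduced by encoding explicitly. This is a sketch assuming Python 3.5 or later (for bytes.hex()); it says nothing about which representation a given interpreter actually uses internally:

```python
# Python 3 sketch: one two-character string, five possible byte encodings.
text = '\u3044\u3044'  # HIRAGANA LETTER I, twice

encodings = ('utf-16-be', 'utf-16-le', 'utf-32-be', 'utf-32-le', 'utf-8')
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print('%-10s %2d bytes  %s' % (name, len(data), data.hex()))

# The number of characters is the same whatever the byte encoding.
print('characters:', len(text))
```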
>
> > *4. When I try to spit out each character I get the error below*
> >
> >  for character in temo:
> > ...     print character.encode('utf-8')
> > ...
> > Traceback (most recent call last):
> >   File "<stdin>", line 2, in <module>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
> > ordinal not in range(128)
>
>
> You have two problems here. The first is that you are trying to
> encode bytes to bytes. That doesn't work. The rule is:
>
> Unicode characters *encode* to bytes;
> Bytes *decode* to Unicode characters.
>
> Unfortunately, Python 2 has a design flaw that allows you to attempt the
> following (wrong) transformations:
>
> # Wrong, don't do this:
> Unicode characters decoded to bytes;
> Bytes encoded to Unicode characters.
>
> Python 2 will try to make sense of this, but will usually get it wrong.
> That's what is happening here: you're asking for an impossible
> transformation (bytes encoded to Unicode) so Python tries to be helpful
> by guessing that maybe you wanted to *decode* the bytes as ASCII first,
> but that failed.
>
> This confusing and broken design is fixed in Python 3.
>
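
A sketch of how this looks in Python 3, as an assumption-labelled illustration rather than a description of Python 2's internals: text has no decode method, bytes have no encode method, and the ASCII decoding step that Python 2 attempts implicitly can be reproduced explicitly to get the same kind of error.

```python
# Python 3 sketch: the "wrong direction" methods do not exist at all.
text = '\u3044'             # HIRAGANA LETTER I
data = text.encode('utf-8')  # three bytes

text_can_decode = hasattr(text, 'decode')   # str only encodes
bytes_can_encode = hasattr(data, 'encode')  # bytes only decode

# Reproduce explicitly the ASCII decoding that fails on a non-ASCII byte.
try:
    data.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(text_can_decode, bytes_can_encode)
print(error_message)
```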
>
> The RIGHT way to do this is to use a Unicode string, not a byte string:
>
> py> s = u'い'  # HIRAGANA LETTER I
> py> print s
> い
> py> b = s.encode('utf-8')
> py> b
> '\xe3\x81\x84'
> py> b.decode('utf-8') == s
> True
>
>
>
> Hope that this helps, and feel free to ask more questions!
>
>
>
> --
> Steve
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>

