byte count unicode string

willie willie at jamots.com
Wed Sep 20 20:38:44 EDT 2006


 >willie wrote:
 >>
 >> Thanks for the thorough explanation. One last question
 >> about terminology then I'll go away :)
 >> What is the proper way to describe "ustr" below?

 >>  >>> ustr = buf.decode('UTF-8')
 >>  >>> type(ustr)
 >> <type 'unicode'>

 >> Is it a "unicode object that contains a UTF-8 encoded
 >> string object?"


John Machin:

 >No. It is a Python unicode object, period.
 >
 >1. If it did contain another object you would be (quite justifiably)
 >screaming your peripherals off about the waste of memory :-)
 >2. You don't need to concern yourself with the internals of a unicode
 >object; however, rest assured that it is *not* stored as UTF-8 -- so if
 >you are hoping for a quick "number of UTF-8 bytes without actually
 >producing a str object" method, you are out of luck.
 >
 >Consider this example: you have a str object which contains some
 >Russian text, encoded in cp1251.
 >
 >str1 = russian_text
 >unicode1 = str1.decode('cp1251')
 >str2 = unicode1.encode('utf-8')
 >unicode2 = str2.decode('utf-8')
 >Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1). There is
 >no way (without the above history) of determining how it was created --
 >and you don't need to care how it was created.
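
A minimal sketch of the round trip described above. It is written for Python 3, which is an assumption (the thread uses Python 2): there, `bytes` plays the role of the thread's `str`, and `str` plays the role of `unicode`. The sample word is hypothetical. It also shows the byte count the subject line asks about, which still requires producing the encoded bytes.

```python
# Python 3 sketch: bytes ~ the thread's str, str ~ the thread's unicode.
russian_text = "Привет"                 # hypothetical sample text (6 characters)

str1 = russian_text.encode("cp1251")    # bytes in cp1251, as in the example
unicode1 = str1.decode("cp1251")        # text: characters, no encoding attached
str2 = unicode1.encode("utf-8")         # same text, different byte representation
unicode2 = str2.decode("utf-8")

# Both text objects are indistinguishable, whatever bytes they came from.
assert unicode2 == unicode1
assert repr(unicode2) == repr(unicode1)

# Counting UTF-8 bytes means encoding first, then measuring the result.
utf8_byte_count = len(unicode1.encode("utf-8"))
print(len(unicode1), len(str1), utf8_byte_count)
```

Only the byte strings differ in length; the two decoded objects carry no trace of cp1251 or UTF-8.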


Gabriel Genellina:

 >ustr is a unicode object. Period. A unicode object contains
 >characters (not bytes).
 >buf, apparently, is a string - a string of bytes. Those bytes
 >apparently represent some unicode characters encoded using the UTF-8
 >encoding. So, you can decode them - using the decode() method - to get
 >the unicode object.
 >
 >Very roughly, the difference is like that of an integer and its
 >representations:
 >w = 1
 >x = 0x0001
 >y = 001
 >z = struct.unpack('>h','\x00\x01')[0]
 >All four objects are the *same* integer, 1.
 >There is no way of knowing *how* the integer was spelled, i.e., which
 >representation it came from - like the unicode object, it has
 >no "encoding" by itself.
 >You can go back and forth between an integer number and its decimal
 >representation - like astring.decode() and ustring.encode()
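
The integer analogy can be checked directly. This is a sketch in Python 3 (an assumption, since the thread uses Python 2), so the octal literal is spelled `0o1` and the packed data is a `bytes` literal.

```python
import struct

w = 1
x = 0x0001
y = 0o1                                   # Python 3 spelling of Python 2's 001
z = struct.unpack('>h', b'\x00\x01')[0]   # unpack returns a tuple; take its only item

# Four spellings, one integer: nothing records how each was written.
assert w == x == y == z == 1
assert type(w) is type(z) is int

# Going back and forth between the integer and its decimal representation,
# much like encode() and decode() for text.
decimal_text = str(w)
assert int(decimal_text) == w
```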

I finally understand, much appreciated.



