byte count unicode string
willie
willie at jamots.com
Wed Sep 20 20:38:44 EDT 2006
>willie wrote:
>>
>> Thanks for the thorough explanation. One last question
>> about terminology then I'll go away :)
>> What is the proper way to describe "ustr" below?
>> >>> ustr = buf.decode('UTF-8')
>> >>> type(ustr)
>> <type 'unicode'>
>> Is it a "unicode object that contains a UTF-8 encoded
>> string object?"
John Machin:
>No. It is a Python unicode object, period.
>
>1. If it did contain another object you would be (quite justifiably)
>screaming your peripherals off about the waste of memory :-)
>2. You don't need to concern yourself with the internals of a unicode
>object; however rest assured that it is *not* stored as UTF-8 -- so if
>you are hoping for a quick "number of UTF-8 bytes without actually
>producing a str object" method, you are out of luck.
>
>Consider this example: you have a str object which contains some
>Russian text, encoded in cp1251.
>
>str1 = russian_text
>unicode1 = str1.decode('cp1251')
>str2 = unicode1.encode('utf-8')
>unicode2 = str2.decode('utf-8')
>Then unicode2 == unicode1, repr(unicode2) == repr(unicode1), there is
>no way (without the above history) of determining how it was created --
>and you don't need to care how it was created.
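
A minimal sketch of the round trip described above, written for Python 3 as an assumption (the thread uses Python 2). In Python 3, bytes plays the role of the thread's "str" and str plays the role of "unicode". The sample text is hypothetical. It also shows that the only way to get a UTF-8 byte count is to encode and measure the result.

```python
# Python 3 sketch: bytes ~ Python 2 "str", str ~ Python 2 "unicode".
# Hypothetical sample text: the Russian word "Privet" (6 Cyrillic letters),
# written with escapes so the source file stays ASCII.
russian_text = "\u041f\u0440\u0438\u0432\u0435\u0442"

str1 = russian_text.encode("cp1251")   # byte string encoded in cp1251
unicode1 = str1.decode("cp1251")       # characters, no encoding attached
str2 = unicode1.encode("utf-8")        # byte string encoded in UTF-8
unicode2 = str2.decode("utf-8")        # characters again

# The two unicode objects are indistinguishable, whatever their history.
same = (unicode2 == unicode1) and (repr(unicode2) == repr(unicode1))

# Character count versus byte counts: bytes depend on the chosen encoding.
char_count = len(unicode1)
cp1251_bytes = len(str1)
utf8_bytes = len(str2)   # obtained by encoding first, then measuring

print(same, char_count, cp1251_bytes, utf8_bytes)
```

Each Cyrillic letter here takes one byte in cp1251 and two bytes in UTF-8, so the byte counts differ while the character count stays the same.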
Gabriel Genellina:
>ustr is a unicode object. Period. A unicode object contains
>characters (not bytes).
>buf, apparently, is a string - a string of bytes. Those bytes
>apparently represent some unicode characters encoded using the UTF-8
>encoding. So, you can decode them -using the decode() method- to get
>the unicode object.
>
>Very roughly, the difference is like that of an integer and its
>representations:
>import struct
>w = 1
>x = 0x0001
>y = 001
>z = struct.unpack('>h','\x00\x01')[0]  # unpack() returns a tuple
>All four objects are the *same* integer, 1.
>There is no way of knowing *how* the integer was spelled, i.e., which
>representation it came from - like the unicode object, it has
>no "encoding" by itself.
>You can go back and forth between an integer number and its decimal
>representation - like astring.decode() and ustring.encode()
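
The integer analogy above can be sketched in Python 3 (an assumption; the thread uses Python 2, where the literal 001 is legal octal). Here int("001") stands in for that literal, and the names text, back and packed are illustrative only.

```python
import struct

# Four spellings of the same integer.
w = 1
x = 0x0001
y = int("001")                             # Python 3 rejects the literal 001
z = struct.unpack(">h", b"\x00\x01")[0]    # unpack() returns a tuple

# Once built, the integers carry no trace of how they were spelled.
all_equal = (w == x == y == z)

# Going back and forth between the integer and its representations,
# like ustring.encode() and astring.decode().
text = str(w)                   # decimal representation
back = int(text)                # parse it back into an integer
packed = struct.pack(">h", w)   # big-endian 16-bit representation

print(all_equal, text, back, packed)
```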
I finally understand, much appreciated.