byte count unicode string

Wed Sep 20 19:57:31 EDT 2006

willie wrote:
>
> Thanks for the thorough explanation. One last question
> about terminology then I'll go away :)
> What is the proper way to describe "ustr" below?
>
>  >>> ustr = buf.decode('UTF-8')
>  >>> type(ustr)
> <type 'unicode'>
>
>
> Is it a "unicode object that contains a UTF-8 encoded
> string object?"

No. It is a Python unicode object, period.

1. If it did contain another object you would be (quite justifiably)
screaming your peripherals off about the waste of memory :-)
2. You don't need to concern yourself with the internals of a unicode
object; however, rest assured that it is *not* stored as UTF-8 -- so if
you are hoping for a quick "number of UTF-8 bytes without actually
producing a str object" method, you are out of luck.
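
If the UTF-8 byte count is what you are after, the straightforward
route is to encode and take the length of the result. A minimal sketch:
the names buf and ustr follow the question, the sample text is made up
for illustration, and it should run unchanged on Python 2 and on
Python 3.3 or later.

```python
# Hypothetical sample data: "naive cafe" with two accented letters,
# held as UTF-8 bytes the way buf was in the question.
buf = u"na\xefve caf\xe9".encode("utf-8")

# Decoding gives a unicode object; it has no memory of UTF-8.
ustr = buf.decode("utf-8")

# len() on the unicode object counts characters (code points here).
char_count = len(ustr)

# To count UTF-8 bytes, encode again and measure the byte string.
utf8_byte_count = len(ustr.encode("utf-8"))

print(char_count)       # 10
print(utf8_byte_count)  # 12
```

The two accented letters take two bytes each in UTF-8, which is why
the byte count is two higher than the character count.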

Consider this example: you have a str object which contains some
Russian text, encoded in cp1251.

    str1 = russian_text
    unicode1 = str1.decode('cp1251')
    str2 = unicode1.encode('utf-8')
    unicode2 = str2.decode('utf-8')

Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1). There
is no way (without the above history) of determining how unicode2 was
created, and you don't need to care how it was created.
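
For anyone who wants to try this, here is a self-contained version of
the round trip. It is only a sketch: russian_text is a made-up sample
(the word "Privet" in Cyrillic), built from escapes so that the snippet
does not depend on the source file's encoding. It should run on
Python 2 and on Python 3.3 or later.

```python
# Hypothetical sample: six Cyrillic letters, encoded as cp1251 bytes.
russian_text = u"\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

str1 = russian_text
unicode1 = str1.decode("cp1251")   # bytes -> unicode via cp1251
str2 = unicode1.encode("utf-8")    # unicode -> bytes via UTF-8
unicode2 = str2.decode("utf-8")    # bytes -> unicode via UTF-8

# The two unicode objects are indistinguishable.
same_value = unicode2 == unicode1
same_repr = repr(unicode2) == repr(unicode1)

# The byte strings differ: cp1251 uses one byte per Cyrillic letter,
# UTF-8 uses two.
print(len(str1))  # 6
print(len(str2))  # 12
```

The byte strings str1 and str2 are different, but the unicode objects
decoded from them are equal.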

HTH,
John