unicode, bytes redux

Walter Dörwald walter at livinglogic.de
Mon Sep 25 12:50:20 EDT 2006


Steven D'Aprano wrote:
> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
> 
>> willie <willie at jamots.com> writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and calculates
>>> the number of bytes that make up the character
>>> according to the encoding)
>> Duncan Booth explains why that doesn't work.  But I don't see any big
>> problem with a byte count function that lets you specify an encoding:
>>
>>      u = buf.decode('UTF-8')
>>      # ... later ...
>>      u.bytes('UTF-8') -> 3
>>      u.bytes('UCS-4') -> 4
>>
>> That avoids creating a new encoded string in memory, and for some
>> encodings, avoids having to scan the unicode string to add up the
>> lengths.
> 
> Unless I'm misunderstanding something, your bytes code would have to
> perform exactly the same algorithmic calculations as converting the
> encoded string in the first place, except it doesn't need to store the
> newly encoded string, merely the number of bytes of each character.
> 
> Here is a bit of pseudo-code that might do what you want:
> 
> def bytes(unistring, encoding):
>     length = 0
>     for c in unistring:
>         length += len(c.encode(encoding))
>     return length

That wouldn't work for stateful encodings, where each separate encode() call restarts the stream state (for UTF-16, every call re-emits the byte order mark):

>>> len(u"abc".encode("utf-16"))
8
>>> bytes(u"abc", "utf-16")
12
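The extra four bytes come from the BOM: each standalone encode() of a single character starts a fresh UTF-16 stream and prepends its own 2-byte byte order mark. A quick illustration:

```python
# Each single-character encode yields BOM (2 bytes) + code unit (2 bytes).
chunks = ["a".encode("utf-16"), "b".encode("utf-16"), "c".encode("utf-16")]
print([len(chunk) for chunk in chunks])  # [4, 4, 4] -> sums to 12
print(len("abc".encode("utf-16")))       # 8: one BOM for the whole string
```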

Use a stateful encoder instead:

import codecs

def bytes(unistring, encoding):
    # An incremental encoder keeps its state between calls, so a
    # stream-level prefix such as the UTF-16 BOM is emitted only
    # once, for the first character.
    length = 0
    enc = codecs.getincrementalencoder(encoding)()
    for c in unistring:
        length += len(enc.encode(c))
    return length
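The same incremental approach still works in current Python 3; a minimal self-contained sketch (the name bytes_len is mine, chosen to avoid shadowing the built-in bytes type):

```python
import codecs

def bytes_len(unistring, encoding):
    # The incremental encoder keeps state across calls, so a
    # stream prefix like the UTF-16 BOM is counted only once.
    enc = codecs.getincrementalencoder(encoding)()
    return sum(len(enc.encode(c)) for c in unistring)

# Agrees with encoding the whole string at once:
print(bytes_len("abc", "utf-16"))    # same as len("abc".encode("utf-16"))
print(bytes_len("\u270c", "utf-8"))  # the U+270C example from above: 3
```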

Servus,
   Walter
