unicode, bytes redux

Steven D'Aprano steve at REMOVEME.cybersource.com.au
Mon Sep 25 05:16:58 EDT 2006


On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:

> willie <willie at jamots.com> writes:
>> # U+270C
>> # 11100010 10011100 10001100
>> buf = "\xE2\x9C\x8C"
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes() -> 3
>> 
>> (goes through each code point and calculates
>> the number of bytes that make up the character
>> according to the encoding)
> 
> Duncan Booth explains why that doesn't work.  But I don't see any big
> problem with a byte count function that lets you specify an encoding:
> 
>      u = buf.decode('UTF-8')
>      # ... later ...
>      u.bytes('UTF-8') -> 3
>      u.bytes('UCS-4') -> 4
> 
> That avoids creating a new encoded string in memory, and for some
> encodings, avoids having to scan the unicode string to add up the
> lengths.

Unless I'm misunderstanding something, your bytes() method would have to
perform exactly the same algorithmic work as encoding the string in the
first place, except that it doesn't need to store the newly encoded
string, merely the number of bytes contributed by each character.

Here is a bit of pseudo-code that might do what you want:

def bytes(unistring, encoding):
    # Encode one character at a time and keep a running total,
    # so the fully encoded string is never held in memory.
    length = 0
    for c in unistring:
        length += len(c.encode(encoding))
    return length
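
For Willie's U+270C example above, a quick check at the interactive
prompt (Python 2) gives the expected counts:

>>> u = u"\u270c"
>>> bytes(u, 'UTF-8')
3
>>> bytes(u, 'UTF-16-BE')
2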

At the cost of some speed, you can avoid storing the entire encoded string
in memory, which might be what you want if you are dealing with truly
enormous unicode strings.

Alternatively, instead of calling encode() on each character, you could
write a function (presumably in C for speed) that does exactly the same
thing as encode(), but without storing the encoded characters, merely
adding up their lengths. Now you have code duplication, which is usually a
bad idea, if for no other reason than that some poor schmuck has to
maintain them both! (And I bet it won't be Willie, for all his enthusiasm
for the idea.)
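
If you do want to avoid that duplication without dropping into C, the
codecs module offers a middle road: an incremental encoder reuses the
codec's real machinery, handles stateful encodings correctly (a UTF-16
BOM is emitted once, not per character), and never holds more than one
character's worth of output. Here is a rough sketch, assuming Python
2.5's codecs.getincrementalencoder; byte_length is just my name for it:

import codecs

def byte_length(unistring, encoding):
    # Reuse the codec's own incremental encoder, so stateful
    # encodings (e.g. a UTF-16 BOM) are counted correctly.
    encoder = codecs.getincrementalencoder(encoding)()
    total = 0
    for c in unistring:
        total += len(encoder.encode(c))
    # Flush any state the encoder may still be buffering.
    total += len(encoder.encode(u'', final=True))
    return total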

This whole question seems to me like an awful example of premature
optimization. Your computer probably has well in excess of 100MB of
memory, and you're worried about duplicating a few hundred or thousand (or
even a few hundred thousand) bytes for a few milliseconds (just long
enough to grab the length)?
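
After all, the straightforward version that all this machinery is trying
to optimize away is a one-liner:

# Encode, grab the length, and let the temporary encoded string
# be garbage-collected a moment later.
nbytes = len(u.encode('UTF-8'))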


-- 
Steven D'Aprano 



