unicode, bytes redux

Mon Sep 25 03:45:29 EDT 2006

willie <willie at jamots.com> writes:
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> u = buf.decode('UTF-8')
> # ... later ...
> u.bytes() -> 3
> 
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Duncan Booth explains why that doesn't work.  But I don't see any big
problem with a byte count function that lets you specify an encoding:

     u = buf.decode('UTF-8')
     # ... later ...
     u.bytes('UTF-8') -> 3
     u.bytes('UCS-4') -> 4

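That avoids creating a new encoded string in memory just to measure
it, and for fixed-width encodings such as UCS-4 the count can come
straight from the string's length, with no need to scan the unicode
string and add up per-character lengths.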
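
For what it's worth, here is a rough sketch of such a function as a
standalone helper. The name encoded_len is made up for illustration and
is not an existing method or library call. It assumes Python 3, where a
str is a sequence of code points, and it handles only UTF-8 and a
BOM-less UCS-4; it makes no attempt to reject lone surrogates.

```python
def encoded_len(text, encoding):
    """Count the bytes text would need in encoding, without encoding it.

    Hypothetical helper: supports only UTF-8 and BOM-less UCS-4.
    """
    name = encoding.lower().replace("_", "-")
    if name == "ucs-4":
        # Fixed width: four bytes per code point, so no scan is needed.
        return 4 * len(text)
    if name in ("utf-8", "utf8"):
        total = 0
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                total += 1      # ASCII range
            elif cp < 0x800:
                total += 2
            elif cp < 0x10000:
                total += 3      # rest of the BMP, e.g. U+270C
            else:
                total += 4      # supplementary planes
        return total
    raise ValueError("unsupported encoding: %r" % (encoding,))


u = b"\xE2\x9C\x8C".decode("UTF-8")
print(encoded_len(u, "UTF-8"))   # 3
print(encoded_len(u, "UCS-4"))   # 4
```

The UTF-8 branch still walks the string once but never builds the
encoded copy; the UCS-4 branch does neither.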