unicode, bytes redux
Walter Dörwald
walter at livinglogic.de
Mon Sep 25 12:50:20 EDT 2006
Steven D'Aprano wrote:
> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
>
>> willie <willie at jamots.com> writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and calculates
>>> the number of bytes that make up the character
>>> according to the encoding)
>> Duncan Booth explains why that doesn't work. But I don't see any big
>> problem with a byte count function that lets you specify an encoding:
>>
>> u = buf.decode('UTF-8')
>> # ... later ...
>> u.bytes('UTF-8') -> 3
>> u.bytes('UCS-4') -> 4
>>
>> That avoids creating a new encoded string in memory, and for some
>> encodings, avoids having to scan the unicode string to add up the
>> lengths.
>
> Unless I'm misunderstanding something, your bytes code would have to
> perform exactly the same algorithmic calculations as converting the
> encoded string in the first place, except it doesn't need to store the
> newly encoded string, merely the number of bytes of each character.
>
> Here is a bit of pseudo-code that might do what you want:
>
> def bytes(unistring, encoding):
>     length = 0
>     for c in unistring:
>         length += len(c.encode(encoding))
>     return length
That wouldn't work for stateful encodings:
>>> len(u"abc".encode("utf-16"))
8
>>> bytes(u"abc", "utf-16")
12
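The mismatch is easy to see once you look at the bytes: encoding each character separately makes the codec emit a fresh UTF-16 byte order mark before every character, so each one costs 4 bytes instead of 2. A minimal sketch (in Python 3 syntax, where all str objects are unicode):

```python
# Encoding the whole string emits one BOM (2 bytes) plus 2 bytes per char.
whole = "abc".encode("utf-16")
print(len(whole))  # 8

# Encoding character by character emits a BOM before *each* character,
# so every character costs 4 bytes instead of 2.
per_char = sum(len(c.encode("utf-16")) for c in "abc")
print(per_char)  # 12
```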
Use a stateful encoder instead:
import codecs

def bytes(unistring, encoding):
    length = 0
    enc = codecs.getincrementalencoder(encoding)()
    for c in unistring:
        length += len(enc.encode(c))
    return length
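Because a single encoder instance keeps the codec state across calls, the BOM is emitted only on the first call, and the per-character lengths now sum to the one-shot result. A quick check of that behaviour (Python 3 syntax):

```python
import codecs

# One encoder instance keeps the UTF-16 state across calls.
enc = codecs.getincrementalencoder("utf-16")()

# First call emits the BOM plus the character: 4 bytes.
print(len(enc.encode("a")))  # 4
# Subsequent calls reuse the encoder state: 2 bytes each.
print(len(enc.encode("b")))  # 2
print(len(enc.encode("c")))  # 2
# Total: 8, matching len(u"abc".encode("utf-16")).
```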
Servus,
Walter