unicode, bytes redux
Fredrik Lundh
fredrik at pythonware.com
Mon Sep 25 13:50:17 EDT 2006
willie wrote:
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)
what about:
buf = "\xE2\x9C\x8C"
u = buf.decode("utf-8")
bytes = len(buf)
# ... later ...
print bytes # -> 3
or even
class utf8string(unicode):
    def __new__(cls, data):
        return unicode.__new__(cls, data, "utf-8")
    def __init__(self, data):
        self.bytes = len(data)
buf = "\xE2\x9C\x8C"
u = utf8string(buf)
# ... later ...
print repr(u)
print u.bytes # -> 3
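
For comparison, a minimal sketch of the same idea, assuming Python 3 (where str is the Unicode type and bytes is the byte string) rather than the Python 2 used above. The names utf8_len, utf8_len_by_code_point, Utf8String and the nbytes attribute are made up for illustration. The first helper simply re-encodes, the second walks the code points as the quoted message describes, and the class remembers the length of the original byte string.

```python
# Sketch for Python 3; every name below is hypothetical, not from the post.

def utf8_len(text):
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


def utf8_len_by_code_point(text):
    """Same count, computed per code point from the UTF-8 length ranges."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1      # U+0000..U+007F
        elif cp < 0x800:
            total += 2      # U+0080..U+07FF
        elif cp < 0x10000:
            total += 3      # U+0800..U+FFFF
        else:
            total += 4      # U+10000..U+10FFFF
    return total


class Utf8String(str):
    """str subclass that remembers the size of the UTF-8 data it came from."""

    def __new__(cls, data):
        # data is assumed to be a bytes object holding valid UTF-8
        self = super().__new__(cls, data.decode("utf-8"))
        self.nbytes = len(data)
        return self


buf = b"\xE2\x9C\x8C"  # U+270C
u = Utf8String(buf)
print(u.nbytes, utf8_len(u), utf8_len_by_code_point(u))
```

Note that the stored count only describes the original object: slicing or concatenating a Utf8String gives back a plain str without nbytes, so re-encoding is the safer choice whenever the string may have been modified.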
</F>