unicode, bytes redux

Mon Sep 25 13:50:17 EDT 2006

willie wrote:

> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
> 
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
> 
> u = buf.decode('UTF-8')
> 
> # ... later ...
> 
> u.bytes() -> 3
> 
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

what about:

     buf = "\xE2\x9C\x8C"
     u = buf.decode("utf-8")
     bytes = len(buf)

     # ... later ...

     print bytes # -> 3

or even

     class utf8string(unicode):
         def __new__(cls, data):
             return unicode.__new__(cls, data, "utf-8")
         def __init__(self, data):
             self.bytes = len(data)

     buf = "\xE2\x9C\x8C"

     u = utf8string(buf)

     # ... later ...

     print repr(u)
     print u.bytes # -> 3
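
for reference, here is a rough sketch of the same idea for Python 3, where str is the unicode type and the raw buffer is a bytes object. the class name Utf8String is made up for illustration, and the last line is only a sanity check that re-encoding gives back the same count for well-formed input:

```python
class Utf8String(str):
    """Text decoded from UTF-8 that remembers its original byte length."""

    def __new__(cls, data):
        # data is a bytes object; let str decode it as UTF-8
        return super().__new__(cls, data, "utf-8")

    def __init__(self, data):
        # remember how many bytes the encoded form occupied
        self.bytes = len(data)


buf = b"\xE2\x9C\x8C"  # U+270C encoded as UTF-8

u = Utf8String(buf)

# ... later ...

print(repr(u))
print(u.bytes)                 # 3 bytes in the original buffer
print(len(u))                  # 1 code point
print(len(u.encode("utf-8")))  # re-encoding yields the same byte count
```

note that the count lives on that one object only: slicing or concatenating returns a plain str without the bytes attribute, so anything derived from u has to be re-encoded to get its size.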

</F>