unicode, bytes redux

Mon Sep 25 04:17:47 EDT 2006

willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?

Where the string has been is irrelevant. Where it's going is what matters:
the byte count depends on the encoding you write it out in.

> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Suppose the unicode object was decoded using some encoding other than
the one that's going to be used to store the info in the database:

| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> len(sg)
| 4
| >>> u = sg.decode('gb2312')

Later, under your proposal:

u.bytes() => 4

but encoding the same object as UTF-8 gives a different count:

| >>> len(u.encode('utf8'))
| 6
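
To make that concrete, here is a minimal sketch in Python 3 syntax (the
sessions above are Python 2, where sg would be a plain str). The helper
name encoded_length is made up for illustration and is not an existing
method; it simply measures the text against whichever encoding it is
headed for.

```python
# Python 3 sketch: bytes/str here correspond to Python 2's str/unicode.
sg = b'\xc9\xb5\xb9\xcf'       # the four GB2312 bytes from the session above
u = sg.decode('gb2312')        # two code points

def encoded_length(text, encoding):
    """Hypothetical helper: bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

# Same two code points, different byte counts depending on the destination.
sizes = {enc: encoded_length(u, enc) for enc in ('gb2312', 'utf-8', 'utf-16-le')}
print(sizes)
```

The count follows the target codec, so it has to be worked out at encode
time regardless of anything the object might remember.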

And by the way, what about the memory overhead of storing the name of
the encoding on every unicode object? In the above case that is 7 bytes
('gb2312' is 6 characters, plus overhead).

What would u"abcdef".bytes() produce, given that a literal was never
decoded from anything? An exception?
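
A related sketch, again in Python 3 syntax and only as an illustration:
the same text can come from a literal or from byte strings in different
encodings, and the resulting objects compare equal. The variable names
are mine; the point is that a remembered source encoding would make
equal objects report different sizes.

```python
# Python 3 sketch: str here plays the role of Python 2's unicode type.
literal = "abcdef"                               # never decoded from bytes
ascii_bytes = b"abcdef"
utf16_bytes = "abcdef".encode("utf-16-le")

from_ascii = ascii_bytes.decode("ascii")
from_utf16 = utf16_bytes.decode("utf-16-le")

# All three are the same text, whatever route they took to get here.
all_equal = literal == from_ascii == from_utf16

# Yet the byte strings they might have come from differ in length.
source_sizes = [len(ascii_bytes), len(utf16_bytes)]
print(all_equal, source_sizes)
```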

HTH,
John