unicode, bytes redux

John Roth JohnRoth1 at jhrothjr.com
Mon Sep 25 10:03:49 EDT 2006


willie wrote:
> (beating a dead horse)
>
> Is it too ridiculous to suggest that it'd be nice
> if the unicode object were to remember the
> encoding of the string it was decoded from?
> So that it's feasible to calculate the number
> of bytes that make up the unicode code points.
>
> # U+270C
> # 11100010 10011100 10001100
> buf = "\xE2\x9C\x8C"
>
> u = buf.decode('UTF-8')
>
> # ... later ...
>
> u.bytes() -> 3
>
> (goes through each code point and calculates
> the number of bytes that make up the character
> according to the encoding)

Yup, it's a dead horse. As suggested elsewhere in
the thread, the unicode object is not the proper
place for this functionality. Also, as suggested,
it's not even the desired functionality: what's really
wanted is the ability to tell how long the string
is going to be in various encodings.

That's easy enough to do today - just encode the
darn thing and use len() on the result. I don't see
any reason to expand the language to support a
database product that goes out of its way to make
it difficult for developers.
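
A minimal sketch of the encode-and-len() approach, using
the U+270C character from the quoted example. The helper
name encoded_length is hypothetical, made up here for
illustration; it is not part of the unicode object or
the standard library.

```python
def encoded_length(text, encoding):
    """Return the number of bytes `text` occupies once encoded.

    Hypothetical helper: it just encodes and measures the result.
    """
    return len(text.encode(encoding))


# U+270C, the same code point as in the quoted example.
u = u"\u270c"

utf8_len = encoded_length(u, "utf-8")       # 3 bytes: E2 9C 8C
utf16_len = encoded_length(u, "utf-16-le")  # 2 bytes, no BOM
utf32_len = encoded_length(u, "utf-32-le")  # 4 bytes, no BOM
```

The answer depends on the target encoding, not on whatever
encoding the bytes were originally decoded from, so there
is nothing useful for the unicode object to remember.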

John Roth



