python-unicode doesn't support >65535 symbols?
Martin v. Löwis
martin at v.loewis.de
Thu Nov 27 14:12:06 EST 2003
"Rainer Deyke" <rainerd at eldwood.com> writes:
> Python makes the mistake of exposing the internal representation instead of
> the logical value of unicode objects. This means that, aside from space
> optimization, unicode objects have no advantage over UTF-8 encoded plain
> strings for storing unicode text.
That is not true. First, it is not "Python" as such, but a specific
Python configuration: a "wide Unicode" build uses UCS-4 internally,
while a "narrow" build uses 16-bit code units. In either build, len()
and indexing address code units, not characters: that much is true.
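
As a rough illustration of the code-unit point (not part of the original
message): sys.maxunicode reports which kind of build is running, 0xFFFF
on a narrow build and 0x10FFFF on a wide one. The sketch below assumes
Python 3.3 or later, where the narrow/wide distinction no longer exists,
so it counts code units by encoding the text rather than by calling
len() on a narrow build. The helper names are made up for this example.

```python
import sys

def utf16_code_units(text):
    """Number of 16-bit code units needed for text (narrow-build view)."""
    return len(text.encode("utf-16-le")) // 2

def utf32_code_units(text):
    """Number of 32-bit code units needed for text (wide-build view)."""
    return len(text.encode("utf-32-le")) // 4

# 0xFFFF on a narrow build, 0x10FFFF on a wide build (and on Python 3.3+).
print(hex(sys.maxunicode))

bmp_text = "L\u00f6wis"        # BMP characters only
astral_text = "a\U00010400"    # 'a' plus U+10400, which lies outside the BMP

# For BMP-only text both counts agree with the number of characters.
print(utf16_code_units(bmp_text), utf32_code_units(bmp_text))

# Outside the BMP, the 16-bit count is larger than the character count.
print(utf16_code_units(astral_text), utf32_code_units(astral_text))
```

The 16-bit count is what len() would report on a narrow build; the
32-bit count matches the number of characters.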
However, it is not true that there is no advantage over UTF-8 encoded
byte strings. Instead, there are several advantages:
- In a UCS-4 build, Unicode characters and code units are in a 1:1
relationship.
- In a UCS-2 build, Unicode characters and code units are in a 1:1
relationship as long as the application only ever processes BMP
characters.
- In either case, a Unicode object has inherent information about the
character set, which a UTF-8 byte string does not have. IOW, you know
what a Unicode object is, but you don't know (inherently) whether a
byte string is UTF-8.
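
The last point can be sketched in a few lines (an illustration added
here, assuming Python 3, where text and byte strings are separate
types): the same bytes decode without error under two different
encodings, and nothing in the byte string itself says which one was
intended.

```python
text = "L\u00f6wis"            # a text (Unicode) object: 5 characters
data = text.encode("utf-8")    # a byte string: 6 bytes, no encoding attached

# Both decodes succeed; only the UTF-8 one gives back the original text.
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")

print(len(text), len(data))
print(as_utf8 == text, as_latin1 == text)
```

The Latin-1 decode yields a different string of six characters, which is
the kind of silent error that an object with a known character set
avoids.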
Regards,
Martin