[I18n-sig] How does Python Unicode treat surrogates?
Walter Dörwald
walter@livinglogic.de
Tue, 26 Jun 2001 11:56:49 +0200
Fredrik Lundh wrote:
>
> guido wrote:
>
> [...]
> > If we make a clean distinction between characters and storage units,
> > and if we stick to the rule that u[i] accesses a storage unit, what's the
> > conceptual difficulty?
>
> I'm sceptical -- I see very little reason to maintain that distinction.
> let's use either UCS-2 or UCS-4 for the internal storage, stick to the
> "character strings are character sequences" concept, and keep the
> UTF-16 surrogate issue where it belongs: in the codecs.
Exactly!
Using UTF-16 as the internal storage and defining new methods for
accessing characters instead of code units essentially means implementing
half a new string type. We'd have to duplicate every method Unicode objects
provide now. It would be two string type APIs combined in one type.
Do we really need 2 1/2 string types?
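
To make the duplication concrete, here is a minimal sketch of what a
character-level accessor on top of UTF-16 code units would have to do. It is
an illustration only, not code from any proposal: the helper names
utf16_code_units and char_at are hypothetical, and the UTF-16 storage is
simulated with a list of integers, because current Python 3 strings already
index by code point.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def char_at(units, index):
    """Return the code point of the index-th character (hypothetical helper).

    Unlike units[index], this has to scan from the start, because a
    character outside the BMP occupies two code units (a surrogate pair).
    """
    i = 0
    remaining = index
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            code_point = unit
            step = 1
        if remaining == 0:
            return code_point
        remaining -= 1
        i += step
    raise IndexError("character index out of range")


text = "a\U00010400b"           # three characters, one of them outside the BMP
units = utf16_code_units(text)  # four code units

print(len(units))               # number of code units
print(hex(units[1]))            # storage-unit access: a lone high surrogate
print(hex(char_at(units, 1)))   # character access: the full code point
```

Indexing is only one method. Slicing, searching, length and the rest would
presumably each need the same code-unit and character variants, which is the
"half a new string type" referred to above.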
Bye,
Walter Dörwald