[I18n-sig] How does Python Unicode treat surrogates?

Walter Dörwald walter@livinglogic.de
Tue, 26 Jun 2001 11:56:49 +0200


Fredrik Lundh wrote:
> 
> guido wrote:
> 
> [...]
> > If we make a clean distinction between characters and storage units,
> > and if we stick to the rule that u[i] accesses a storage unit, what's the
> > conceptual difficulty?
> 
> I'm sceptical -- I see very little reason to maintain that distinction.
> let's use either UCS-2 or UCS-4 for the internal storage, stick to the
> "character strings are character sequences" concept, and keep the
> UTF-16 surrogate issue where it belongs: in the codecs.

Exactly!

Using UTF-16 as the internal storage and defining new methods for
accessing characters instead of code units essentially means
implementing half a new string type. We'd have to duplicate every
method Unicode objects provide now. It would be two string type APIs
combined in one type.

Do we really need 2 1/2 string types?
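
To make the duplication concrete, here is a rough sketch of what the
"second API" would look like for just one operation, indexing. It is
written against a current Python 3 rather than the interpreter under
discussion, and the helper names utf16_code_units and char_at are made
up for illustration; nothing like them is proposed anywhere.

```python
# Illustrative sketch only: helper names are hypothetical, and the
# UTF-16 storage is simulated with a plain list of 16-bit integers.
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def char_at(units, index):
    """Return the code point of the index-th character.

    Surrogate pairs are combined into one character, so this has to
    scan from the start instead of indexing directly.
    """
    pos = 0
    seen = 0
    while pos < len(units):
        unit = units[pos]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and pos + 1 < len(units)
            and 0xDC00 <= units[pos + 1] <= 0xDFFF
        )
        if is_pair:
            low = units[pos + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            step = 2
        else:
            code_point = unit
            step = 1
        if seen == index:
            return code_point
        seen += 1
        pos += step
    raise IndexError("character index out of range")


text = "a\U00010400b"            # three characters, one outside the BMP
units = utf16_code_units(text)   # four storage units

print(hex(units[2]))             # storage-unit view: 0xdc00, a lone low surrogate
print(hex(char_at(units, 2)))    # character view: 0x62, the letter "b"
```

The same split would be needed for len(), slicing, find(), and so on:
one version counting storage units and one counting characters.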

Bye,
	Walter Dörwald