[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 18:37:32 +0200


> > That was my original idea. I later thought having a count of surrogate
> > pairs would be better, since it allows len() to be computed in constant
> > time. Indexing would be linear time only for strings containing
> > surrogates, and constant time otherwise.
> 
> Just so I understand: the codec will set this flag/length when it
> transcodes to the internal representation?

Depends on how it is written. At the C level, the codec could provide a
surrogate count when creating a string, or it could pass -1, in which
case the implementation would count the surrogates itself. At the Python
level, there would be no interface for finding out the number of
surrogates, or for setting it. Instead, unichr invocations with
arguments above 0xffff would set the count.
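
To make the idea concrete, here is a rough Python model of it. This is only
a sketch, not the proposed C implementation: the class name
SurrogateCountedString, the helper encode_utf16, and the pair_count
argument are all invented for illustration, and the input is assumed to be
well-formed UTF-16 (every high surrogate is followed by a low surrogate).

```python
def is_high_surrogate(unit):
    """Return True if a 16-bit code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def encode_utf16(code_points):
    """Hypothetical helper: turn code points into a list of UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units


class SurrogateCountedString:
    """Illustrative model of a string that stores its surrogate-pair count."""

    def __init__(self, units, pair_count=-1):
        self.units = list(units)
        if pair_count < 0:
            # -1 means "unknown": count the pairs once, at creation time.
            pair_count = sum(1 for u in self.units if is_high_surrogate(u))
        self.pair_count = pair_count

    def __len__(self):
        # Constant time: each pair occupies two code units but is one character.
        return len(self.units) - self.pair_count

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("string index out of range")
        if self.pair_count == 0:
            # No surrogates: direct, constant-time indexing.
            return self.units[index]
        # Surrogates present: walk the code units, which is linear time.
        pos = 0
        for _ in range(index):
            pos += 2 if is_high_surrogate(self.units[pos]) else 1
        high = self.units[pos]
        if is_high_surrogate(high):
            low = self.units[pos + 1]
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return high
```

With this model, a caller that already knows the count (such as a codec)
can pass it in, and any other caller passes -1 and pays for one counting
pass at creation. Indexing returns code points as integers here simply to
keep the sketch short.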

Regards,
Martin