[I18n-sig] Unicode surrogates: just say no!

Tom Emerson tree@basistech.com
Tue, 26 Jun 2001 10:39:51 -0400


Martin v. Loewis writes:
> > Martin has hinted at a solution requiring even less memory per string
> > object, but I don't know for sure what he is thinking of.  All I can
> > imagine is a single flag saying "this string contains no surrogates".
> 
> That was my original idea. I later thought that having a count of
> surrogate pairs would be better, since it allows len() to be computed
> in constant time. Indexing would be linear time only for strings
> containing surrogates, and constant time otherwise.

Just so I understand: the codec will set this flag/count when it
transcodes to the internal representation?
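
To make the counting idea concrete, here is a minimal sketch in Python,
assuming the internal representation is a sequence of UTF-16 code units
and that the pair count is computed once at transcoding time. The names
PairCountedString and to_utf16_units are hypothetical and exist only for
illustration; this is not how any Python release stores strings.

```python
import struct


def to_utf16_units(text):
    """Stand-in for the codec: transcode text to a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


class PairCountedString:
    """Hypothetical string object: UTF-16 code units plus a surrogate-pair count."""

    def __init__(self, text):
        self.units = to_utf16_units(text)
        # Counted once, when the "codec" builds the object. The input is
        # well-formed UTF-16, so every high surrogate starts exactly one pair.
        self.pairs = sum(1 for u in self.units if is_high_surrogate(u))

    def __len__(self):
        # Constant time: each pair occupies two units but is one character.
        return len(self.units) - self.pairs

    def __getitem__(self, index):
        # Negative indices are not supported in this sketch.
        if not 0 <= index < len(self):
            raise IndexError("string index out of range")
        if self.pairs == 0:
            # No surrogates: character index equals code-unit index.
            return chr(self.units[index])
        # Surrogates present: walk from the start, skipping two units per pair.
        pos = 0
        for _ in range(index):
            pos += 2 if is_high_surrogate(self.units[pos]) else 1
        high = self.units[pos]
        if is_high_surrogate(high):
            low = self.units[pos + 1]
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        return chr(high)
```

A plain flag would give the same fast path for indexing, but len() would
still need a scan whenever the flag says surrogates are present.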

> [on sre]
> > There are two parts to this: the internal
> > engine needs to realize that e.g. "." and certain "[...]" sets may
> > match a surrogate pair, and the indices returned by e.g. the span()
> > method of match objects should be translated to character indices as
> > expected by the applications.
> 
> For character classes, it may be acceptable to require that they
> contain only BMP characters; span would use the conversion macros, and
> . would need special casing. I agree this is terrible, but it could work.

UTR #18 describes the impact of surrogates on regular expressions.

http://www.unicode.org/unicode/reports/tr18/#Surrogates
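
As a rough illustration of the two parts discussed above (a "." that
consumes a whole surrogate pair, and span() indices translated from code
units to characters), here is a small Python sketch. It simulates a
UTF-16 internal representation by putting one code unit in each string
position. The helper names are hypothetical, and the pair-aware pattern
is just one possible workaround, not something taken from UTR #18 or
from sre itself.

```python
import re
import struct


def to_units_str(text):
    """Simulate UTF-16 storage: one str position per UTF-16 code unit."""
    data = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return "".join(map(chr, units))


# A naive "." matches a single code unit, so it splits a surrogate pair.
NAIVE_DOT = re.compile(r".", re.DOTALL)

# A pair-aware "." tries a high+low surrogate pair first, then any one unit.
PAIR_AWARE_DOT = re.compile(r"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def unit_to_char_index(units, unit_index):
    """Translate a code-unit index to a character index.

    Assumes unit_index does not fall in the middle of a surrogate pair.
    """
    pairs_before = sum(1 for u in units[:unit_index] if "\ud800" <= u <= "\udbff")
    return unit_index - pairs_before


def char_span(units, match):
    """Translate a match's code-unit span to a character span."""
    start, end = match.span()
    return unit_to_char_index(units, start), unit_to_char_index(units, end)
```

For the text "a", U+10400, "b" the naive pattern finds four matches and
the pair-aware one finds three; a match on "b" has the code-unit span
(3, 4), which translates to the character span (2, 3).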

> Still, exploiting the platform's wchar_t might avoid copies in some
> cases (I'm thinking of my iconv codec in particular), so that would
> give a speed-up.

Excellent point.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"