[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 22:44:05 +0200


> >   - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> >     and \U) generates a surrogate pair, where u[0] is the high
> >     surrogate value and u[1] the low surrogate value
> 
> Does this imply that ord() should take in surrogate pairs too?

Good question. IMO, it shouldn't, so ord(unichr(n)) may raise an
exception even for values of n where unichr(n) succeeds (namely
0x10000 <= n <= 0x10ffff, where unichr(n) is a string of length 2).
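
To make the asymmetry concrete, here is a minimal sketch of the proposed
narrow-build behaviour. It is an emulation, not the actual implementation:
narrow_unichr and narrow_ord are hypothetical names, and strings are
modelled as lists of 16-bit code units so the example runs on any Python.
The surrogate arithmetic itself is the standard UTF-16 encoding.

```python
# Hypothetical helpers emulating a narrow (16-bit) build; these names are
# illustrative only and are not part of any Python API.

def narrow_unichr(i):
    """Return the list of UTF-16 code units a narrow build would store."""
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point out of range")
    if i < 0x10000:
        return [i]                      # fits in a single code unit
    offset = i - 0x10000                # 20-bit value split into 10 + 10 bits
    high = 0xD800 + (offset >> 10)      # u[0]: high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # u[1]: low surrogate
    return [high, low]

def narrow_ord(units):
    """Accept exactly one code unit; a surrogate pair is rejected."""
    if len(units) != 1:
        raise TypeError("expected a single code unit, got %d" % len(units))
    return units[0]

pair = narrow_unichr(0x10400)           # succeeds, but has length 2
try:
    narrow_ord(pair)                    # so the round trip fails
    round_trip_ok = True
except TypeError:
    round_trip_ok = False
```

Under these assumptions narrow_unichr(0x10400) succeeds, yet feeding the
result back into narrow_ord raises, which is the behaviour argued for above.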

The basic rationale here is: if you need surrogates a lot, you should
use a wide unicode implementation. In a narrow unicode implementation,
a lot of surprises are likely (although each surprise should be
documented, of course).

In this specific case, there isn't even a single best solution: if ord
of a surrogate pair returned a value, you'd lose the property that
ord(s[0])==ord(s) either raises an exception or gives 1 (i.e. is true),
since ord(s[0]) would be the high surrogate while ord(s) would be the
full code point.
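
A small sketch of how that property would break, assuming a hypothetical
pair-aware variant of ord (here called pair_aware_ord, operating on a list
of 16-bit code units rather than a real narrow-build string):

```python
# pair_aware_ord is a hypothetical function used only to illustrate the
# alternative design; it is not an existing Python API.

def pair_aware_ord(units):
    """Like ord(), but also decodes a well-formed surrogate pair."""
    if len(units) == 1:
        return units[0]
    if (len(units) == 2
            and 0xD800 <= units[0] <= 0xDBFF
            and 0xDC00 <= units[1] <= 0xDFFF):
        # Standard UTF-16 decoding of a high/low surrogate pair.
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    raise TypeError("expected one code unit or one surrogate pair")

s = [0xD801, 0xDC00]                # surrogate pair for U+10400
first = pair_aware_ord(s[:1])       # the high surrogate alone
whole = pair_aware_ord(s)           # the decoded code point
property_holds = (first == whole)   # neither an exception nor true
```

With this variant the comparison neither raises nor evaluates true, so
code relying on that property would silently misbehave.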

Regards,
Martin