[I18n-sig] Unicode surrogates: just say no!
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 22:44:05 +0200
> > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> > and \U) generates a surrogate pair, where u[0] is the high
> > surrogate value and u[1] the low surrogate value
>
> Does this imply that ord() should take in surrogate pairs too?
Good question. IMO, it shouldn't, so ord(unichr(n)) may raise
exceptions, even for values of n where unichr(n) succeeds.
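To make the asymmetry concrete, here is a minimal sketch in present-day Python 3, where a str may hold lone surrogate code points. It only models what a narrow build would do: narrow_unichr is a hypothetical helper, not a real built-in, and the pair arithmetic is the standard UTF-16 encoding.

```python
def narrow_unichr(i):
    """Hypothetical model of unichr() on a narrow (16-bit) build."""
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point out of range")
    if i < 0x10000:
        # Fits in one 16-bit code unit.
        return chr(i)
    # Standard UTF-16 split: 10 high bits and 10 low bits of (i - 0x10000).
    v = i - 0x10000
    high = 0xD800 + (v >> 10)
    low = 0xDC00 + (v & 0x3FF)
    return chr(high) + chr(low)


u = narrow_unichr(0x10000)
assert len(u) == 2                                  # a surrogate pair
assert (ord(u[0]), ord(u[1])) == (0xD800, 0xDC00)   # high first, then low

try:
    ord(u)          # ord() accepts only strings of length 1
except TypeError:
    pass            # so ord(unichr(n)) raises although unichr(n) succeeded
```

So the round trip fails exactly for 0x10000 <= n <= 0x10ffff, while it still works for everything below that.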
The basic rationale here is: if you need surrogates a lot, you should
use a wide Unicode implementation. In a narrow Unicode implementation,
a lot of surprises are likely (although each surprise should be
documented, of course).
In this specific case, there isn't even a single best solution: if ord
of a surrogate pair returned a value, you'd lose the property that
ord(s[0])==ord(s) either raises an exception or gives 1 (i.e. true).
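A small sketch of that property, again in present-day Python 3 as a stand-in for a narrow build. pair_ord and check are hypothetical names introduced only for illustration; pair_ord models the alternative in which ord decodes a surrogate pair.

```python
def pair_ord(s):
    """Hypothetical ord() that also accepts a well-formed surrogate pair."""
    if (len(s) == 2
            and "\ud800" <= s[0] <= "\udbff"
            and "\udc00" <= s[1] <= "\udfff"):
        return 0x10000 + ((ord(s[0]) - 0xD800) << 10) + (ord(s[1]) - 0xDC00)
    return ord(s)  # otherwise behave like the built-in


def check(ord_func, s):
    """Evaluate ord_func(s[0]) == ord_func(s); report 'raises' on error."""
    try:
        return ord_func(s[0]) == ord_func(s)
    except (TypeError, IndexError):
        return "raises"


pair = "\ud800\udc00"           # surrogate pair for 0x10000

print(check(ord, "a"))          # True
print(check(ord, pair))         # raises
print(check(pair_ord, pair))    # False
```

With the strict ord the comparison never comes out false; with the pair-aware variant it does, because ord(s[0]) is 0xD800 while ord(s) is 0x10000.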
Regards,
Martin