[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 02:16:08 +0200


> The maximum code-point value for a Unicode character is U+10FFFF,
> hence the suggested notation above (I should have noted it as
> such). If Python is going to implement full support for ISO 10646 then
> the full 32-bit representation (and 8-digit \U escape) is
> appropriate. 

Correct me if I'm wrong, but doesn't some 10646 amendment also limit
the code range to 10FFFF (i.e. to only a part of group 0)?
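
To make the ranges concrete, here is a minimal sketch. It assumes a present-day Python 3 interpreter, and the helper name `ucs4_octets` is made up for this illustration. It splits a code point into the group, plane, row and cell octets of the four-octet 10646 form, and computes the largest value a UTF-16 surrogate pair can reach.

```python
def ucs4_octets(cp):
    """Split a code point into its (group, plane, row, cell) octets."""
    return (cp >> 24) & 0xFF, (cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF


MAX_UNICODE = 0x10FFFF

# U+10FFFF sits in group 0, plane 0x10: only a small part of group 0.
print(ucs4_octets(MAX_UNICODE))

# Largest value reachable with a surrogate pair (high DBFF, low DFFF).
max_utf16 = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)
print(hex(max_utf16))
```

The second value comes out equal to MAX_UNICODE, which is consistent with the 10FFFF limit being the range that UTF-16 can address.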

> If you limit the maximum size of the character escape so that the
> scanner catches improper character sizes you save grief for the
> end-user, IMHO.

I think Python should still use the \UXXXXXXXX notation, as do C and
C++, even though the first two digits will always be 00.
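
As a small illustration of the eight-digit escape, the sketch below assumes a present-day Python 3 interpreter (its behaviour may differ from the Python under discussion here). The variable names are illustrative only.

```python
# The \U escape always takes exactly eight hex digits; for any valid
# code point the first two digits are 00.
deseret = "\U00010400"      # a character outside the BMP
highest = "\U0010FFFF"      # the highest code point

print(hex(ord(deseret)), hex(ord(highest)))

# \u takes exactly four digits and therefore only covers the BMP.
assert "\u00e9" == "\U000000e9"

# Values above 0x10FFFF are rejected when built at run time.
try:
    chr(0x110000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```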

> I understand O(n) and O(1) perfectly well. My point is that you do not
> have to scan the entire string when doing this indexing. You only need
> to look at most one storage unit on either side of the index. We're
> only concerned here with transparently handling surrogates when the
> underlying representation is UTF-16.

Please think carefully. What if you are looking up index 20, but you
have a surrogate pair at words 10 and 11? Then you should take word
21 instead of word 20, no? How are you going to find that out in
constant time?
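
The scenario can be sketched as follows. This is only an illustration, assuming a present-day Python 3 interpreter; the helpers `utf16_units` and `unit_offset` are hypothetical names, and the list of integers stands in for a UTF-16 buffer. Without an auxiliary index, finding the word that holds character 20 means walking over every word before it.

```python
def utf16_units(text):
    """Return the UTF-16 code units (16-bit words) of text as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def unit_offset(units, char_index):
    """Map a character index to a word index by scanning from the start.

    The loop runs char_index times, so the lookup is O(n), not O(1).
    Assumes char_index is within the string.
    """
    pos = 0
    for _ in range(char_index):
        is_pair = (0xD800 <= units[pos] <= 0xDBFF
                   and pos + 1 < len(units)
                   and 0xDC00 <= units[pos + 1] <= 0xDFFF)
        pos += 2 if is_pair else 1
    return pos


# Ten BMP characters, one surrogate pair (words 10 and 11), then more text.
text = "a" * 10 + "\U00010400" + "b" * 9 + "Z" + "c" * 5
units = utf16_units(text)

offset = unit_offset(units, 20)
print(offset, chr(units[offset]), chr(units[20]))
```

Looking only at the words next to word 20 shows nothing unusual (they are all plain BMP characters), yet character 20 lives in word 21 because of the pair much earlier in the buffer.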

Regards,
Martin