[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 15:01:43 -0400


[ I'm the first to admit this hasn't been thought out... I'm writing off the cuff ]
Guido van Rossum writes:
> > foo = u"\u4e00\u020000a"
> > 
> > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] ==
> > u"a".
> 
> I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'.
> 
> (I worry that your sloppy use of variable length \u escapes above
> shows that your understanding of the subject matter is less than
> you've made me believe.  Please say it ain't so!)

The maximum code-point value for a Unicode character is U+10FFFF,
hence the six-digit notation I suggested above (I should have noted it
as such). If Python is going to implement full support for ISO 10646
then the full 32-bit representation (and the 8-digit \U escape) is
appropriate. But if you limit the character escape to six digits, the
scanner can catch out-of-range code points for you, which saves grief
for the end user, IMHO.

I must admit that I wasn't aware of the "\U00020000" notation. I still
think it should limit itself to 6 digits, not 8.
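
To make the scanner argument concrete, here's the kind of range check
I have in mind (just a sketch; check_escape is a made-up helper, not
how the tokenizer actually works):

MAX_UNICODE = 0x10FFFF  # largest code point Unicode defines

def check_escape(hex_digits):
    # What a scanner could do with the digits of a \u/\U escape:
    # reject anything beyond U+10FFFF at compile time rather than
    # letting a bogus value through to run time.
    value = int(hex_digits, 16)
    if value > MAX_UNICODE:
        raise ValueError("escape value %X is outside the Unicode range"
                         % value)
    return value

check_escape("020000")      # U+20000, fine
# check_escape("00110000")  # would raise: beyond U+10FFFF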

> > The fact that this is represented internally different ways shouldn't
> > matter to the user who only cares about characters.
> 
> You misunderstand.  I am claiming that this shouldn't happen because
> it would make u[i] an O(n) operation.  Then you brought up an argument
> that suggested a way of indexing that *wouldn't* make it O(n), and
> that's what I guessed (in my "Ouch" paragraph quoted above).
> 
> But what you describe now doesn't have a constant number of storage
> units per character, so it has to have O(n) indexing time (unless you
> assume a terribly hairy data structure).

I understand O(n) and O(1) perfectly well. My point is that you do not
have to scan the entire string to do this indexing. You only need to
look at, at most, one storage unit on either side of the index: a
surrogate pair is exactly two storage units, so checking whether the
unit at the index is a high or low surrogate tells you whether the
character extends one unit to the right or one unit to the left. We're
only concerned here with transparently handling surrogates when the
underlying representation is UTF-16.
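
Something like this off-the-cuff sketch (char_at is a made-up helper
working on a bare list of 16-bit storage units, not on Python's actual
internals, and the index here is a storage-unit index):

def char_at(units, i):
    # units: list of 16-bit UTF-16 storage units; i: storage-unit index.
    # Returns the code point of the character occupying unit i, looking
    # at most one unit to the left or right to complete a surrogate pair.
    u = units[i]
    if 0xD800 <= u <= 0xDBFF:                      # high surrogate
        low = units[i + 1]
        return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
    if 0xDC00 <= u <= 0xDFFF:                      # low surrogate
        high = units[i - 1]
        return 0x10000 + ((high - 0xD800) << 10) + (u - 0xDC00)
    return u                                       # plain BMP character

# u"\u4e00\U00020000a" as UTF-16 storage units:
units = [0x4E00, 0xD840, 0xDC00, 0x0061]
assert char_at(units, 1) == 0x20000   # either half of the pair ...
assert char_at(units, 2) == 0x20000   # ... yields U+20000
assert char_at(units, 3) == 0x0061    # u'a'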

> Note that in your above example, char(foo, 2) would not be u'a' but
> would be u'\u0000', and char(foo, 3) would be u'a'.

My example above presumes that indices refer to characters, not
storage units, and that UTF-16 is used transparently internally. So in
my world, evaluating

foo = u"\u4e00\U00020000a"

would treat foo[1] as u'\U00020000' and foo[2] as u'a'.
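
Spelled out as asserts (this is the observable behavior I'd want, not
a claim about any particular implementation; with a 16-bit Py_UNICODE
and no surrogate handling you'd see len(foo) == 4 instead):

foo = u"\u4e00\U00020000a"

# Indexing by character rather than by storage unit:
assert len(foo) == 3
assert foo[0] == u"\u4e00"
assert foo[1] == u"\U00020000"
assert foo[2] == u"a"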

> So I still think you haven't thought this out as much as you believe.

As I said, I have no belief that this is thought out. I'm merely
stating what I believe the observable behavior should be.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"