[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 15:57:12 -0400


> Guido van Rossum wrote:
> > 
> >...
> > 
> > Oooh, hadn't thought of that, but yes, it makes sense!
> > 
> > Not yet implemented, but I think it should be.  Makes for a nice
> > pair of invariants:
> > 
> >   unichr(ord('\Udddddddd')) == '\Udddddddd'
> >   ord(unichr(0xdddddddd)) == 0xdddddddd
> > 
> > regardless of whether we're using UCS-2 or UCS-4 storage.
> 
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.

That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it would be nice to support this.  Of course, if
a length-two Unicode string is anything other than a high surrogate
followed by a low surrogate, calling ord() on it should be an error.
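
To make the rule concrete, here is a rough sketch of the behavior
described above.  It is written in present-day Python 3 syntax and is
only an illustration, not the actual interpreter code; the names
pair_ord and narrow_unichr are hypothetical helpers, and the choice of
TypeError for the failure case is an assumption.

```python
# Surrogate ranges as defined by the Unicode standard.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def pair_ord(s):
    """Hypothetical ord() that also accepts one high+low surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        hi, lo = ord(s[0]), ord(s[1])
        if HIGH_START <= hi <= HIGH_END and LOW_START <= lo <= LOW_END:
            # Combine the two 10-bit halves and add the 0x10000 offset.
            return 0x10000 + ((hi - HIGH_START) << 10) + (lo - LOW_START)
    # Anything else (wrong length, wrong order, non-surrogates) is rejected.
    # TypeError is an assumed choice for this sketch.
    raise TypeError("expected one character or one surrogate pair")


def narrow_unichr(cp):
    """Hypothetical unichr() as a UCS-2 build would have to spell it."""
    if cp < 0x10000:
        return chr(cp)
    cp -= 0x10000
    return chr(HIGH_START + (cp >> 10)) + chr(LOW_START + (cp & 0x3FF))
```

With these two helpers the invariants quoted above hold for code
points beyond the BMP as well, e.g.
pair_ord(narrow_unichr(0x10000)) == 0x10000, while a reversed pair
such as '\uDC00\uD800' is rejected.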

--Guido van Rossum (home page: http://www.python.org/~guido/)