[Python-Dev] 2.2 Unicode questions

Guido van Rossum guido@digicool.com
Thu, 19 Jul 2001 10:09:33 -0400


> > Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> > surrogate pair).  Now, maybe that's not what it *should* do...
> 
> It should definitely not, unless you want to break code which assumes
> that chr() and unichr() always return a single byte/code unit!

Reasonable people can disagree about this.
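
For reference, the surrogate pair that a UCS-2 (narrow) build would have to return for a code point above 0xFFFF follows from the standard UTF-16 arithmetic. The following is a minimal sketch in present-day Python 3, not the 2.2 implementation; the helper name to_surrogate_pair is hypothetical and exists only for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; not part of Python's API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

A caller that assumes one code unit per character sees two units here, which is the compatibility concern raised above.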

> This was part of the UCS-4 checkins which I hadn't had time yet to
> review. Should I remove the surrogate part for narrow builds?

Well, this snuck into 2.2a1, so hopefully we'll get some comments
("love it" / "hate it") from the field to guide our decision.

> > > and there's no \code{\e U} notation for embedding characters
> > > greater than 65535 in a Unicode string literal.
> > 
> > Not true either -- correct \U has been part of Python since 2.0.  It
> > does the same thing as unichr() described above.
> 
> Right.
> 
> Note that in this case, the handling of surrogates is needed
> to make the unicode-escape encoding roundtrip safe.

I don't understand what this means.  Can you give an example?
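
One possible reading of "roundtrip safe", sketched in present-day Python 3 rather than a 2.2 narrow build: encoding a string with the unicode-escape codec and decoding the result should give back the original string, including characters above 65535. Whether this is the property meant in the quoted remark is an assumption.

```python
# A string containing one character outside the Basic Multilingual Plane.
original = "\U00010000"

# Encode to the escaped byte form, then decode back to a string.
escaped = original.encode("unicode_escape")
restored = escaped.decode("unicode_escape")

print(escaped)               # b'\\U00010000'
print(restored == original)  # True
```

On a narrow build the decoder would turn \U00010000 into a surrogate pair, so the encoder would have to recognize that pair and emit a single \U escape for the two directions to stay consistent.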

--Guido van Rossum (home page: http://www.python.org/~guido/)