[Python-Dev] 2.2 Unicode questions
Guido van Rossum
guido@digicool.com
Thu, 19 Jul 2001 10:09:33 -0400
> > Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> > surrogate pair). Now, maybe that's not what it *should* do...
>
> It should definitely not, unless you want to break code which assumes
> that chr() and unichr() always return a single byte/code unit!
Reasonable people can disagree about this.
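
For readers following along, here is a rough sketch of what "returns a
surrogate pair" means for a code point above 0xFFFF. It uses the
standard UTF-16 arithmetic and runs on a current Python 3. The helper
name to_surrogate_pair is made up for illustration and is not the
actual unichr() implementation.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only; it mirrors the UTF-16
    encoding rule, not the interpreter's own code.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


pair = to_surrogate_pair(0x10000)
print([hex(unit) for unit in pair])        # ['0xd800', '0xdc00']
```

So on a narrow build the result has length 2, which is exactly what
breaks code expecting a single code unit.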
> This was part of the UCS-4 checkins which I hadn't had time yet to
> review. Should I remove the surrogate part for narrow builds?
Well, this snuck into 2.2a1, so hopefully we'll get some comments
("love it" / "hate it") from the field to guide our decision.
> > > and there's no \code{\e U} notation for embedding characters
> > > greater than 65535 in a Unicode string literal.
> >
> > Not true either -- correct \U has been part of Python since 2.0. It
> > does the same thing as unichr() described above.
>
> Right.
>
> Note that in this case, the handling of surrogates is needed
> to make the unicode-escape encoding roundtrip safe.
I don't understand what this means. Can you give an example?
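
My best guess at what "roundtrip safe" might mean, as a sketch only:
decoding the unicode-escape form of a string should give back the
original string, including characters above 0xFFFF written with \U.
The snippet below checks that property on a current Python 3 (where
there are no narrow builds); the helper name roundtrips is made up for
illustration.

```python
def roundtrips(text):
    """Return True if text survives unicode-escape encode + decode.

    Hypothetical helper; it only checks the property, it does not show
    how a narrow build handles the surrogates internally.
    """
    encoded = text.encode("unicode_escape")
    return encoded.decode("unicode_escape") == text


s = "\U00010000"                       # one character above 0xFFFF
print(s.encode("unicode_escape"))      # b'\\U00010000'
print(s.encode("utf-16-be"))           # b'\xd8\x00\xdc\x00', the surrogate pair
print(roundtrips(s))                   # True
```

On a narrow build the same string is stored as the two code units
0xD800 0xDC00, so presumably the codec has to recognize that pair for
the check above to hold.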
--Guido van Rossum (home page: http://www.python.org/~guido/)