[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 07:21:35 +0200


> Martin v. Loewis writes:
> > > How then is u"\U00200000" represented internally if you use UCS-2 as
> > > the internal storage representation?
> > 
> > I think the obvious answer is: It is not supported. It will give an
> > exception when you try to convert an UTF-8 or UTF-16 string that has
> > such a character, it will be an error if you pass a surrogate to
> > unichr, or in a \u literal.
> 
> So the characters added in Unicode 3.1 in planes 1, 2, and 14 would
> not be representable in Python? Seems a bit draconian to make your
> life easier.

With Fredrik's solution, you'd have to rebuild your Python interpreter
with a 32-bit Unicode type to represent the characters. With that
option, we'd delegate the decision to administrators and Python
distributors. If their users demand support for the additional
characters, they will have to weigh that against the space a 32-bit
type wastes.
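
To put a rough number on the space cost, here is a small sketch. It
uses the UTF-16 and UTF-32 codecs as stand-ins for 16-bit and 32-bit
internal storage; the sample string is arbitrary, and the real
per-object overhead inside an interpreter will differ.

```python
# Assumption: codec output size approximates internal storage size.
text = u"Hello, world"  # arbitrary BMP-only sample, 12 characters

# 16-bit code units: 2 bytes per BMP character.
narrow_bytes = len(text.encode("utf-16-le"))

# 32-bit code units: 4 bytes per character, whatever the plane.
wide_bytes = len(text.encode("utf-32-le"))

print(narrow_bytes, wide_bytes)
```

For text that stays within the BMP, the 32-bit form takes twice the
space and buys nothing.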

Of course, byte code files should then use UTF-16, to allow some
portability of byte code across platforms. If a byte code file
contains a plane 2 string literal, it could not be imported into an
interpreter that uses UCS-2, just as importing the corresponding
source code would fail.
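
As an illustration of why such a literal cannot be loaded into a UCS-2
interpreter, here is a minimal sketch of the UTF-16 surrogate
arithmetic. The function names are made up for this example, and the
check only models the rule described above; it is not taken from any
interpreter's loader.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def fits_ucs2(code_units):
    """Hypothetical check: strict UCS-2 has no room for surrogates."""
    return all(not 0xD800 <= unit <= 0xDFFF for unit in code_units)


# U+20000 is the first code point of plane 2.
pair = to_surrogate_pair(0x20000)
print([hex(unit) for unit in pair])

print(fits_ucs2(pair))             # the plane 2 literal is rejected
print(fits_ucs2([0x0041, 0x00E9])) # BMP-only text is accepted
```

The same two code units come out of the standard codec: encoding
U+20000 as "utf-16-be" gives the bytes D8 40 DC 00.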

Regards,
Martin