[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 28 Jun 2001 08:20:58 +0200


> What is the virtue in making the literal syntax easy and making unichr()
> easy when everything else is hard? Counting characters is hard.
> Addressing characters reliably is hard. Slicing reliably is hard. Why
> not simplify things? Surrogates are just characters. If you want to
> handle wide characters you need to build Python that way.
> 
> I'm trying to imagine the use-case where you care about surrogates
> enough to want them to be automatically generated but not enough to care
> about slicing and addressing and counting and ...and is this use-case
> worth breaking the invariant that len(unichr(i))==1.

I'm in favour of supporting the \U notation to denote non-BMP
characters even in a "narrow" installation. Whether unichr should also
support them is less interesting, but it gives some consistency if it
does.
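
For illustration, here is a minimal sketch of what a "narrow"
installation would have to store for a \U literal. It is written
against present-day Python 3, not the interpreter under discussion,
and the names to_surrogate_pair and narrow_length are hypothetical
helpers, not part of any Python API. The arithmetic is the standard
UTF-16 surrogate encoding.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


def narrow_length(text):
    """Count UTF-16 code units, i.e. the length a narrow build would see."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)
    print("%04X %04X" % (high, low))
    print(narrow_length("\U0001D11E"))
```

For U+1D11E this yields the pair D834 DD1E, and the one-character
literal occupies two code units. That is exactly the case where
len(unichr(i))==1 would no longer hold on a narrow installation.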

The rationale for supporting \U is two-fold. First, importing a module
should not fail in one installation and succeed in another (of the
same Python version). Running the module may give different results,
but you should always be able to generate byte code. Second, people
using non-BMP characters in source are probably not very interested in
counting the characters: they want to display them. Just to display
them, you need to represent them, and you need the fonts. String
manipulation is less important.

Regards,
Martin