[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 23:00:18 +0200


> When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> specify any value in range(0x100000000L), but that's not what Martin
> and Fredrik checked in.  Note that if C code somehow creates a UCS-4
> string containing something with the high bit on, ord() will currently
> return a negative value on platforms where a C long is 32 bits.

Couldn't it be an unenforced rule that C code must also stick to the
17 planes? There are many such unenforced rules already, for example
that you must not modify a string unless you created it yourself by
passing a NULL char* and have not handed out a reference to anybody.
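
For illustration, the 17-plane rule boils down to a single range
check. This is only a sketch: the names MAX_CODE_POINT and
in_unicode_range are made up for this example and are not part of
any Python or C API.

```python
# Highest code point in the 17 Unicode planes (planes 0 through 16).
# Each plane holds 0x10000 code points, so the limit is 17 * 0x10000 - 1.
MAX_CODE_POINT = 0x10FFFF


def in_unicode_range(value):
    """Return True if value lies inside the 17 planes (hypothetical helper)."""
    return 0 <= value <= MAX_CODE_POINT
```

C code that keeps every character it stores within this range never
runs into the high-bit case discussed in the quoted text.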

Effectively, C code that breaks this rule would get undefined
behaviour. On some systems, ord() will return a negative value, on
others a positive one; a future version may raise an error if we find
that too many people have problems with C code writing large integers
into Unicode characters.
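
The platform difference can be modelled with fixed-width integers.
The sketch below uses ctypes to stand in for a 32-bit and a 64-bit C
long; it does not reproduce what ord() does internally, and the helper
names are invented for this example.

```python
import ctypes


def as_32bit_long(code_unit):
    """Reinterpret a 32-bit code unit as a signed 32-bit integer."""
    return ctypes.c_int32(code_unit).value


def as_64bit_long(code_unit):
    """Widen a 32-bit code unit into a signed 64-bit integer."""
    return ctypes.c_int64(code_unit).value


# A value with the high bit set, far outside the 17 planes.
high_bit = 0x80000000

print(as_32bit_long(high_bit))  # negative where a C long is 32 bits
print(as_64bit_long(high_bit))  # positive where a C long is 64 bits
```

Values inside the 17 planes come out the same under both widths, so
only rule-breaking C code would see the difference.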

Regards,
Martin