[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 13:00:44 -0400


(Mass followup.)

> From: "M.-A. Lemburg" <mal@lemburg.com>

> The UTF-16 decoder will raise an exception if it sees a surrogate.
> The encoder writes the internal format as-is, without checking for
> surrogate usage.

Hm, isn't this asymmetric?  I'd imagine that either behavior
(exception or copy as-is) can be useful in either direction at times,
so this should be an option (maybe a different codec name?).

> The UTF-8 codec is fully surrogate aware and will translate
> the input into UTF-16 surrogates if necessary. The encoder
> will translate UTF-16 surrogates into UTF-8 representations
> of the code point.

Good.  This (like the UTF-16 codec's behavior) will have to be made
conditional on sizeof(Py_UNICODE) in my proposal.

> As Mark Davis told me, isolated surrogates are legal code
> points, but the resulting sequence is not a legal Unicode
> character sequence, since these code points (like a few others
> as well) are not considered characters.

Let me use this as an excuse to start a discussion on how far we
should go in ruling out illegal code points.

I think that *codecs* would be wise to be picky about illegal code
points (except for the special UTF-16-pass-through option).

But I think that the *datatype implementation* should allow storage
units to take every possible value, whether or not it's illegal
according to Unicode, either in isolation or in context.  It's much
easier to implement that way, and I believe that the checks ought to
be in other tools.

In particular, I propose (a short sketch of the surrogate rule
follows the list):

- in all cases:

  - \udddd and \Udddddddd always behave the same as unichr(0xdddd) or
    unichr(0xdddddddd)

- with 16-bit (narrow) Py_UNICODE:

  - unichr(i) for 0 <= i <= 0xffff always returns a size-one string
    where ord(u[0]) == i

  - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence the
    corresponding \U escapes) generates a surrogate pair, where u[0]
    is the high surrogate value and u[1] the low surrogate value

  - unichr(i) for i >= 0x110000 raises an exception, as do the
    corresponding \U escapes (the latter at Python-to-bytecode
    compile time)

- with 32-bit (wide) Py_UNICODE:

  - unichr(i) for 0 <= i <= 0xffffffff always returns a size-one
    string where ord(u[0]) == i
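
To make the surrogate rule concrete, here is a rough pure-Python
sketch of the pair arithmetic (this is just standard UTF-16 math; the
helper name is mine, not part of the proposal):

    def to_surrogate_pair(i):
        # Map a code point in 0x10000..0x10FFFF to the UTF-16
        # surrogate pair that a narrow-build unichr() would produce.
        assert 0x10000 <= i <= 0x10FFFF
        i = i - 0x10000
        return (0xD800 + (i >> 10),    # high (leading) surrogate
                0xDC00 + (i & 0x3FF))  # low (trailing) surrogate

    # E.g. to_surrogate_pair(0x10000) == (0xD800, 0xDC00) and
    # to_surrogate_pair(0x10FFFF) == (0xDBFF, 0xDFFF).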

I expect that the surrogate generation rule will be controversial, so
let me explain why I think it's the best possible rule.  We're adding
a difference between Python implementations here: some can only
represent code points up to 0xffff directly, others can represent all
32-bit code points.  This is no different (IMO) than having sys.maxint
vary between platforms, or having thread support be platform
dependent, or having several choices from the *dbm family of modules.
We'll tell users their platform properties: sys.maxunicode is either
0xffff or 0x10ffff.

Users can choose to write code that only runs with wide Unicode
strings.  They ought to put "assert sys.maxunicode>=0x10ffff"
somewhere in their program, but that's their choice -- they can also
just document it, or only run it on their own system which they
configured for wide Unicode.

Users can choose to write code that doesn't use Unicode characters
outside the basic plane.  They don't have to do anything special.

Users can choose to write code that's portable between the two
versions by using surrogates on the narrow platform but not on the
wide platform.  (This would be a good idea for backward compatibility
with Python 2.0 and 2.1 anyway.)  The proposed (and current!) behavior
of \U makes it easy for them to do the right thing with string
literals; everything else, they just have to write code that won't
separate surrogate halves.
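
For example, portable code-point iteration might look like this (a
sketch only, and the helper is hypothetical; on a wide platform the
surrogate branch simply never fires):

    def code_points(u):
        # Return the code points of u as a list, joining surrogate
        # pairs where a narrow build has split a character in two.
        result = []
        i, n = 0, len(u)
        while i < n:
            c = ord(u[i])
            if 0xD800 <= c <= 0xDBFF and i + 1 < n:
                c2 = ord(u[i + 1])
                if 0xDC00 <= c2 <= 0xDFFF:
                    c = 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)
                    i = i + 1
            result.append(c)
            i = i + 1
        return result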

Making unichr() and the \U escape behave the same regardless of
platform makes more sense than the current situation, where unichr()
refuses characters larger than 0xffff, but \U translates them into
surrogates.

I *don't* think \U should be limited to a notation to create
surrogates.

I also don't think it's wise to stop creating surrogates from \U when
appropriate.

I *don't* think it's wise to let unichr() balk at input values that
happen to be lone surrogates.  It is easy enough to avoid these in
applications (if the application gets its input from a codec, it
should be safe already), and it would prevent code that knows what
it's doing from doing stuff beyond the Unicode standard du jour.  That
would be unpythonic.
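
(For instance, unichr(0xD800) would simply return a size-one string
holding the lone high surrogate; whether that is meaningful is the
application's business.)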

> After all this discussion and the feedback from the Unicode
> mailing list, I think we should leave surrogate handling
> solely to the codecs and not deal with them in the internal
> storage. That is, it is the application's responsibility to
> make sure to create proper sequences of code points which can
> be used as character sequences. 

Exactly what I say above.

> The codecs, OTOH, should be aware of what is and what is not
> considered a legal sequence. The default handling should be to
> follow the Unicode Consortium standard. If someone wants to
> have additional codecs which implement the ISO 10646 view of things
> with respect to UTF-n handling, then these can easily be supported
> by codec extensions packages.

Yes.

> >    We
> >    could make it hard by declaring unichr(i) with surrogate i and \u
> >    and \U escapes that encode surrogates illegal, and by adding
> >    explicit checks to codecs as appropriate, but a C extension could
> >    still create an array containing illegal characters unless we do
> >    draconian input checking.
> 
> See above: it's better to leave these decisions to the applications
> using the Unicode implementation.

We agree!

> > ...choose option 3...
> >
> > The only remaining question is how to provide an upgrade path to
> > option 3:
> > 
> > A. At some Python version, we switch.
> 
> Like Fredrik said: as soon as the implementation is ready.

But will the users be ready?

> > B. Choose between 1 and 3 based on the platform.
> > 
> > C. Make it a configuration-time choice.
> > 
> > D. Make it a run-time choice.
> 
> I'd rather not make it a choice: let's go with UCS-4 and be
> done with these problems once and for all !

I assert that it's easy enough to write code that is indifferent to
sizeof(Py_UNICODE).  See SRE as a proof.

I expect that not all Unicode users will be ready to embrace UCS-4.  I
don't want to hear people say "I don't want to upgrade to Python 2.2
because it wastes 4 bytes per Unicode character, but all I ever do is
bandy around basic plane characters."  Given that there's currently
very limited need for characters outside the basic plane, I want to be
able to say that Python 2.2 is UCS-4 ready, but not that it always
uses it.

> As a side effect, you could then also enjoy Unicode on Crays :-)

Indeed.

> Instead of adding an option which allows selecting between
> 2 or 4 bytes per code unit, I think people would rather like
> to see one for disabling Unicode support completely (I know that
> the Pippy Team would :-).

That's definitely another configuration switch that I would like to
see.  How hard would it be?


> From: Toby Dickenson <tdickenson@devmail.geminidataloggers.co.uk>

> In previous discussion about unifying plain strings and unicode
> strings, someone (I forget who, sorry) proposed a unified string
> type that would store its data in arrays of either 1 or 2 byte
> elements (depending on what was efficient for each string) but
> provide a unified interface independent of storage option.
> 
> Could the same option be used to support an option E, individual
> strings use UCS-4 if they have to, but otherwise gain the space
> advantages of UCS-2?

I agree with MAL's rebuttal: this would just make things more
complicated all over the place.


> From: Tom Emerson <tree@basistech.com>

> UTF-8 can be used to encode each half of a surrogate pair
> (resulting in six-bytes for the character) --- a proposal for this was
> presented by PeopleSoft at the UTC meeting last month. UTF-8 can also
> encode the code-point directly in four bytes.

But isn't the direct encoding highly preferable?  When would you ever
want your UTF-8 to be an encoding of UTF-16 code units?

> As Marc-Andre said in his response, you can have a valid stream of Unicode
> characters with half a surrogate pair: that character, however, is
> undefined.

I guess the UTF-8 codec would have to deal with unpaired surrogates
somehow, but I would prefer it if normally it would peek ahead and
encode a valid surrogate pair as the correct 4-byte sequence.
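
In pure-Python terms, the peek-ahead step would amount to something
like this (a sketch of the idea, not the codec's actual C code):

    def utf8_for_surrogate_pair(hi, lo):
        # Combine a high/low surrogate pair and emit the 4-byte
        # UTF-8 form of the resulting code point.
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        return (chr(0xF0 | (cp >> 18)) +
                chr(0x80 | ((cp >> 12) & 0x3F)) +
                chr(0x80 | ((cp >> 6) & 0x3F)) +
                chr(0x80 | (cp & 0x3F)))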

> > I see only one remaining argument against choosing 3 over 2: FUD about
> > disk and primary memory space usage.
> 
> At the last IUC in Hong Kong some developers from SAP presented data
> against the use of UCS-4/UTF-32 as an internal representation. In
> their benchmarks they found that the overhead of cache-misses due to
> the increased character width were far more detrimental to runtime
> than having to deal with the odd surrogate pair in a UTF-16 encoded
> string. After the presentation several people (myself, Asmus Freytag,
> Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a
> little chat about this issue and couldn't agree whether this was
> really a big problem or not. I think it bears more research.

Yet another reason to offer a configuration choice between 2-byte and
4-byte Py_UNICODE, until we know the answer.  (I'm sure it depends on
what the application does with the data too!)

> However, I agree that using UCS-4/UTF-32 as the internal string
> representation is the best solution.

Well, I find it infinitely better than trying to use UTF-16 as the
internal representation but coercing the interface into dealing with
characters and character indices uniformly.

> Remember too that glibc uses UCS-4 as its internal wchar_t
> representation. This was also discussed at the Li18nux meetings a
> couple of years ago.

But I don't think there are many Linux applications that use wchar_t
extensively yet.  At least I haven't seen any.  (Does anyone know if
Mozilla's Asian character support uses wchar_t or Unicode?)

> > A. At some Python version, we switch.
> > 
> > B. Choose between 1 and 3 based on the platform.
> > 
> > C. Make it a configuration-time choice.
> 
> Defaulting to UCS-4?

Unclear.  We'll have to user-test this default and see what the
performance hit really is.

> > We could use B to determine the default choice, e.g. we could choose
> > between option 1 and 3 depending on the platform's wchar_t; but it
> > would be bad not to have a way to override this default, so we
> > couldn't exploit the correspondence much.  Some code could be
> > #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have
> > to be code to support these two having different sizes.
> 
> Seems to me this could add complexity and reliance on platform
> functionality that may not be consistent. Is the savings worth the
> complexity?

Given that the benefits of UCS-4 are unclear at this point, I think we
should be cautious and support both UCS-2 and UCS-4 on all platforms
(except maybe Crays :-).

> > The outcome of the choice must be available at run-time, because it
> > may affect certain codecs.  Maybe sys.maxunicode could be the largest
> > character value supported, i.e. 0xffff or 0xfffff?
> 
> or 0x10ffff?

Yes, I forgot about the 17th plane.


> From: "M.-A. Lemburg" <mal@lemburg.com>


> From: "Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de>

[sketches implementation idea]
> Not that I particularly like that approach; I'm just pointing out it
> is feasible.
> feasible.

I still find this approach very unattractive, and I doubt that it will
be possible to make all aspects of the interface uniform.  What would
be a good reason to try this?  It's by far the most work of all
options.

> [on sre]
> For character classes, it may be acceptable to require that they
> only contain BMP characters; span would use the conversion macros,
> and . would need special casing. I agree this is terrible, but it
> could work.

I doubt that Fredrik would want to maintain it.

> > I think the disk space usage problem is dealt with easily by choosing
> > appropriate encodings; UTF-8 and UTF-16 are both great space-savers,
> > and I doubt many sites will store large amounts of UCS-4 directly,
> > given that good codecs are available.
> 
> For application data, the internal representation is irrelevant; it is
> not easy to get at the internal representation to write a string to a
> file (you have to use a codec). For marshal, backward compatibility
> becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or
> raw-unicode-escape is used, anyway.

Huh?  Marshal uses UTF-8 now.  Since the UTF-8 codec is already fully
surrogate-aware, shouldn't it do the right thing?  E.g. on a "narrow"
platform, encoding a Unicode string containing a surrogate pair
generates the UTF-8 4-byte encoding of the corresponding Unicode
character, and decoding that UTF-8 representation will create a
surrogate pair.  On a wide platform, that same UTF-8 encoding will be
turned into a single character correctly (assuming the UTF-8 codec
is adapted to the wide platform; I presume this code doesn't exist
yet).  So if either platform takes a string literal containing a \U
escape for a non-basic-plane character, and marshals the resulting
string, they get the same marshalled value, and they can both read it
back correctly.  (Try it!  It works.)
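
That is (2.x syntax; on a narrow platform the literal is stored as a
surrogate pair, on a wide one as a single storage unit):

    import marshal
    u = u'\U00010000'
    assert marshal.loads(marshal.dumps(u)) == u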

> The biggest danger is that binary C modules are exchanged between
> installations, e.g. pyd DLLs or RPMs. With distutils, it is really
> easy to create these, so we should be careful that they break
> meaningfully instead of just crashing. So I suppose your "careful
> coding" includes Py_InitModule magic.

Good point!

> Still, exploiting the platform's wchar_t might avoid copies in some
> cases (I'm thinking of my iconv codec in particular), so that would
> give a speed-up.

Yes, but I don't want to *force* users to use UCS-4.  (Yet; in a few
years time this may change.)

We have this code now, so it shouldn't be too hard to keep it.


PEP time?

--Guido van Rossum (home page: http://www.python.org/~guido/)