[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 16:53:35 +0200


> Martin has hinted at a solution requiring even less memory per string
> object, but I don't know for sure what he is thinking of.  All I can
> imagine is a single flag saying "this string contains no surrogates".

That was my original idea. I later thought that having a count of
surrogate pairs would be better, since it allows len() to be computed
in constant time. Indexing would be linear time only for strings
containing surrogates, and constant time otherwise.
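
A minimal sketch of the bookkeeping, assuming 16-bit storage units. The
struct and function names (ustring, count_surrogate_pairs, char_length)
are hypothetical and are not the actual Python object layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object; the field names are illustrative only. */
typedef struct {
    const uint16_t *units;  /* UTF-16 storage units */
    size_t length;          /* number of storage units */
    size_t surrogates;      /* number of surrogate pairs in units[] */
} ustring;

/* Count well-formed surrogate pairs; done once, when the string is built. */
static size_t count_surrogate_pairs(const uint16_t *units, size_t length)
{
    size_t pairs = 0;
    for (size_t i = 0; i + 1 < length; i++) {
        int high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        int low = units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (high && low) {
            pairs++;
            i++;  /* skip the low half of the pair */
        }
    }
    return pairs;
}

/* len() in constant time: each pair occupies two units but is one character. */
static size_t char_length(const ustring *s)
{
    assert(s->surrogates * 2 <= s->length);
    return s->length - s->surrogates;
}
```

A string holding "A", one non-BMP character and "B" has four units and
one pair, so char_length() reports three characters without scanning.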

> But either way, I believe that this requires that every part of the
> Unicode implementation be changed to become aware of the difference
> between characters and storage units.  Every piece of C code that
> currently deals with indices into arrays of Py_UNICODE storage units
> will have to be changed.

One could try to reduce the impact of the change, in particular in
anticipation of your solution 3 (i.e. a 32-bit Py_UNICODE). E.g. code
that currently reads

    if (start < 0)
        start += self->length;
    if (start < 0)
        start = 0;

would then read

    if (start < 0)
        start += Py_UNICODE_LENGTH(self);
    if (start < 0)
        start = 0;
    start = Py_UNICODE_UNIT_OF(self,start);

where Py_UNICODE_UNIT_OF converts from character indices to unit
indices, and is implemented as 

#ifdef Py_UNICODE_4_BYTES
#define Py_UNICODE_UNIT_OF(str,x)  (x)
#else
#define Py_UNICODE_UNIT_OF(str,x)  ((str)->surrogates ? Py_UnicodeUnitOf((str),(x)) : (x))
#endif

Not that I particularly like that approach; I'm just pointing out that
it is feasible.
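
For illustration, the conversion helper could be a linear walk over the
storage units. This is only a sketch: the ustring struct, unit_of() and
slice_start_unit() are hypothetical stand-ins, and the upper clamp is an
addition that keeps the walk inside the buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the string object; not the real layout. */
typedef struct {
    const uint16_t *str;   /* UTF-16 storage units */
    ptrdiff_t length;      /* number of storage units */
    ptrdiff_t surrogates;  /* number of surrogate pairs */
} ustring;

/* Character index -> unit index, walking from the start (linear time). */
static ptrdiff_t unit_of(const ustring *s, ptrdiff_t char_index)
{
    ptrdiff_t unit = 0;
    assert(char_index >= 0 && char_index <= s->length - s->surrogates);
    while (char_index-- > 0) {
        int pair = unit + 1 < s->length
            && s->str[unit] >= 0xD800 && s->str[unit] <= 0xDBFF
            && s->str[unit + 1] >= 0xDC00 && s->str[unit + 1] <= 0xDFFF;
        unit += pair ? 2 : 1;  /* a pair is two units, one character */
    }
    return unit;
}

/* Normalize a possibly negative character index, then convert to units. */
static ptrdiff_t slice_start_unit(const ustring *s, ptrdiff_t start)
{
    ptrdiff_t nchars = s->length - s->surrogates;
    if (start < 0)
        start += nchars;
    if (start < 0)
        start = 0;
    if (start > nchars)    /* assumption: clamp so the walk stays in bounds */
        start = nchars;
    /* Strings without surrogates take the constant-time path. */
    return s->surrogates ? unit_of(s, start) : start;
}
```

For units 0x0041 0xD800 0xDC00 0x0042, a start of -1 is character 2,
which maps to unit 3.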

[on sre]
> There are two parts to this: the internal
> engine needs to realize that e.g. "." and certain "[...]" sets may
> match a surrogate pair, and the indices returned by e.g. the span()
> method of match objects should be translated to character indices as
> expected by the applications.

For character classes, it may be acceptable to require that they
contain only BMP characters; span() would use the conversion macros,
and "." would need special casing. I agree this is terrible, but it
could work.
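
A rough sketch of those two pieces, ignoring newline handling and
everything else the real engine does. The function names are made up
for illustration and are not taken from sre:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How many storage units "." would consume at pos: 2 for a well-formed
   surrogate pair, 1 for anything else (including a lone surrogate). */
static size_t any_char_width(const uint16_t *units, size_t length, size_t pos)
{
    assert(pos < length);
    if (pos + 1 < length
        && units[pos] >= 0xD800 && units[pos] <= 0xDBFF
        && units[pos + 1] >= 0xDC00 && units[pos + 1] <= 0xDFFF)
        return 2;
    return 1;
}

/* Translate a unit index (as the engine tracks it) to a character index
   (as span() would report it). An index that points into the middle of
   a pair is rounded up to the next character. */
static size_t char_index_of(const uint16_t *units, size_t length,
                            size_t unit_index)
{
    size_t chars = 0;
    size_t pos = 0;
    assert(unit_index <= length);
    while (pos < unit_index) {
        pos += any_char_width(units, length, pos);
        chars++;
    }
    return chars;
}
```

The translation is linear in the match position, so an engine would
presumably apply it only when the subject string contains surrogates.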

> I think the disk space usage problem is dealt with easily by choosing
> appropriate encodings; UTF-8 and UTF-16 are both great space-savers,
> and I doubt many sites will store large amounts of UCS-4 directly,
> given that good codecs are available.

For application data, the internal representation is irrelevant; it is
not easy to get at the internal representation to write a string to a
file (you have to use a codec). For marshal, backward compatibility
becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or
raw-unicode-escape is used, anyway.

> The only remaining question is how to provide an upgrade path to
> option 3:
> 
> A. At some Python version, we switch.
> 
> B. Choose between 1 and 3 based on the platform.
> 
> C. Make it a configuration-time choice.
> 
> D. Make it a run-time choice.
> 
> I think we all agree that D is bad.  I'd say that C is the best;
> eventually (say, when Windows is fixed :-) the choice becomes
> unnecessary.  I don't think it will be hard to support C, with some
> careful coding.

The biggest danger is that binary C modules are exchanged between
installations, e.g. pyd DLLs or RPMs. With distutils, it is really
easy to create these, so we should be careful that they break
meaningfully instead of just crashing. So I suppose your "careful
coding" includes Py_InitModule magic.
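
A minimal sketch of such a check, assuming the extension records the
unit size it was compiled with and the interpreter compares it at
import time. module_info, UNICODE_UNIT_SIZE, MODULE_INFO and
check_module are hypothetical names, not part of the Python C API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time setting: size in bytes of one storage unit. */
#ifndef UNICODE_UNIT_SIZE
#define UNICODE_UNIT_SIZE 2
#endif

/* What an extension module would hand to the interpreter at init time. */
typedef struct {
    const char *name;
    size_t unicode_unit_size;  /* frozen when the module is compiled */
} module_info;

/* The macro captures the unit size of the build that compiled the module. */
#define MODULE_INFO(name) { (name), UNICODE_UNIT_SIZE }

/* Interpreter side: refuse a mismatched module with a clear message
   instead of letting it misread string buffers later. */
static int check_module(const module_info *m, size_t interp_unit_size,
                        const char **error)
{
    if (m->unicode_unit_size != interp_unit_size) {
        *error = "module was compiled for a different Unicode unit size";
        return -1;
    }
    *error = NULL;
    return 0;
}
```

The point is only that the mismatch is detected once, at load time,
and reported as an import failure.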

> We could use B to determine the default choice, e.g. we could choose
> between option 1 and 3 depending on the platform's wchar_t; but it
> would be bad not to have a way to override this default, so we
> couldn't exploit the correspondence much.  

Still, exploiting the platform's wchar_t might avoid copies in some
cases (I'm thinking of my iconv codec in particular), so that would
give a speed-up.

> The outcome of the choice must be available at run-time, because it
> may affect certain codecs.  Maybe sys.maxunicode could be the largest
> character value supported, i.e. 0xffff or 0xfffff?

It's actually 0x10ffff, since UTF-16 allows for 16 additional planes,
but yes, that interface sounds good.
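
The arithmetic can be checked with the standard surrogate-pair
formula; the function name below is just for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Combine a high and a low surrogate into one code point. Each half
   carries 10 payload bits, offset by 0x10000 (the end of the BMP). */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}
```

The largest pair, 0xDBFF 0xDFFF, yields 0x10ffff: the BMP plus 16
supplementary planes of 0x10000 code points each.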

Regards,
Martin