[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 14:16:40 -0400


> Guido van Rossum writes:
> > I'm sorry, but I don't see why it's UCS-2 any more or less than
> > UTF-16.  That's like arguing whether 8-bit strings contain ASCII or
> > UTF-8.  That's up to the application; the data type can be used for
> > either.
> 
> UCS-2 and UTF-16 and UTF-8 are encoding forms of Unicode. Unicode
> defines characters using an abstract integer, the code-point. As of
> Unicode 3.1 code points range from 0x000000 to 0x10FFFF.
> 
> The so-called Unicode string type in Python is a wide-string type,
> where each character is treated as a 16-bit quantity. The
> interpretation placed on those 16-bit quantities is that of UCS-2. In
> that case each half of a surrogate pair is an unknown character.

So far we agree.
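
To make the two readings concrete, here is a rough sketch in present-day
Python 3 (not code from this thread; the helper names are made up for
the example).  It takes the same sequence of 16-bit quantities and
counts "characters" once with the UCS-2 reading and once with the
UTF-16 reading.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def count_as_ucs2(units):
    """UCS-2 reading: every 16-bit unit is one character, surrogate or not."""
    return len(units)

def count_as_utf16(units):
    """UTF-16 reading: a high surrogate followed by a low surrogate is one character."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count

# U+10400 lies outside the 16-bit range, so UTF-16 stores it as a surrogate pair.
units = utf16_code_units("a\U00010400b")
```

With the UCS-2 reading the pair counts as two unknown characters; with
the UTF-16 reading it counts as one.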

> As soon as you impose UTF-16 semantics on the 16-bit quantities, then
> you need to treat surrogate pairs as a single character.
> 
> If the implementation won't change, then the standard library needs to
> support surrogates as a wrapper: leaving it up to each application is
> a mistake. IMHO you cannot trust implementers to do this right.

Sure, someone can add a module that provides surrogate support using
the standard Unicode datatype.

> > But unless I misunderstand what it *is* that you are suggesting, the
> > O(1) indexing property can't be retained with your suggestion, and
> > that's out of the question.
> 
> You understand me completely. Adding transparent UTF-16 support
> changes your O(1) indexing operation to O(1+c), where 'c' is the small
> amount of time required to check for the surrogate. Granted, this 'c'
> could get large, but...

I don't think there is such a thing as "O(1+c) for small c".

To extract the n'th Unicode character you would have to loop over all
the preceding characters checking for surrogates.  This makes it O(n).
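
A minimal sketch of that loop, in present-day Python 3 and assuming
the string is given as a plain list of 16-bit code units (the function
name is hypothetical, not an existing API):

```python
def code_point_at(units, n):
    """Return the n-th code point (0-based) of a list of UTF-16 code units.

    The position of the n-th character depends on how many surrogate
    pairs precede it, so the scan has to start from the beginning.
    """
    i = 0        # index into the 16-bit units
    count = 0    # number of code points passed so far
    while i < len(units):
        unit = units[i]
        if (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        ):
            # Well-formed surrogate pair: combine both halves into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            # Ordinary 16-bit character (or an unpaired surrogate, passed through).
            code_point = unit
            width = 1
        if count == n:
            return code_point
        count += 1
        i += width
    raise IndexError("code point index out of range")

# 'A', U+10400 as a surrogate pair, 'B'
sample = [0x0041, 0xD801, 0xDC00, 0x0042]
```

Here code_point_at(sample, 2) has to step over the pair before it can
reach 'B', whereas sample[2] is a single O(1) lookup that lands on the
second half of the pair.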

It's a common Python idiom to read megabytes of text into a single
(8-bit or 16-bit) string object, so changing O(1) to O(n) is a real
problem!

> But I see your point: this requirement is what prompted the glibc
> folks to go with the 32-bit wchar_t type.
> 
> > That turned out to be a myth, actually.  mod_python works fine with
> > threads on most platforms.
> 
> Not in my experience. On my FreeBSD box Python 2.0 built with threads
> does not get along in some cases with Apache 1.3.19. Not that it matters.

FreeBSD happens to be one of those platforms. :-(

It has to do with the fact that on *BSD you link with a different
version of the C library to enable threads, and since Apache is linked
with the unthreaded version, any version of Python embedded in Apache
must also be unthreaded.

--Guido van Rossum (home page: http://www.python.org/~guido/)