[I18n-sig] Re: How does Python Unicode treat surrogates?

Mark Davis mark@macchiato.com
Mon, 25 Jun 2001 11:18:52 -0700


Comments below.

----- Original Message -----
From: "M.-A. Lemburg" <mal@lemburg.com>
To: "Mark Davis" <mark@macchiato.com>
Cc: "Gaute B Strokkenes" <gs234@cam.ac.uk>; "Tim Peters" <tim.one@home.com>;
<i18n-sig@python.org>; <unicode@unicode.org>
Sent: Monday, June 25, 2001 09:46
Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates?


[snip]
>
> My question was aimed in a slightly different direction,
> though. I know that UTF-16 does not allow lone surrogates, but
> how does Unicode itself treat these? If I have a sequence of Unicode
> code points which includes an isolated surrogate code point,
> would this be considered a legal Unicode sequence or not?

It is a legal Unicode code point sequence. However, it is not a legal
Unicode *character* sequence, since it contains code points that by
definition cannot be used to represent characters.
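
To make the distinction concrete, here is a minimal sketch. It assumes a
modern Python 3 (3.3 or later), where a str is a sequence of code points;
that is not the 2001 implementation discussed in this thread. The helper
names is_surrogate and has_no_surrogates are made up for illustration.

```python
def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def has_no_surrogates(s: str) -> bool:
    """True if no code point in s is a surrogate (hypothetical helper)."""
    return not any(is_surrogate(ord(ch)) for ch in s)


# A code point sequence containing an isolated surrogate: Python 3 will
# store it, but it cannot stand for a sequence of characters.
isolated = "a\ud800b"

# A supplementary character written as a single code point is fine.
supplementary = "a\U00010000b"

# The UTF-8 codec refuses to encode the isolated surrogate.
try:
    isolated.encode("utf-8")
except UnicodeEncodeError:
    encodable = False
else:
    encodable = True
```

This only tests for surrogates, which is the case under discussion; it does
not try to check any other property of the sequence.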

>
> > However, you can certainly deal with surrogate code units in storage, and it
> > is permissible on that level to handle them. For example, most UTF-16 string
> > interfaces use code unit indices, so that a string from position 3 of length
> > 5 will include precisely 5 code units, not however many code points (or
> > graphemes!) they take up. Similarly for UTF-8 strings, the low-level units
> > are bytes.
>
> FYI, Python currently uses UTF-16 as internal storage format
> and also exposes this through its indexing interfaces. In that
> sense isolated surrogates would be illegal. The codecs which
> convert such Unicode objects to other encodings would raise an
> exception.

> Unicode object constructors, slicing and concatenating
> Unicode objects currently do not apply any checks though.

That is what is typically done, since using code point indices on each
operation is a very significant performance burden.
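
A rough illustration of where the cost comes from, again assuming a modern
Python 3 and modelling UTF-16 storage as a plain list of 16-bit code units.
The function names utf16_units and code_point_to_unit_index are invented for
this sketch and are not part of any Python API.

```python
import struct


def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_to_unit_index(units: list, cp_index: int) -> int:
    """Find the code unit offset of the cp_index-th code point.

    This needs a linear scan from the start, because each code point
    occupies one or two code units. Assumes cp_index is within range.
    """
    i = 0
    for _ in range(cp_index):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
    return i


text = "a\U00010000bc"        # 4 code points
units = utf16_units(text)     # 5 code units: a, D800, DC00, b, c

# Code unit indexing is a direct lookup, but it can split a pair.
split_slice = units[1:2]      # just the high surrogate

# Code point indexing has to walk the units first.
offset_of_b = code_point_to_unit_index(units, 2)
```

Indexing by code unit is constant time; indexing by code point is
proportional to the position unless an extra index structure is kept.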

Mark