[I18n-sig] Re: How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 08:02:51 +0200


> > > Say you have a Unicode string which contains the following data:
> > >
> > >        U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066
> > >       ("a"    "b"    "c"    ?      "d"    "e"    "f")
> > >
> > > Would you consider this sequence a Unicode string or not ?
> > 
> > I think you are using "Unicode string" with two different meanings here.
> 
> The question is really very simple: is the above correct Unicode
> or not ?

I think it is not. According to Unicode TR 17
(http://www.unicode.org/unicode/reports/tr17/), this is an illegal
sequence of code units. Specifically, the report gives the example

- 0xD800 is incomplete in Unicode
  Unless followed by another 16-bit value of the right form, it is illegal.
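
To make that rule concrete, here is a minimal sketch of a pairing check
over 16-bit code units. The helper name is_well_formed_utf16 is my own
invention for illustration; it is not part of Python or of the Unicode
reports, and it only checks surrogate pairing, nothing else.

```python
# Hypothetical helper (an assumption for illustration only).
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a high/low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not preceded by a high surrogate.
            return False
        else:
            i += 1
    return True


# The sequence from the question: "abc", a lone U+DC00, then "def".
sample = [0x0061, 0x0062, 0x0063, 0xDC00, 0x0064, 0x0065, 0x0066]

print(is_well_formed_utf16(sample))            # False
print(is_well_formed_utf16([0xD800, 0xDC00]))  # True
print(is_well_formed_utf16([0xD800]))          # False
```

Under this check the sequence in question fails because U+DC00 appears
without a preceding code unit in the range 0xD800..0xDBFF.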

Now what does it mean that this is an illegal code unit sequence?
Unicode TR 27 (aka Unicode 3.1) answers this in conformance clause C12:

(a) When a process generates data in a Unicode Transformation Format,
    it shall not emit ill-formed code unit sequences.

(b) When a process interprets data in a Unicode Transformation Format,
    it shall treat illegal code unit sequences as an error condition.

(c) A conformant process shall not interpret illegal UTF code unit
    sequences as characters.
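
As a sketch of what (a) and (b) look like in practice, the following
probes a UTF-16 codec with the sequence from the question. It assumes a
CPython 3 interpreter, version 3.4 or later, which is an assumption on
my part and not the Python under discussion in this thread; the
variable names are illustrative only.

```python
import struct

# Code units from the question: "abc", a lone U+DC00, then "def".
units = [0x0061, 0x0062, 0x0063, 0xDC00, 0x0064, 0x0065, 0x0066]

# Serialize the code units as little-endian 16-bit values.
raw = struct.pack("<%dH" % len(units), *units)

# (b) Interpreting: the decoder treats the lone surrogate as an error.
try:
    raw.decode("utf-16-le")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

# (a) Generating: the encoder refuses to emit the lone surrogate.
# Assumption: Python 3.4+, where the UTF-16 encoder rejects surrogates.
try:
    "abc\udc00def".encode("utf-16-le")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

print(decode_rejected, encode_rejected)
```

Both directions report an error under the stated assumption, which is
the behaviour C12 asks for at the UTF boundary.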

So clearly, we shall never emit that Unicode string in a UTF. In
another message, you write

> FYI, Python currently uses UTF-16 as internal storage format and
> also exposes this through its indexing interfaces.

Since Python uses UTF-16 as an internal format, Python must not emit
the above Unicode string into its internal representation
either. Therefore, if Python can represent the above sequence of code
units, it is not conforming.

Regards,
Martin