[I18n-sig] Re: How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 18:46:59 +0200


Mark Davis wrote:
> 
> You cannot interpret isolated UTF-16 surrogate code units as characters. For
> example, you can't interpret the sequence of D800 followed by 0061 as if it
> were some private use character (say, Klingon) followed by an 'a'.
> 
> (For those unfamiliar with the terminology, see
> http://www.unicode.org/glossary, and my paper at
> http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.)

Thanks for the pointers and the explanations. Your paper is a very
good read indeed.

My question was aimed in a slightly different direction,
though. I know that UTF-16 does not allow lone surrogates, but
how does Unicode itself treat these? If I have a sequence of Unicode
code points which includes an isolated surrogate code point,
would this be considered a legal Unicode sequence or not?
 
> However, you can certainly deal with surrogate code units in storage, and it
> is permissible on that level to handle them. For example, most UTF-16 string
> interfaces use code unit indices, so that a string from position 3 of length
> 5 will include precisely 5 code units, not however many code points (or
> graphemes!) they take up. Similarly for UTF-8 strings, the low-level units
> are bytes.
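
To make the difference between the three kinds of index concrete, here
is a small sketch. It assumes a present-day Python 3, where len() on a
string counts code points, so the code unit and byte counts are obtained
by encoding first. The sample string is made up for illustration.

```python
# Hypothetical sample: "a", U+10000 (outside the BMP), "b".
text = "a\U00010000b"

code_points = len(text)                          # one per code point
code_units = len(text.encode("utf-16-le")) // 2  # 16-bit units, no BOM
utf8_bytes = len(text.encode("utf-8"))           # 8-bit units

print(code_points, code_units, utf8_bytes)
```

The same text therefore has a different length, and different valid
index positions, depending on which unit the interface counts in.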

FYI, Python currently uses UTF-16 as its internal storage format
and also exposes this through its indexing interfaces. In that
sense isolated surrogates would be illegal. The codecs which
convert such Unicode objects to other encodings would raise an
exception. Unicode object constructors, slicing and concatenation
of Unicode objects currently do not apply any checks, though.
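
A minimal sketch of that split between unchecked string operations and
checking codecs. It assumes a present-day Python 3 and its UTF-8 codec
rather than the interpreter discussed here; try_encode is a hypothetical
helper introduced only for the illustration.

```python
# Hypothetical sample: "a", a lone high surrogate, "b".
s = "a" + "\ud800" + "b"

# Construction, concatenation and slicing go through unchecked.
lone = s[1:2]

def try_encode(text, encoding="utf-8"):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

print(try_encode("ab"))   # plain text encodes fine
print(try_encode(lone))   # the codec refuses the lone surrogate
```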
 
> In most people's experience, it is best to leave the low level interfaces
> with indices in terms of code units, then supply some utility routines that
> tell you information about code points. 

So surrogate support, or rather its handling, is left to the applications
using the interface?! Perhaps you are right and this is the only
feasible way to approach the problem...

> The most useful are:
> 
> - given a string and an index into that string, how many code points are
> before it?
> - given a string and a number of code points, what is the lowest index that
> contains them?
> - given a string and an index into that string, is the index on a code point
> boundary?

These are still missing in Python; we should probably add methods
for them in one of the next releases, though.
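
As a rough sketch of what such helpers could look like: the functions
below work on a plain list of 16-bit code unit values. All names are
hypothetical and not an existing Python API. Two assumptions are made:
a lone surrogate is counted as one code point of its own, and a pair
that straddles the index is counted as starting before it.

```python
def is_high(u):
    """True for a high (leading) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    """True for a low (trailing) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF

def _step(units, i):
    """Width in code units of the code point starting at index i."""
    if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
        return 2
    return 1

def count_code_points_before(units, index):
    """Number of code points that start before the given index."""
    count = 0
    i = 0
    while i < index and i < len(units):
        i += _step(units, i)
        count += 1
    return count

def index_after_code_points(units, n):
    """Lowest index such that units[:index] holds n code points."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise ValueError("sequence holds fewer code points than requested")
        i += _step(units, i)
    return i

def is_code_point_boundary(units, index):
    """False only if index falls between the two halves of a pair."""
    if index <= 0 or index >= len(units):
        return True
    return not (is_high(units[index - 1]) and is_low(units[index]))

# Hypothetical sample: "a", U+10000 as the pair D800 DC00, "b".
units = [0x0061, 0xD800, 0xDC00, 0x0062]
print(count_code_points_before(units, 3))
print(index_after_code_points(units, 2))
print(is_code_point_boundary(units, 2))
```

Keeping these as separate utilities leaves the plain indexing interface
in terms of code units, as suggested above.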
 
> An example for Java is at
> http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html.
> 
> Mark
> 
> ----- Original Message -----
> From: "Gaute B Strokkenes" <gs234@cam.ac.uk>
> To: "M.-A. Lemburg" <mal@lemburg.com>
> Cc: "Tim Peters" <tim.one@home.com>; <i18n-sig@python.org>;
> <unicode@unicode.org>
> Sent: Monday, June 25, 2001 05:03
> Subject: Re: How does Python Unicode treat surrogates?
> 
> >
> > [I'm cc:-ing the unicode list to make sure that I've gotten my
> > terminology right, and to solicit comments.]
> >
> > On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > > Tim Peters wrote:
> > >>
> > >> [M.-A. Lemburg]
> > >> > ...
> > >> > 2. What to do when slicing of Unicode strings would break
> > >> >    a surrogate pair ?
> > >>
> > >> To me a string is a sequence of characters, and s[0] returns the
> > >> first, s[1] the second, and so on.  The internal details of how the
> > >> implementation chooses to torture itself <0.7 wink> should be
> > >> invisible.  That is, breaking a surrogate via slicing should be
> > >> impossible: s[i:j] returns j-i characters, and that's that.
> > >
> > > It's not that simple: lone surrogates are true Unicode char points
> > > in their own right; it's just that they are pretty useless without
> > > their resp. partners in the data stream. And with this "feature"
> > > they are in good company: the Unicode combining characters (e.g. the
> > > combining acute) have the same property.
> >
> > This is completely and totally wrong.  The Unicode standard version
> > 3.1 states (conformance requirement C12(c)): A conformant process shall
> > not interpret illegal UTF code unit sequences as characters.
> >
> > The precise definition of "illegal" in this context is given
> > elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:
> >
> >   0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
> >   value of the right form, it is illegal.
> >
> > (Unicode here should read UTF-16, of course.  The reason it does not
> > is that the language of the technical report has not been updated to
> > that of 3.1.)
> >
> > --
> > Big Gaute                               http://www.srcf.ucam.org/~gs234/
> > Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..
> >
> >

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/