[I18n-sig] Re: How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 14:56:23 +0200


"Machin, John" wrote:
> 
> MAL and Gaute,
> 
> Can I please take the middle ground (and risk having both of you throw
> things at me)?

Sure :-)
 
> => Lone surrogates are not 'true Unicode char points
>  in their own right' [MAL] -- they don't represent characters.

I should have added "please correct me if I'm wrong", sorry.

Let me put this into an example:
Say you have a Unicode string which contains the following data:

        U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066
       ("a"    "b"    "c"    ?      "d"    "e"    "f")

Would you consider this sequence a Unicode string or not? Please
note that I am not talking about some UTF-n encoding here. The
above snippet is simply to be seen as a sequence of data entries
which are referenced by the Unicode database.
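
To make the example concrete, here is a minimal sketch of how that
seven-element sequence behaves in a present-day CPython 3 interpreter.
This is an assumption about current behaviour, not a description of the
2001 implementation under discussion; the variable names are only for
illustration.

```python
import unicodedata

# The sequence from the example: "abc", a lone low surrogate U+DC00, "def".
s = "abc\udc00def"

# A Python 3 str holds code points, so the lone surrogate is one element.
length = len(s)
category = unicodedata.category(s[3])  # general category from the Unicode database

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler writes it out as three bytes anyway.
raw = s.encode("utf-8", "surrogatepass")

print(length, category, strict_ok, raw)
# prints: 7 Cs False b'abc\xed\xb0\x80def'
```

So the string object itself accepts U+DC00 as an entry (category "Cs",
surrogate), while the strict codec treats it as unencodable. That is one
way of drawing the line between "sequence of code points" and "legal
encoded text"; it does not settle which view the standard intends.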

> On the other hand, UTF code sequences that would decode into lone
> surrogates are not "illegal". Please read clause D29 in section 3.8 of
> the Unicode 3.0 standard. This is further clarified by Unicode 3.1,
> which expressly lists legal UTF-8 sequences; these encompass lone
> surrogates.
> 
> -----Original Message-----
> From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk]
> Sent: Monday, 25 June 2001 22:04
> To: M.-A. Lemburg
> Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org
> Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
> 
> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments.]
> 
> On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > Tim Peters wrote:
> >>
> >> [M.-A. Lemburg]
> >> > ...
> >> > 2. What to do when slicing of Unicode strings would break
> >> >    a surrogate pair ?
> >>
> >> To me a string is a sequence of characters, and s[0] returns the
> >> first, s[1] the second, and so on.  The internal details of how the
> >> implementation chooses to torture itself <0.7 wink> should be
> >> invisible.  That is, breaking a surrogate via slicing should be
> >> impossible: s[i:j] returns j-i characters, and that's that.
> >
> > It's not that simple: lone surrogates are true Unicode char points
> > in their own right; it's just that they are pretty useless without
> > their respective partners in the data stream. And with this "feature"
> > they are in good company: the Unicode combining characters (e.g. the
> > combining acute) have the same property.
> 
> This is completely and totally wrong.  The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.
> 
> The precise definition of "illegal" in this context is given
> elsewhere.  See <http://www.unicode.org/unicode/reports/tr17/>:
> 
>   0xD800 is incomplete in Unicode.  Unless followed by another 16-bit
>   value of the right form, it is illegal.
> 
> (Unicode here should read UTF-16, of course.  The reason it does not
> is that the language of the technical report has not been updated to
> that of 3.1.)
> 
> --
> Big Gaute                               http://www.srcf.ucam.org/~gs234/
> Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..
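
The slicing question quoted above (point 2) and the remark that 0xD800
is illegal in UTF-16 unless followed by a matching value can both be
illustrated with a short sketch. It assumes a present-day CPython 3
(3.3 or later), where a str is indexed by code point; the byte-level
slicing below is only a stand-in for an implementation that indexes by
16-bit code unit.

```python
import struct

# "a", U+10000 (outside the BMP), "b" as a Python 3 string of code points.
s = "a\U00010000b"

# View the same text as UTF-16 code units (little-endian, no BOM).
data = s.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(data) // 2), data)
# units is (0x0061, 0xD800, 0xDC00, 0x0062): U+10000 became a surrogate pair.

# Cutting after the second code unit splits the pair in half.
broken = data[:4]  # 0x0061 followed by a lone 0xD800

# The strict UTF-16 decoder rejects the unpaired high surrogate.
try:
    broken.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Slicing the str itself works on code points, so the pair stays intact.
middle = s[1:2]

print([hex(u) for u in units], decoded_ok, len(s), len(middle))
```

Indexing by code point gives the "s[i:j] returns j-i characters"
behaviour; indexing by code unit is where a slice can leave a lone
surrogate behind.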

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/