[I18n-sig] Re: How does Python Unicode treat surrogates?

Mon, 25 Jun 2001 18:23:10 +0200

"Machin, John" wrote:
> 
> > Say you have a Unicode string which contains the following data:
> >
> >        U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066
> >       ("a"    "b"    "c"    ?      "d"    "e"    "f")
> >
> > Would you consider this sequence a Unicode string or not ?
> 
> I think you are using "Unicode string" with two different meanings here.

The question is really very simple: is the above correct Unicode
or not ?

> However, the pragmatic question is what should Python do when given such a
> sequence.
> Do we permit such a sequence to be held internally as a "Unicode string"?
> Is u"\udc00" legal in source code or should Python throw a syntax error?
> Same question for u"\uffff".

Right... that's what I was getting at. 

The Unicode object in Python
represents a "Unicode string"; the underlying logic is really secondary.
The question here is whether construction of objects like u"\uFFFF"
should be possible or not.

If the standard defines these as illegal
Unicode, then the constructors should make sure that construction of
these objects is not possible; otherwise, Python should work on them
just like all other "code points" (see http://www.unicode.org/glossary/).
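
To make the two options concrete, here is a minimal sketch of what a
"strict" constructor could look like. It assumes a present-day Python 3
interpreter, where a str is a sequence of code points; the names
find_surrogate_code_points and strict_unicode are hypothetical and not
part of any existing API.

```python
def find_surrogate_code_points(text):
    """Return the indices of all code points in U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def strict_unicode(text):
    """Hypothetical strict constructor: refuse surrogate code points."""
    bad = find_surrogate_code_points(text)
    if bad:
        raise ValueError("surrogate code point at index %d" % bad[0])
    return text


# The sequence from the question: "abc", U+DC00, "def".
sample = "abc\udc00def"
print(find_surrogate_code_points(sample))

try:
    strict_unicode(sample)
except ValueError as exc:
    print("rejected:", exc)
```

The permissive option is simply to skip the check and treat U+DC00 like
any other code point, leaving it to the codecs to decide what to do on
output.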

> We *do* need to consider UTF encodings, because Unicode *expressly* allows
> decoding UTF sequences
> that become unpaired surrogates, or other "not 100% valid" scalars such as
> 0xffff and 0xfffe.

The standard says this on the noncharacter code points:

"""
  D7b
         Noncharacter: a code point that is permanently reserved for internal use,
         and that should never be interchanged. In Unicode 3.1, these consist of
         the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hex) and the
         values U+FDD0..U+FDEF.

  C5
        A process shall not interpret a noncharacter
        code point as an abstract character.

        The code points may be used internally, such as for sentinel values or
        delimiters, but should not be exchanged publicly. 

  C10
         A process shall make no change in a valid coded character representation
         other than the possible replacement of character sequences by their
         canonical-equivalent sequences or the deletion of noncharacter code
         points, if that process purports not to modify the interpretation of that
         coded character sequence.

        If a noncharacter which does not have a specific internal use is
        unexpectedly encountered in processing, an implementation may signal an
        error or delete or ignore the noncharacter. If these options are not taken,
        the noncharacter should be treated as an unassigned code point. For
        example, an API that returned a character property value for a noncharacter
        would return the same value as the default value for an unassigned code
        point. 
"""

Note that lone surrogates are not regarded as noncharacters (for some
reason).
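
The distinction can be written down directly from D7b. The following is
only a sketch in Python 3; the function names are made up for
illustration.

```python
def is_noncharacter(cp):
    """True for the code points listed in D7b of Unicode 3.1."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF for every plane n from 0 to 0x10.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


for cp in (0xFFFF, 0xFFFE, 0xFDD0, 0xDC00, 0x0061):
    print("U+%04X noncharacter=%s surrogate=%s"
          % (cp, is_noncharacter(cp), is_surrogate(cp)))
```

Under this definition U+DC00 is a surrogate but not a noncharacter, so
the rules quoted above for noncharacters do not cover it.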

> So,
> given that Python supports Unicode, not ISO 10646, we must IMO permit such
> sequences in our internal
> representation. It follows that we should stop worrying about these
> irregular values -- it's less
> programming that way. Unicode 3.1 will create enough extra programming as it
> is, because we now have
> variable-length characters again -- just what Unicode was going to save us
> from :-(

Agreed; now who's going to submit the patches ;-)
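
For reference, the "variable-length characters" are the surrogate pairs
that UTF-16 uses for code points above U+FFFF. A rough sketch of the
arithmetic in Python 3 (helper names are hypothetical):

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print("U+1D11E -> U+%04X U+%04X" % (high, low))
```

A U+DC00 with no preceding high surrogate, as in the example at the top
of this thread, fails the range check in from_surrogate_pair; that is
exactly the "lone surrogate" case under discussion.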

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/