[I18n-sig] How does Python Unicode treat surrogates?

Machin, John JMachin@Colonial.com.au
Sun, 24 Jun 2001 10:09:34 +1000


Hello there,

I'm the 'nobody' who raised the SF bug report to which Martin refers.

According to Unicode 3.0, transformations between scalars and UTF-n
should provide lossless round-trip transcoding, even for invalid scalars
like unpaired surrogates and values like 0xFFFE and 0xFFFF.

Unicode 3.1 adds further clarification by listing out what are legal
byte sequences for UTF-8; these include byte sequences that encompass
those invalid scalars.
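
To make the byte sequences in question concrete, here is a minimal sketch
in present-day Python 3 (not the 2.1 implementation discussed here) of the
generic three-byte UTF-8 form applied to any 16-bit value from 0x0800 up,
with no check for surrogates or 0xFFFE/0xFFFF. The helper names encode3
and decode3 are made up for illustration.

```python
def encode3(scalar):
    """Encode a value in 0x0800..0xFFFF as three UTF-8 bytes, with no validity check."""
    if not 0x0800 <= scalar <= 0xFFFF:
        raise ValueError("three-byte form covers 0x0800..0xFFFF only")
    return bytes([
        0xE0 | (scalar >> 12),          # lead byte: 1110xxxx
        0x80 | ((scalar >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | (scalar & 0x3F),         # continuation: 10xxxxxx
    ])


def decode3(data):
    """Inverse of encode3; accepts surrogates and 0xFFFE/0xFFFF alike."""
    b0, b1, b2 = data
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


for scalar in (0xD800, 0xDC00, 0xFFFE, 0xFFFF):
    encoded = encode3(scalar)
    assert decode3(encoded) == scalar   # lossless round trip
    print(hex(scalar), encoded.hex())
```

In this form an unpaired high surrogate 0xD800 comes out as ED A0 80 and
an unpaired low surrogate 0xDC00 as ED B0 80.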

There is a note in the Unicode docs that ISO/IEC 10646 ("ISO" for short)
forbids this permissive treatment of invalid scalars.

The implementation in Python 2.1 does this:

encoding to UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired low surrogate: Unicode-compliant
  unpaired high surrogate: *BUG*, generates invalid UTF-8 byte sequence
decoding from UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired surrogates: ISO-compliant 

In a note that Martin added to my bug report, he seems to be
advocating ISO compliance.

My two-cents-worth on how to approach the differences between Unicode
and ISO:

Unicode is the *practical* standard. Unicode is the
*available* standard -- you can buy the book; you can access
the web site. Martin said in his note to my bug report that
he doesn't have a copy of the ISO document(s); he's not alone!

Python advertises Unicode support, not ISO/IEC 10646 support.
If we make the transcoding of invalid scalars ISO-compliant,
then we should document and justify this. We should do this
for *all* invalid scalars, not just unpaired surrogates.

Perhaps the effort that would be required to do all the 
explicit testing to make all the transcoders ISO-compliant 
would be better directed into providing a function or method
that checked a Unicode string for the presence of invalid scalars.
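
A checking function of that kind might look roughly like the sketch below.
It is written for present-day Python 3, where a str is a sequence of code
points and any surrogate found in it is by definition unpaired; the name
find_invalid_scalars and the choice to report (index, value) pairs are
assumptions for illustration, not an existing API. It covers only the
cases named in this message: surrogates, 0xFFFE and 0xFFFF.

```python
def find_invalid_scalars(text):
    """Return (index, code point) pairs for surrogates and U+FFFE/U+FFFF in text."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        # Surrogate range D800..DFFF, plus the two values named in the post.
        if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
            hits.append((index, cp))
    return hits


if __name__ == "__main__":
    sample = "a\ud800b\uffff"
    for index, cp in find_invalid_scalars(sample):
        print(index, hex(cp))
```

A caller who wants strict behaviour can run the check before encoding and
refuse the string, leaving the transcoders themselves permissive.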

A very practical point: Fixing the invalid-byte-sequence bug involves
adding two or three lines of code. Making the UTF-8 decoder
Unicode-compliant involves removing half a line of code. Minimal
effort, and no documentation or justification required.

Hmmm, 4 cents worth by the end of the rant :-)
Anyway, hope this helps,
John


-----Original Message-----
From: Martin v. Loewis [mailto:martin@loewis.home.cs.tu-berlin.de]
Sent: Sunday, 24 June 2001 8:19
To: mal@lemburg.com
Cc: guido@digicool.com; i18n-sig@python.org
Subject: Re: [I18n-sig] How does Python Unicode treat surrogates?


> > Likewise, but somewhat more troubling, surrogates that straddle write
> > invocations are not processed properly.
> > 
> > >>> s=StringIO.StringIO()
> > >>> _,_,r,w=codecs.lookup("utf-8")
> > >>> f=w(s)
> > >>> f.write(u"\ud800")
> > >>> f.write(u"\udc00")
> > >>> f.flush()
> > >>> s.getvalue()
> > '\xa0\x80\xed\xb0\x80'
> > 
> > whereas the correct answer would have been
> > 
> > >>> u"\ud800\udc00".encode("utf-8")
> > '\xf0\x90\x80\x80'
> 
> This is a special case of the above (since the encoder will
> see truncated surrogates and should raise an exception
> for these).

I don't think it should; it is not truncated since a later write call
will provide the missing word. If you have a Unicode stream, it should
be possible to read the stream contents in arbitrary chunks of words,
and encode it with a stream encoder.

The stream encoder should produce the same output no matter how you
split the input. Under your proposed behaviour, this is not the case.
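
One way to get that split-independence is for the writer to hold back a
trailing high surrogate until the next write call. The following is only a
sketch in present-day Python 3, not the codecs module's stream writer; the
class name PairingUTF8Writer and the helper join_pairs are made up for
illustration, and a high surrogate still unpaired at flush time is written
in the permissive three-byte form via the "surrogatepass" error handler.

```python
import io


def join_pairs(text):
    """Combine each adjacent high/low surrogate pair into one code point."""
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


class PairingUTF8Writer:
    """Hypothetical stream writer whose output does not depend on chunking."""

    def __init__(self, stream):
        self.stream = stream   # binary stream that receives the UTF-8 bytes
        self.pending = ""      # high surrogate held over from the last write

    def write(self, text):
        text = self.pending + text
        self.pending = ""
        if text and 0xD800 <= ord(text[-1]) <= 0xDBFF:
            # The matching low surrogate may arrive in the next call.
            self.pending, text = text[-1], text[:-1]
        self.stream.write(join_pairs(text).encode("utf-8", "surrogatepass"))

    def flush(self):
        # Nothing more is coming, so emit a held-back surrogate as it stands.
        self.stream.write(self.pending.encode("utf-8", "surrogatepass"))
        self.pending = ""


if __name__ == "__main__":
    s = io.BytesIO()
    f = PairingUTF8Writer(s)
    f.write("\ud800")
    f.write("\udc00")
    f.flush()
    print(s.getvalue())
```

Whether a leftover surrogate at flush time should be written or rejected
is exactly the Unicode-versus-ISO question under discussion; the sketch
takes the permissive side only to keep the example short.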

Please note that

http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470

adds a few other aspects to the problem: It appears that Unicode 3.1
specifies that certain forms of UTF-8 encoded surrogates are merely
irregular, not illegal. There may be some misinterpretation of the
spec in this report, but I think all this needs careful checking.

Regards,
Martin



