[I18n-sig] Re: Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 19:06:30 +0200


> "Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> wrote:
> 
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1. Where exactly does it say that processing
> > a six-byte two-surrogates sequence as a single character is
> > non-conforming?
> 
> It's not non-conforming, it's "irregular". 

If some implementation processes something, it is either conforming
or non-conforming in doing so, no? The byte sequence itself may be
irregular; I'm asking how a conforming implementation should deal
with it when it sees it.

> Please read the technical report (#27) that I pointed at yesterday
> (on the i18n-sig@python).  It gives detailed specifications for
> UTF-8.  Anything not in the table "UTF-8 Bit Distribution" and
> accompanying description shown there is non-conforming.

I see conformant/non-conformant (*) only used for implementations (and
processes), not for byte sequences. There you use illegal, ill-formed,
irregular; much of my confusion probably is because I don't know how
these terms relate, except for

- an irregular sequence (of bytes, or code units) is not illegal.

Also, I assume that negation of these concepts follows the English
language rules (i.e. "not illegal" == "legal", "not ill-formed" ==
"well-formed", etc.).

> In other words, it is non-conforming to generate two 3-byte things for a  
> surrogate pair.  However, it remains "legal but irregular" to interpret  
> such a pair of 3-byte entities.
[...]
> If you still find the definitions and discussion in the technical report  
> to be unclear, then the Unicode editorial committee would undoubtedly like  
> to hear about it.

The issue of UTF-8 encoded surrogate pairs is now clear to me, I hope:
You must not write them, but you may read them.
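
To make that rule concrete, here is a minimal sketch in Python of how I
read it. The function names are my own invention for illustration and
are not taken from TR 27 or from any existing codec: a reader accepts
the irregular six-byte form and recovers the scalar value, and a writer
only ever produces the regular four-byte form.

```python
def decode_three_byte(b):
    """Decode one 3-byte UTF-8 sequence into its 16-bit value."""
    if (len(b) != 3 or (b[0] & 0xF0) != 0xE0
            or (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte UTF-8 sequence")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def read_irregular_pair(six_bytes):
    """Hypothetical reader: accept two 3-byte surrogates as one character."""
    high = decode_three_byte(six_bytes[:3])
    low = decode_three_byte(six_bytes[3:6])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Standard surrogate arithmetic: combine the pair into one scalar value.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def write_regular(scalar):
    """Hypothetical writer: always emit the shortest (4-byte) form."""
    return chr(scalar).encode("utf-8")


# The surrogate pair D800 DC00, encoded irregularly as two 3-byte sequences.
irregular = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
scalar = read_irregular_pair(irregular)
regular = write_regular(scalar)
print(hex(scalar), regular.hex())
```

The reader and the writer are deliberately asymmetric: the six-byte
input is accepted, but what comes back out is four bytes.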

The next question then is what to do with lone surrogate triplets (a
single surrogate encoded as three bytes); the table in TR 27 suggests
they are legal, but people on this list have argued they must neither
be emitted nor consumed (since what you get is not a legal Unicode
scalar value, USV).
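
The objection can be illustrated with a short Python sketch. It only
shows why the decoded value is not a scalar value; it does not settle
what a conforming implementation should do with it. The helper name
is_scalar_value is hypothetical, chosen for this example.

```python
def is_scalar_value(v):
    """True if v is a Unicode scalar value: any code point except a surrogate."""
    return 0 <= v <= 0x10FFFF and not (0xD800 <= v <= 0xDFFF)


# A lone high surrogate (D800) written as a 3-byte UTF-8 sequence.
lone = bytes([0xED, 0xA0, 0x80])

# Apply the 3-byte bit distribution: 1110xxxx 10xxxxxx 10xxxxxx.
value = ((lone[0] & 0x0F) << 12) | ((lone[1] & 0x3F) << 6) | (lone[2] & 0x3F)

print(hex(value), is_scalar_value(value))
```

The bytes fit the three-byte pattern, yet the result falls in the
surrogate range D800..DFFF, so it is not a scalar value.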

Thanks for your comments,
Martin

(*) "Conforming" is never used; sorry for the confusion.