[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 21:43:12 +0200


> > From: Tom Emerson <tree@basistech.com>
> 
> > UTF-8 can be used to encode each half of a surrogate pair
> > (resulting in six bytes for the character) --- a proposal for this was
> > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also
> > encode the code-point directly in four bytes.
> 
> But isn't the direct encoding highly preferable?  When would you ever
> want your UTF-8 to be encoded UTF-16?

Somebody please correct me: A conforming implementation must never
encode a non-BMP character with six bytes in UTF-8; security people
will shoot you if you say that two alternative representations for the
same string are possible.
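
To make the two representations concrete, here is a minimal sketch in
present-day Python 3 (not the Python of this thread). The helper names
surrogate_pair, direct_form and six_byte_form are made up for
illustration, and the "surrogatepass" error handler is used only to force
the encoder to emit the surrogate halves that a strict encoder refuses.

```python
def surrogate_pair(cp):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def direct_form(cp):
    """Encode the code point itself: four bytes for non-BMP characters."""
    return chr(cp).encode("utf-8")


def six_byte_form(cp):
    """Encode each surrogate half separately: three bytes each, six in total."""
    hi, lo = surrogate_pair(cp)
    return (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")


cp = 0x10400  # an arbitrary non-BMP code point, chosen for illustration
direct = direct_form(cp)
paired = six_byte_form(cp)
print(len(direct), direct.hex())
print(len(paired), paired.hex())
```

The two byte strings differ although they are meant to denote the same
character, which is exactly the ambiguity that worries security people:
a byte-level comparison or filter that knows only one form will miss the
other.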

HOWEVER, I think what the spec says is that an implementation shall
accept non-BMP characters encoded as six bytes of UTF-8. This is
because buggy implementations may produce such output, and because
that was previously left unspecified, so accepting such UTF-8 strings
improves interoperability.
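
A receiver that is strict about what it emits but lenient about what it
accepts could look roughly like the sketch below. This is only an
illustration in present-day Python 3, not what any particular spec or
implementation mandates; lenient_utf8_decode is a hypothetical name, and
the approach relies on the "surrogatepass" error handler of the standard
codecs.

```python
def lenient_utf8_decode(data):
    """Decode UTF-8, also accepting surrogate pairs encoded as 3+3 bytes."""
    # surrogatepass lets the decoder keep encoded surrogate halves
    # instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each adjacent high/low pair into
    # one code point; an unpaired surrogate still raises here.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


six_bytes = b"\xed\xa0\x81\xed\xb0\x80"   # U+10400 as two encoded surrogates
four_bytes = b"\xf0\x90\x90\x80"          # U+10400 encoded directly

# Both inputs are accepted and yield the same string.
assert lenient_utf8_decode(six_bytes) == lenient_utf8_decode(four_bytes)

# On output, only the four-byte form is ever produced.
assert lenient_utf8_decode(six_bytes).encode("utf-8") == four_bytes
```

The design choice here is to normalize at the boundary: whatever form
arrives, only one representation exists inside the program and on output.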

> Huh?  Marshal uses UTF-8 now.

Oops, I should have checked.

Regards,
Martin