[I18n-sig] Support for "wide" Unicode characters

Machin, John JMachin@Colonial.com.au
Fri, 29 Jun 2001 08:08:04 +1000


[John Machin]
> Guido asked:
>    Does UTF-8 transfer isolated surrogates correctly? 
> 
> No. See my bug report in SF. Briefly, a lone high
> surrogate has its leading UTF-8 byte omitted,
> causing an illegal UTF-8 sequence to be generated.
> 
> Here's the URL:
> http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882
> 
> (or search for "surrogates")

[Guido again]
It's a bug indeed.

But my question was about the definition of UTF8, not our (fallible)
implementation.

What *should* be the result of u'\ud800'.encode('utf8')?
'\xed\xa0\x80' or an exception?

And likewise, what should be the result of unicode('\xed\xa0\x80',
'utf8')?
u'\ud800' or an exception?

(Likewise for low surrogates; currently, u'\udc00'.encode('utf8')
returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an
exception.)
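
For comparison only, here is a small sketch of how a present-day Python 3
interpreter answers both questions. This is an assumption layered on top of the
thread, which predates Python 3: it uses the str/bytes API and the standard
'surrogatepass' error handler, neither of which existed when this was written.

```python
# Python 3 behaviour, shown only for comparison with the questions above.
lone_high = "\ud800"

# Default errors='strict': encoding a lone surrogate is refused.
try:
    lone_high.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The 'surrogatepass' handler lets the surrogate through as three bytes.
passed = lone_high.encode("utf-8", "surrogatepass")

# Default errors='strict': decoding those three bytes is refused as well.
try:
    b"\xed\xa0\x80".decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With 'surrogatepass' the bytes round-trip back to the lone surrogate.
round_trip = b"\xed\xa0\x80".decode("utf-8", "surrogatepass")
```

In other words, that interpreter picks "an exception" for both directions by
default, and offers the '\xed\xa0\x80' answer only on request.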

[John Machin]
OK, sorry for the misunderstanding.
A UTF-8 codec can be made to transcode scalars up to at least 31 bits wide.
The ISO 10646 specification allows for this. 

So, for marshalling and (pickling?) purposes, calling the UTF-8 codec with
errors='liberal' would be the way to go. IMO, 'liberal' should still give an
exception for over-long UTF-8 byte sequences -- an encoder which produces
such is broken (either accidentally or deliberately) -- but should happily
transcode any scalar value <= X for some X in (0x10FFFF, 0x7FFFFFFF).
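
A rough sketch of what such a 'liberal' mode might look like, written in
present-day Python 3 purely for illustration. The names encode_liberal and
decode_liberal are hypothetical, and taking X = 0x7FFFFFFF is an assumption
(the largest value the original 1- to 6-byte UTF-8 forms can carry). Surrogates
pass through untouched; over-long sequences are rejected.

```python
# Hypothetical helpers; not part of any real codec API.
_FORMS = (  # (byte count, exclusive upper bound, lead-byte marker)
    (2, 0x800, 0xC0),
    (3, 0x10000, 0xE0),
    (4, 0x200000, 0xF0),
    (5, 0x4000000, 0xF8),
    (6, 0x80000000, 0xFC),
)
# Smallest scalar that legitimately needs each byte count.
_MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                   5: 0x200000, 6: 0x4000000}


def encode_liberal(cp: int) -> bytes:
    """Encode one scalar in 0..0x7FFFFFFF using the shortest form."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("scalar outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    for nbytes, limit, lead in _FORMS:
        if cp < limit:
            break
    out = []
    for _ in range(nbytes - 1):      # trailing bytes, low bits first
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    out.append(lead | cp)            # remaining high bits go in the lead byte
    return bytes(reversed(out))


def decode_liberal(data: bytes) -> int:
    """Decode exactly one scalar; reject malformed or over-long input."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        nbytes, cp = 1, first
    elif first < 0xC0:
        raise ValueError("unexpected continuation byte")
    elif first < 0xE0:
        nbytes, cp = 2, first & 0x1F
    elif first < 0xF0:
        nbytes, cp = 3, first & 0x0F
    elif first < 0xF8:
        nbytes, cp = 4, first & 0x07
    elif first < 0xFC:
        nbytes, cp = 5, first & 0x03
    elif first < 0xFE:
        nbytes, cp = 6, first & 0x01
    else:
        raise ValueError("invalid lead byte")
    if len(data) != nbytes:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < _MIN_FOR_LENGTH[nbytes]:
        raise ValueError("over-long UTF-8 sequence")
    return cp
```

With these, encode_liberal(0xD800) gives the three bytes ED A0 80, while
decode_liberal(b"\xc0\x80") (an over-long NUL) raises ValueError.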

IMO, when errors is 'strict', the upper limit should be 0xFFFF for narrow
builds and 0x10FFFF for wide builds.

IMO, unicode(), u.encode() and the \U notation should all use 'strict' ... and
perhaps the exception messages produced by the narrow build could be
marketing-aligned and point the punter to the wide build.


Cheers,
John

