[I18n-sig] UCS-4 configuration

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 19:34:16 -0400


Wow, this is so cool!  Seems we don't need a PEP...  Just an update to
the NEWS file and some changes to the docs and test suite.

> > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray).
> > except for that, it's not too different from what I was working on.
> 
> Indeed it doesn't. How are you going to solve this? Generating
> UCS-2/UTF-16 when you have no two-byte type is not easy, unless you
> plan to do all byte operations yourself.

Don't be a wimp. :-)

As Tim Peters keeps pointing out, it's really not that hard to write
such code, e.g. using the occasional mask operation.  And a good
compiler will remove the masks that don't do anything.
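
A minimal sketch of the masking idea, written in Python 3 for
illustration rather than in the C of the actual patch: each UTF-16 code
unit lives in an ordinary integer and is masked to 16 bits before it is
split into bytes, so no dedicated two-byte type is needed.  The names
MASK16 and utf16le_bytes are hypothetical.

```python
MASK16 = 0xFFFF  # keep only the low 16 bits of a wider integer


def utf16le_bytes(code_units):
    """Serialize 16-bit code units to little-endian bytes with shifts and masks."""
    out = bytearray()
    for unit in code_units:
        unit &= MASK16           # a no-op when the value already fits in 16 bits
        out.append(unit & 0xFF)  # low byte
        out.append(unit >> 8)    # high byte (at most 0xFF after the mask)
    return bytes(out)
```

In C the same masks would be applied to a wider integer type; where
short really is 16 bits they change nothing, which is the case a good
compiler can optimize away.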

> Anyway, at the moment, it is a compile time error if short is not two
> bytes. I hope I found all places where Py_UCS2 should be used.

Me too.  I hope for the Cray folks that short will be allowed to vary
properly.

Another loose end: define sys.maxunicode.
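
A small sketch of how such an attribute could be used, assuming
sys.maxunicode reports the largest code point the build supports:
0xFFFF in 2-byte mode and 0x10FFFF in 4-byte mode (Python 3.3 and later
always report 0x10FFFF).  The helper name unicode_width is hypothetical.

```python
import sys


def unicode_width():
    """Return the code unit width in bytes implied by sys.maxunicode."""
    # Assumption: a value of 0xFFFF or less means a 2-byte (UCS-2) build.
    return 2 if sys.maxunicode <= 0xFFFF else 4
```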

> Regards,
> Martin
> 
> P.S. This patch makes the test suite fail in four byte mode, when
> trying to check the output of u'\ud800\udc02'.encode('utf-8'). IMO,
> all literals denoting surrogates should be replaced with \U
> literals in test_unicode; this is not done yet.

Here's another weird failure in 4-byte mode, with a manually
constructed surrogate pair (using marshal, but direct use of
u.encode('utf8') would show the same problem):

>>> u = u'\ud800\udc00'
>>> u
u'\ud800\udc00'
>>> len(u)
2
>>> import marshal
>>> s = marshal.dumps(u)
>>> s
'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80'
>>> marshal.loads(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>> 

Note how the utf8 codec has encoded the surrogate pair as two 3-byte
utf8 sequences.  I think it should either spit out an error or (I
think this is better -- "be forgiving in what you accept") recognize
the surrogate pair and spit out a 4-byte utf8 sequence.  Note that in
2-byte mode, this same string literal can be marshalled and
unmarshalled just fine!
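
A minimal sketch of the forgiving option, in Python 3 and independent
of the real codec: fold a high/low surrogate pair into one code point,
then encode it as a single 4-byte UTF-8 sequence.  Both function names
are hypothetical.

```python
def combine_surrogates(high, low):
    """Fold a UTF-16 surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_four_bytes(cp):
    """Encode a code point in 0x10000..0x10FFFF as 4 UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte UTF-8 range")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # then three continuation bytes,
        0x80 | ((cp >> 6) & 0x3F),   # 6 bits each
        0x80 | (cp & 0x3F),
    ])
```

For the pair in the session above, combine_surrogates(0xD800, 0xDC00)
gives 0x10000, and utf8_four_bytes(0x10000) gives '\xf0\x90\x80\x80'
instead of the six bytes that marshal wrote.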

I think I'm going to withdraw my recommendation that in 4-byte mode \U
and unichr() should accept any 32-bit value; the use of UTF-8 by marshal
effectively rules this out.

Or should we change the marshalling format to do something that's more
transparent?  It feels uncomfortable that in 2-byte mode we can easily
create unicode strings containing illegal sequences (e.g. lone
surrogates), but these strings can't be marshalled.  Marshal has no
business being judgemental about the value of the data.

I think we can work out most of the backward compatibility issues by
switching to a new marshal tag byte (e.g. 'U').
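
To make the idea concrete, here is a toy sketch of a transparent
format, in Python 3.  It is not marshal's actual layout: the 'U' tag,
the 4-byte little-endian count, and the one-32-bit-value-per-character
body are all assumptions made for illustration, as are the function
names.

```python
import struct

TAG_RAW_UNICODE = b"U"  # hypothetical tag byte, not marshal's real format


def dump_raw_unicode(text):
    """Write tag, character count, then each code point as 32 bits."""
    body = b"".join(struct.pack("<I", ord(ch)) for ch in text)
    return TAG_RAW_UNICODE + struct.pack("<I", len(text)) + body


def load_raw_unicode(data):
    """Read back a string written by dump_raw_unicode, without validation."""
    if data[:1] != TAG_RAW_UNICODE:
        raise ValueError("unexpected tag byte")
    (count,) = struct.unpack_from("<I", data, 1)
    values = struct.unpack_from("<%dI" % count, data, 5)
    return "".join(chr(v) for v in values)
```

Because the body never goes through a codec, a lone surrogate such as
u'\ud800' round-trips unchanged, at the cost of four bytes per
character.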

PS. I checked in a tiny improvement to the unichr() code.

--Guido van Rossum (home page: http://www.python.org/~guido/)