[I18n-sig] UCS-4 configuration

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 11:20:14 -0400


> > Another loose end: define sys.maxunicode.
> 
> Breaking my promise not to touch the code, I've added this. I was not
> quite sure what type you meant to see in sys.maxunicode; I took
> integer, since U+FFFF is a non-character.

Correct.  And thanks!
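
For illustration, here is a minimal sketch of how code could use the
new attribute to tell the two builds apart.  The helper name
build_width is hypothetical, not part of the patch.  It assumes the
value is 0xFFFF on a 2-byte build and 0x10FFFF on a wide build; note
that Python 3.3 and later always report 0x10FFFF.

```python
import sys

def build_width(maxunicode=sys.maxunicode):
    """Classify a build by its largest supported code point.

    Hypothetical helper: 0xFFFF is assumed to mean a 2-byte (narrow)
    build, anything larger a wide build.
    """
    return "narrow" if maxunicode == 0xFFFF else "wide"

# sys.maxunicode is a plain integer, as discussed above.
print(hex(sys.maxunicode), build_width())
```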

> > Note how the utf8 codec has encoded the surrogate pair as two 3-byte
> > utf8 sequences.  I think it should either spit out an error or (I
> > think this is better -- "be forgiving in what you accept") recognize
> > the surrogate pair and spit out a 4-byte utf8 sequence.  Note that in
> > 2-byte mode, this same string literal can be marshalled and
> > unmarshalled just fine!
> 
> That was actually the same problem as with the test case: the UTF-8
> encoder would not use the surrogate code in wide mode. I've removed
> that restriction, so this test now also passes.

Thanks again!
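
A rough sketch of the arithmetic involved, for readers following along:
a high/low surrogate pair is folded into one code point, which is then
written as a single 4-byte UTF-8 sequence.  The function names are
made up for this example and are not the names used in the codec
source; the real encoder is written in C.

```python
def combine_surrogates(high, low):
    """Fold a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_utf8_4byte(cp):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte UTF-8 range")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

cp = combine_surrogates(0xD83D, 0xDE00)   # 0x1F600
print(hex(cp), encode_utf8_4byte(cp))     # four bytes: f0 9f 98 80
```

Encoding each surrogate separately would instead yield two 3-byte
sequences, which is the behaviour complained about above.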

> > Or should we change the marshalling format to do something that's more
> > transparent?  It feels uncomfortable that in 2-byte mode we can easily
> > create unicode strings containing illegal sequences (e.g. lone
> > surrogates), but these strings can't be marshalled.  
> 
> You mean, they cannot be unmarshalled? With the current code,
> marshalling them works fine...

Yes.
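
To make the asymmetry concrete, here is a small sketch using a modern
Python 3 interpreter, which is an assumption: it is not the 2001 code
base, and the "surrogatepass" error handler did not exist then.  It
shows that a lone surrogate can be written out as a 3-byte sequence,
while a strict UTF-8 decoder refuses to read those bytes back.

```python
lone = "\ud800"                      # a lone high surrogate

# A permissive encoder writes the surrogate as a 3-byte sequence.
raw = lone.encode("utf-8", "surrogatepass")

# A strict decoder rejects those same bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A matching permissive decoder restores the original string.
restored = raw.decode("utf-8", "surrogatepass")
print(raw, strict_ok, restored == lone)
```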

> There was another problem with the unicode database; the code assumed
> that adding two Py_UNICODE values would wrap around at 65536. With
> that fixed and committed, the test suite passes for me.

Wow.  And for both versions, too!
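
A sketch of the kind of assumption that breaks, simulated in Python
with explicit masks.  This is not the actual unicode database code,
and the function names are hypothetical; it only models unsigned
addition in a 16-bit versus a 32-bit Py_UNICODE.

```python
UCS2_MASK = 0xFFFF
UCS4_MASK = 0xFFFFFFFF

def add_narrow(a, b):
    """Model addition in a 16-bit type: wraps around at 65536."""
    return (a + b) & UCS2_MASK

def add_wide(a, b):
    """Model addition in a 32-bit type: no wrap at 65536."""
    return (a + b) & UCS4_MASK

a, b = 0xFFF0, 0x0020
print(hex(add_narrow(a, b)))             # 0x10
print(hex(add_wide(a, b)))               # 0x10010
# Code that relies on the wrap must mask explicitly in wide mode.
print(hex(add_wide(a, b) & UCS2_MASK))   # 0x10
```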

Are there any open issues left?  A list of those would help!  Some I
can think of:

- Marc-Andre's message
- disable Unicode entirely with a configuration switch
- documentation
- marshalling UCS-2 strings containing lone surrogates

Anything else?

--Guido van Rossum (home page: http://www.python.org/~guido/)