[I18n-sig] Support for "wide" Unicode characters

M.-A. Lemburg mal@lemburg.com
Thu, 28 Jun 2001 15:11:04 +0200


Guido van Rossum wrote:
> 
> > >     There is a new (experimental) define:
> > >
> > >         #define PY_UNICODE_SIZE 2
> >
> > Doesn't sizeof(Py_UNICODE) do the same ?
> 
> Not on a Cray!  And not in the C standard.  Ask Tim. :-)

Ah, OK... nice sofas these Crays, BTW ;-)
 
> > This introduces an incompatibility between narrow and wide
> > builds at run-time. PYC should not be harmed by this since they
> > store Unicode strings using UTF-8.
> 
> Does UTF-8 transfer isolated surrogates correctly?  I think that's
> necessary, otherwise I can't marshal or unmarshal literals containing
> those, which means that .pyc files for .py files containing those
> can't be read (or maybe aren't portable between wide and narrow
> interpreters).

It handles surrogate pairs correctly, but rejects isolated surrogates
on input (easy to fix though) and passes them through on output. As I
said before, surrogate support is far from being complete.
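
To make the asymmetry concrete, here is a minimal Python sketch of the
behaviour described above. It is not the codec's actual code: the
helper names and the allow_isolated_surrogates flag are hypothetical,
and code points are modelled as plain integers. The encoder writes a
lone surrogate as an ordinary three-byte sequence, while the decoder
refuses that same sequence unless told otherwise.

```python
def encode_3byte(cp):
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper: surrogates (U+D800..U+DFFF) are not checked
    for, so an isolated surrogate is passed through on output.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the 3-byte range")
    return bytes([
        0xE0 | (cp >> 12),          # lead byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # continuation: middle 6 bits
        0x80 | (cp & 0x3F),         # continuation: low 6 bits
    ])


def decode_3byte(seq, allow_isolated_surrogates=False):
    """Decode one three-byte UTF-8 sequence to a code point.

    Hypothetical helper: by default an isolated surrogate is rejected
    on input; the flag shows the "easy fix" of letting it through.
    """
    b0, b1, b2 = seq
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if cp < 0x0800:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF and not allow_isolated_surrogates:
        raise ValueError("isolated surrogate U+%04X" % cp)
    return cp
```

With these helpers, encode_3byte(0xD800) succeeds, but feeding the
result back into decode_3byte() raises ValueError unless
allow_isolated_surrogates=True is passed, which is the round-trip
problem for marshalled literals.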
 
> Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate
> pairs and encoding them as one Unicode character, since the decoder
> generates surrogates for non-BMP characters on a narrow platform.

That's what it currently does.
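
For reference, the pairing arithmetic can be sketched in Python as
follows. This only illustrates the standard surrogate formulas, not
the codec source, and the function names are hypothetical. The encoder
side combines a hi+lo pair into one code point and emits four bytes;
the decoder side on a narrow build does the reverse split.

```python
def combine_surrogates(hi, lo):
    """Combine a high/low surrogate pair into one non-BMP code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def split_to_surrogates(cp):
    """Split a non-BMP code point into (hi, lo), as a narrow build would."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def encode_4byte(cp):
    """Encode a non-BMP code point as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte range")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),  # continuation bytes, 6 bits each
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, the pair 0xD834, 0xDD1E combines to U+1D11E, which is
written as the single four-byte sequence F0 9D 84 9E rather than as
two three-byte surrogate sequences.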

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/