[Python-Dev] Hindsight on Py_UNICODE_WIDE?

Fri Mar 23 20:04:41 CET 2007

On 2007-03-23 19:18, Jason Orendorff wrote:
> Scheme is adding Unicode support in an upcoming standard:
> (DRAFT) http://www.r6rs.org/document/lib-html/r6rs-lib-Z-H-3.html
> 
> I have two questions for the python-dev team about Python's Unicode
> experiences.  If it's convenient, please take a moment to reply.
> Thanks in advance.
> 
> 1.  In hindsight, what do you think about PEP 261, the Py_UNICODE_WIDE
> build option?  On balance, has this been good, bad, or indifferent?
> What's good/bad about it?

Having narrow and wide builds introduces a level of complexity
that seems unnecessary. Few people ever use non-BMP code points
and the ones who do can easily get away with UTF-16 surrogates.
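
For readers who haven't dealt with surrogates before, here is a minimal sketch of the arithmetic. The function names are made up for illustration and are not part of any Python API:

```python
def to_surrogates(cp):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000              # 20 bits remain to be encoded
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
pair = to_surrogates(0x1D11E)
print([hex(unit) for unit in pair])
```

A narrow build stores such a character as two 16-bit code units, so code that indexes or slices strings has to take care not to split a pair.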

Most Unixes have chosen UCS-4 as their storage format, so
you have little choice if you want to take advantage of mapping
directly to wchar_t on Unix.

Windows has chosen UTF-16 as its internal storage format, and
wchar_t is 16 bits wide on that platform.
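
If you want to check what a given platform does, a quick sketch like the following should work. It assumes the standard ctypes module is available and simply reports the size of the C wide character type; the result depends on the platform, so no output is shown:

```python
import ctypes

# Size in bytes of the C wchar_t type on the platform running this script.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is %d bits wide here" % (wchar_bytes * 8))
```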

You may also want to consider looking at PEP 263:

   http://www.python.org/dev/peps/pep-0263

Source code encoding is a great thing! You can now write native
Unicode in Python source code.

The only downside is the extra complexity caused by the fact
that the tokenizer in Py2 works on 8-bit characters. For this reason
we had to decode the source code to Unicode, encode it to UTF-8,
pass that to the tokenizer, and then decode the UTF-8 contents of
Unicode string literals back into Unicode again.
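
The round trip described above can be sketched roughly as follows. This is only an illustration, not the actual tokenizer code: the sample source line and the Latin-1 encoding are assumptions, and finding the literal by searching for quote characters is a stand-in for real tokenizing:

```python
# A source line as bytes on disk, here assumed to be Latin-1 encoded.
source_bytes = 'name = u"Gr\xfc\xdfe"\n'.encode("latin-1")

# Step 1: decode the file using its declared source encoding.
text = source_bytes.decode("latin-1")

# Step 2: re-encode as UTF-8, the form an 8-bit tokenizer can work on.
utf8_source = text.encode("utf-8")

# Step 3: the tokenizer sees the string literal as a run of UTF-8 bytes.
start = utf8_source.index(b'"') + 1
end = utf8_source.rindex(b'"')
literal_bytes = utf8_source[start:end]

# Step 4: decode the literal's bytes back into Unicode.
literal = literal_bytes.decode("utf-8")
print(repr(literal))
```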

Ideally, the tokenizer in Py3k should be rewritten to work directly on
Unicode.

> 2.  The idea of multiple string representations has come up (that is,
> where all strings are Unicode, but in memory some are 8-bit, some
> 16-bit, and some 32-bit--each string uses the narrowest possible
> representation).  This has been discussed here for Python 3000.  My
> question is:  Is this for real?  How far along is it?  How likely is
> it?

My suggestion for Scheme is not to go down that route. It adds
complexity for little added value and also makes the implementation
slower (due to the frequent conversion from one internal format
to another).
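
To make the conversion cost concrete, here is a toy sketch of picking the narrowest width per string. The helper names are hypothetical and this is not how any particular implementation does it:

```python
def narrowest_width(text):
    """Bytes per character needed to store text in a fixed-width form."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1      # every character fits in 8 bits
    if top < 0x10000:
        return 2      # every character fits in 16 bits
    return 4          # at least one non-BMP character


def concat_width(a, b):
    """Width of a + b: the wider operand wins."""
    return max(narrowest_width(a), narrowest_width(b))


ascii_part = "abc"
wide_part = "\U0001D11E"
# Joining the two means the 8-bit string has to be widened to 32 bits.
print(narrowest_width(ascii_part), concat_width(ascii_part, wide_part))
```

Every operation that mixes strings of different widths has to widen one operand first, which is the kind of conversion I mean.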

Can't comment on Py3k - I'm out of that loop.

If you want to know more about how Unicode was added to Python 2.x
and how it can be used, I suggest you read the following:

Unicode integration (one of the first PEPs ever written :-):

   http://www.python.org/dev/peps/pep-0100

Unicode in Python:

   http://www.egenix.com/files/python/EuroPython2002-Python-and-Unicode.pdf

Designing Unicode-aware Applications in Python:

   http://www.egenix.com/files/python/EPC2006-Developing-Unicode-aware-applications-in-Python.pdf

Hope that helps,
-- 
Marc-Andre Lemburg
eGenix.com
