[Python-Dev] 2.2 Unicode questions

Andrew Kuchling akuchlin@mems-exchange.org
Wed, 18 Jul 2001 21:55:46 -0400


I've written some text on Unicode for the 2.2 article, but it's
doubtful I actually understand what's going on.  Can people who
actually understand where Unicode has been please take a look at the
following?  

First, a short one, Mark Hammond's patch for supporting MBCS on
Windows.  I trust everyone can handle a little bit of TeX markup?

  % XXX is this explanation correct?  
  \item When presented with a Unicode filename on Windows, Python will
  now correctly convert it to a string using the MBCS encoding.
  Filenames on Windows are a case where Python's choice of ASCII as
  the default encoding turns out to be an annoyance.  

  This patch also adds \samp{et} as a format sequence to
  \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
  an encoding name, and converts the parameter to the given encoding
  if it turns out to be a Unicode string, or leaves it alone if
  it's an 8-bit string, assuming it to already be in the desired
  encoding.  (This differs from the \samp{es} format character, which
  assumes that 8-bit strings are in Python's default ASCII encoding
  and converts them to the specified new encoding.)
   
  (Contributed by Mark Hammond with assistance from Marc-Andr\'e
  Lemburg.)
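
To check my own reading of the \samp{et} versus \samp{es} difference, here
is a rough model of the two conversions.  It is only a sketch, not the C
implementation: the function names convert_es and convert_et are made up
for illustration, it is written for a modern Python 3 where bytes stands
in for an 8-bit string and str for a Unicode string, and it uses latin-1
instead of MBCS so that it runs on any platform.

```python
# Hypothetical model of the two format codes; NOT the actual C code.
# Assumption: bytes plays the role of an 8-bit string, str of a Unicode string.

def convert_es(value, encoding):
    """Model of "es": 8-bit input is assumed to be ASCII, then re-encoded."""
    if isinstance(value, bytes):
        # Non-ASCII bytes cannot be interpreted as ASCII and raise an error.
        value = value.decode("ascii")
    return value.encode(encoding)


def convert_et(value, encoding):
    """Model of "et": 8-bit input is passed through untouched."""
    if isinstance(value, bytes):
        return value  # assumed to be in the target encoding already
    return value.encode(encoding)


name = "caf\u00e9.txt"                    # a Unicode filename
already_encoded = name.encode("latin-1")  # the same name as an 8-bit string

print(convert_et(name, "latin-1"))             # Unicode input gets encoded
print(convert_et(already_encoded, "latin-1"))  # 8-bit input is left alone

try:
    convert_es(already_encoded, "latin-1")
except UnicodeDecodeError:
    print("es-style conversion rejects non-ASCII 8-bit input")
```

If this model is right, \samp{et} is the one you want for filenames that
may already have been encoded by the caller.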

Second, the --enable-unicode changes:

%======================================================================
\section{Unicode Changes}

Python's Unicode support has been enhanced a bit in 2.2.  Unicode
strings are usually stored as UCS-2, as 16-bit unsigned integers.
Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
integers, as its internal encoding by supplying
\longprogramopt{enable-unicode=ucs4} to the configure script.  When
built to use UCS-4, Python could in theory handle Unicode characters
from U-00000000 to U-7FFFFFFF.  Being able to use UCS-4 internally is
a necessary step toward that, but it's not the only one, and in Python
2.2alpha1 the work isn't complete yet.  For example, the
\function{unichr()} function still only accepts values from 0 to
65535, and there's no \code{\e U} notation for embedding characters
greater than 65535 in a Unicode string literal.  All this is the
province of the still-unimplemented PEP 261, ``Support for `wide'
Unicode characters''; consult it for further details, and please offer
comments and suggestions on the proposal it describes.

% ... section on decode() deleted; on firmer ground there...

\method{encode()} and \method{decode()} were implemented by
Marc-Andr\'e Lemburg.  The changes to support using UCS-4 internally
were implemented by Fredrik Lundh and Martin von L\"owis.

\begin{seealso}

\seepep{261}{Support for `wide' Unicode characters}{PEP written by
Paul Prescod.  Not yet accepted or fully implemented.}

\end{seealso}
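
For anyone who wants to see why the 16-bit versus 32-bit storage matters
for characters above 65535, here is a small sketch.  It assumes a modern
Python 3, where chr() plays the role of \function{unichr()}, and it uses
the UTF-16 and UTF-32 codecs as stand-ins for the two internal
representations (strictly speaking the pair of 16-bit units is a UTF-16
surrogate pair, which plain UCS-2 does not define).

```python
import sys

# Assumption: modern Python 3; chr() stands in for unichr().
ch = chr(0x10000)  # first code point above 65535

utf16 = ch.encode("utf-16-be")  # sequence of 16-bit code units
utf32 = ch.encode("utf-32-be")  # sequence of 32-bit code units

units16 = len(utf16) // 2  # number of 16-bit units needed
units32 = len(utf32) // 4  # number of 32-bit units needed

print(units16, units32)     # → 2 1
print(hex(sys.maxunicode))  # largest code point this interpreter accepts
```

With 16-bit storage the character takes two units, while with 32-bit
storage it fits in one, which is the gap the UCS-4 build option is meant
to close.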

Corrections?  Thanks in advance...

--amk