[I18n-sig] Python Support for "Wide" Unicode characters

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 20:43:44 -0400


> I don't suppose that anyone has actually considered just using a 24-bit  
> scalar type?  What would be the downside to doing so?
> 
> 	Rick

Because of alignment requirements and the general absence of a
3-byte integral type in C, you can't extract a 24-bit integer given
its address without doing something like two shifts and two OR
operations.  For mostly the same reasons you also can't declare arrays
of 3-byte integers, so you'd have to do all your address arithmetic
yourself.
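
To make the cost concrete, here is a minimal sketch (mine, not from
the original message) of what reading element i out of a packed
3-byte array would look like, assuming little-endian byte order
within each element.  Note the hand-rolled address arithmetic and the
shift/OR dance:

    #include <stddef.h>

    typedef unsigned char byte;

    /* Read the i-th 24-bit code point out of a packed byte buffer.
       There is no 3-byte integral type, so the compiler can't do this
       for us: we scale the index by 3 ourselves and rebuild the value
       with shifts and ORs. */
    static unsigned long
    get_char24(const byte *buf, size_t i)
    {
        const byte *p = buf + 3 * i;           /* manual address arithmetic */
        return (unsigned long)p[0]
             | ((unsigned long)p[1] << 8)      /* two shifts ...            */
             | ((unsigned long)p[2] << 16);    /* ... and two ORs           */
    }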

While none of this makes it impossible, it makes it impractical,
because every place in the code that indexes or declares a Py_UNICODE
array would have to be patched.  The elegance of the 4-byte approach
is that almost all code continues to work without changes.
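
By contrast, with a 4-byte element type plain indexing keeps working.
A small illustration (the typedef here is illustrative only; the real
definition lives in Python's headers):

    #include <stddef.h>

    /* Illustrative only: with a 32-bit element type, existing code
       that indexes Py_UNICODE arrays compiles and runs unchanged. */
    typedef unsigned long Py_UNICODE;   /* at least 32 bits in C89 */

    static size_t
    count_char(const Py_UNICODE *s, size_t len, Py_UNICODE ch)
    {
        size_t i, n = 0;
        for (i = 0; i < len; i++)
            if (s[i] == ch)             /* ordinary indexing, no shifts */
                n++;
        return n;
    }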

(Technically, it's the "smallest integral type containing at least 32
bits" approach.  C guarantees there always is such a type, since long
is guaranteed to be at least 32 bits.  I suppose we could try to be
exact and use the "smallest integral type containing at least 21 bits"
approach, but it wouldn't make a difference on current practical
hardware.  It would have 20 years ago, when machines with 24 or 28
bits per word were common. :-)
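
For what it's worth, a sketch of how the "smallest integral type
containing at least 32 bits" choice could be spelled with <limits.h>;
the type name is made up for illustration and this is not Python's
actual configuration machinery:

    #include <limits.h>

    /* Pick the smallest standard unsigned type that holds 32 bits.
       C89 only guarantees that long has at least 32; int often does
       too on modern machines. */
    #if UINT_MAX >= 0xFFFFFFFFUL
    typedef unsigned int illustrative_unicode_t;
    #else
    typedef unsigned long illustrative_unicode_t;   /* guaranteed >= 32 bits */
    #endif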

--Guido van Rossum (home page: http://www.python.org/~guido/)