[Python-bugs-list] [ python-Bugs-405227 ] sizeof(Py_UNICODE)==2 ????

noreply@sourceforge.net noreply@sourceforge.net
Sun, 17 Jun 2001 12:57:23 -0700


Bugs item #405227, was updated on 2001-03-01 11:21
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=405227&group_id=5470

Category: Unicode
Group: Platform-specific
Status: Open
Resolution: Postponed
Priority: 5
Submitted By: Jon Saenz (jsaenz)
Assigned to: M.-A. Lemburg (lemburg)
Summary: sizeof(Py_UNICODE)==2 ????

Initial Comment:
We are trying to install Python 2.0 on a Cray T3E.

After a painful process of removing several modules
which produce some errors (mmap, sha, md5), we get core
dumps when we run python because under this platform,
there does not exist a 16-bit numeric type. Unsigned
short is 4 bytes long.

We have finally defined unicode objects as unsigned
short, even though they are 4 bytes long, and we have
changed a statement in 
Objects/unicodeobject.c
to:
if (sizeof(Py_UNICODE) != sizeof(unsigned short)) {

It compiles and runs now, but the test on regular
expressions crashes and the whole regression test does,
too.

Support of Unicode for this platform is not correct in
version 2.0 of Python.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 12:57

Message:
Logged In: YES 
user_id=38388

The codecs are full of things like:

            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000)) {
                errmsg = "illegal encoding";
                goto utf8Error;
            }

where ch is a Py_UNICODE character.

The other "problem" is that pointer dereferencing is used a
lot in the code (using arrays of Py_UNICODE chars). We could
probably shift the calculations to Py_UCS4 integers and then
only do the data buffer access with Py_UNICODE which would
then be mapped to a 2-char-array to get the data buffer
layout right.
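
A minimal sketch of that idea, not code from the Python sources: the
arithmetic is done in a wide integer and only the buffer access goes
through a 2-char array. The names ucs4_t, unit16, unit16_store,
unit16_load and decode_utf8_3 are hypothetical, and the sketch assumes
8-bit chars and a big-endian layout inside the array. It does not
validate the lead or continuation bytes.

```c
/* Hypothetical stand-in for Py_UCS4: unsigned long is at least 32 bits. */
typedef unsigned long ucs4_t;

/* Hypothetical storage unit: two chars, so the buffer layout stays
   16 bits per character (assuming 8-bit chars). */
typedef struct { unsigned char b[2]; } unit16;

/* Write the low 16 bits of ch into the 2-char array, high byte first. */
static void unit16_store(unit16 *u, ucs4_t ch)
{
    u->b[0] = (unsigned char)((ch >> 8) & 0xFF);
    u->b[1] = (unsigned char)(ch & 0xFF);
}

/* Read the 16-bit value back into a wide integer. */
static ucs4_t unit16_load(const unit16 *u)
{
    return ((ucs4_t)u->b[0] << 8) | (ucs4_t)u->b[1];
}

/* Decode one 3-byte UTF-8 sequence using only wide-integer arithmetic.
   Returns 0 on success, -1 for an illegal encoding (same range test as
   the codec snippet quoted above). */
static int decode_utf8_3(const unsigned char *s, unit16 *out)
{
    ucs4_t ch = ((ucs4_t)(s[0] & 0x0f) << 12)
              + ((ucs4_t)(s[1] & 0x3f) << 6)
              + (ucs4_t)(s[2] & 0x3f);
    if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000))
        return -1;
    unit16_store(out, ch);
    return 0;
}
```

Because nothing here depends on the width of unsigned short, the same
code should behave identically on a platform whose smallest integer
type beyond char is 4 bytes.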

Still, I think this is low priority. Patches are welcome of
course :-)

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-17 12:44

Message:
Logged In: YES 
user_id=31435

Point me to one of the calculations that's thought to be a 
problem, and happy to suggest something (I didn't find one 
on my own, but I'm not familiar with the details here).  
BTW, I reopened this because we got another report of T3E 
woes on c.l.py that day.

You certainly need at least 16 bits, but it's hard to see 
how having more than that could be a genuine problem -- at 
worst "this kind of thing" usually requires no more than 
masking with 0xffff at the end.  That can be hidden in a 
macro that's a nop on platforms that don't need it, if 
micro-efficiency is a concern.
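
A rough sketch of such a macro, with hypothetical names (UCS2_T,
UCS2_MASK and ucs2_add are illustrative and do not come from the Python
sources). It assumes the result is always stored back into the
character type after masking.

```c
#include <limits.h>

/* Hypothetical character type: at least 16 bits, possibly wider. */
typedef unsigned short UCS2_T;

#if USHRT_MAX == 0xFFFF
/* Exactly 16 bits: storing into UCS2_T already truncates, so no-op. */
#define UCS2_MASK(x) (x)
#else
/* Wider type: explicitly discard everything above bit 15. */
#define UCS2_MASK(x) ((x) & 0xFFFFU)
#endif

/* Example calculation: add two code units with 16-bit wraparound.
   The sum is computed in unsigned long, masked, then narrowed. */
static UCS2_T ucs2_add(UCS2_T a, UCS2_T b)
{
    unsigned long sum = (unsigned long)a + (unsigned long)b;
    return (UCS2_T)UCS2_MASK(sum);
}
```

On a platform with a true 16-bit unsigned short the macro costs
nothing; on one without, it adds a single AND per calculation.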

Often even that isn't needed.  For example, binascii_crc32 
absolutely must compute a 32-bit checksum, but works fine 
on platforms with 8-byte longs.  The only "trick" needed to 
make that work was to compute the complement via

crc ^ 0xFFFFFFFFUL

instead of via

~crc
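
To illustrate the difference (a sketch with hypothetical function
names, not the binascii source): with a 32-bit value held in an
unsigned long, the XOR form gives the same answer whether long is 4 or
8 bytes, while ~ also flips any bits above bit 31.

```c
/* 32-bit complement of a value that fits in 32 bits. The result is
   the same whether unsigned long is 32 or 64 bits wide. */
static unsigned long complement32_portable(unsigned long crc)
{
    return crc ^ 0xFFFFFFFFUL;
}

/* Naive complement: correct on a 32-bit long, but on a 64-bit long it
   also sets bits 32..63, so the result no longer fits in 32 bits. */
static unsigned long complement32_naive(unsigned long crc)
{
    return ~crc;
}
```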


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 11:47

Message:
Logged In: YES 
user_id=38388

It may be a design error, but getting this right for all
platforms is hard and by choosing the 16-bit type we managed
to handle 95% of all platforms in a fast and reliable way.

Any idea how we could "emulate" a 16-bit integer type? We
need the integer type because we do calculations on the
values.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-13 22:28

Message:
Logged In: YES 
user_id=31435

I opened this again.  It's simply unacceptable to require 
that the platform have a 2-byte integer type.  That doesn't 
mean it's easy to fix, but it's a design error all the same.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-03-16 11:27

Message:
Logged In: YES 
user_id=38388

The current Unicode implementation needs Py_UNICODE to
be a 16-bit entity and so does SRE.

To get this to work on the Cray, you could try to use a 2-char
struct which is then cast to a short in all those places which
assume a 16-bit number representation.

Simply using a 4-byte entity as basis will not work, since
the fact that Py_UNICODE fits into 2 bytes is hard-coded
into the implementation in a number of places.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-03-01 15:29

Message:
Logged In: YES 
user_id=31435

Notes:

+ Python was ported to T3E last year, IIRC by Marc Poinot.  
May want to track him down.

+ Python's Unicode support doesn't rely on any platform 
Unicode support.  Whether it's "useless" depends on the 
user, not the platform.

+ Face it <wink>:  Crays are the only platforms that don't 
have a native 16-bit integer type.

+ Even so, I believe at least SRE is happy to work with 
32-bit Unicode (glibc's wchar_t is 4 bytes, IIRC), so that 
much was likely a shallow problem.


----------------------------------------------------------------------

Comment By: Jon Saenz (jsaenz)
Date: 2001-03-01 15:09

Message:
Logged In: YES 
user_id=12122

We have finally given up on installing Python on the Cray T3E
due to its lack of support for shared objects. This causes
difficulties in the loading of different external libraries
(Numeric, Lapack, and so on) because of the static linking.

In any case, we still think that this "bug" should be
repaired. There may be other platforms which:
1) Do not support Unicode, so that the Unicode feature of
Python is useless in these cases.
2) The users may be interested in using Python in them (for
Numeric applications, for instance)
3) May not have a 16-bit native numerical type.

Under these circumstances, the current version of Python
cannot be used.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-03-01 14:05

Message:
Logged In: YES 
user_id=3066

Marc-Andre, can you deal with the general Unicode issues here and then pass this along to Fredrik for SRE updates?  (Or better yet, work in parallel?)

Thanks!

----------------------------------------------------------------------
