[Numpy-discussion] Extent of unicode types in numpy

Wed Feb 8 00:09:10 EST 2006

On Tue, 07 Feb 2006 at 13:35 -0700, Travis Oliphant wrote:
> Sure it could be implemented.  It's just a matter of effort.  Python 
> itself always defines a Py_UCS4 type even on UCS2 builds.  We would just 
> have to make sure Py_UCS2 is always defined as well. 

Be careful with this, because you can run into problems. For example,
trying to import a numpy compiled against a UCS4 Python from a UCS2
interpreter gives me the following:

$ python
Python 2.4.2 (#1, Feb  8 2006, 08:16:44)
[GCC 4.0.3 20060115 (prerelease) (Debian 4.0.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
import core ->
failed: /usr/lib/python2.4/site-packages/numpy/core/multiarray.so:
undefined symbol: _PyUnicodeUCS4_IsWhitespace
import random -> failed: 'module' object has no attribute 'dtype'
import lib ->
failed: /usr/lib/python2.4/site-packages/numpy/core/multiarray.so:
undefined symbol: _PyUnicodeUCS4_IsWhitespace

Although I guess that this would not be a problem when using a numpy
compiled against the matching interpreter. I just wanted to point this
out.
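
A quick way to see which kind of interpreter is running, before
importing an extension built against the other kind, is to look at
sys.maxunicode. This is only a minimal sketch: the helper name
unicode_build is made up for illustration, and it assumes that
sys.maxunicode reflects the build width (0xFFFF on a UCS2 build,
0x10FFFF on a UCS4 one).

```python
import sys


def unicode_build(maxunicode=None):
    """Return 'UCS2' for a narrow build and 'UCS4' for a wide one.

    unicode_build is a hypothetical helper, not part of Python or NumPy.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # A narrow build cannot represent code points above 0xFFFF directly.
    return "UCS4" if maxunicode > 0xFFFF else "UCS2"


print(unicode_build())
```

It does not prevent the undefined-symbol error above; it only tells you
which build you are on.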

> The biggest hassle is implementing the corresponding scalar type.  The 
> one corresponding to the build for Python comes free.  The other would 
> have to be implemented directly.

Yeah, it seems that we will end up implementing a new Unicode type
entirely in NumPy, one way or another.

> I've seen data-bases handle this by warning the user to make sure the 
> size of their data area is large enough to handle their longest use 
> case.  You can still use fixed sizes; you just have to make sure they 
> are large enough (or risk truncation).

Ok. I can accept that data can be truncated (you may end up with a
corrupted Unicode string, but this is the responsibility of the
user :-(). However, another thing that I feel uncomfortable with is the
additional encoding/decoding steps that UCS2 potentially introduces
when doing I/O. Well, perhaps this is faster than I suppose and I/O
speed will not be affected too much, but still...
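
To make the truncation risk concrete, here is a small sketch (written
for Python 3) of a fixed-size field holding 16-bit units. The helper
name truncate_utf16 and the three-unit field size are assumptions
chosen for illustration, not anything NumPy does. A character outside
the BMP needs two 16-bit units, so cutting the field after the first of
them leaves data that no longer decodes.

```python
def truncate_utf16(text, max_units):
    """Encode text as little-endian UTF-16 and keep at most max_units
    16-bit units (hypothetical helper mimicking a fixed-size field)."""
    raw = text.encode("utf-16-le")
    return raw[: 2 * max_units]


s = "ab\U0001D11E"          # 'a', 'b' and one character outside the BMP
raw = truncate_utf16(s, 3)  # four units are needed, only three are kept

try:
    raw.decode("utf-16-le")
    corrupted = False
except UnicodeDecodeError:
    # The field ends with half of a surrogate pair.
    corrupted = True

print(corrupted)
```

With 32-bit units per character this particular failure cannot happen,
because truncation always falls on a character boundary.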

> >Well, I don't quite understand this. I thought that you were proposing a
> >32-bit unicode type for NumPy and then converting it appropriately to
> >UCS2 (conversion to UCS4 wouldn't be necessary as it would be the same
> >as the native NumPy unicode type) just in case the user requires a
> >scalar out of the NumPy object. But you are talking here about defining
> >separate UCS4 and UCS2 data-types. I admit that I'm lost here...
> >
> >  
> >
> I suppose that is another approach:  we could internally have all 
> UNICODE data-types use 4-bytes and do the conversions necessary.  But, 
> it would still require us to do most of the work of supporting two 
> data-types.  Currently, the unicode scalar object is a simple 
> inheritance from Python's UNICODE data-type.  That would have to change 
> and the work to do that is most of the work to support two different 
> data-types.   So, if we are going to go through that effort, I would 
> rather see the result be two different Unicode data-types supported. 

Ok. I see that you got my point. Well, maybe I'm wrong here, but my
proposal would result in implementing just one new data-type for 32-bit
unicode when the Python platform is UCS2-based. If, as you said above,
the Py_UCS4 type is always defined, even on UCS2 interpreters, that
should be relatively easy to do. So, we can base all the NumPy unicode
*arrays* on this new type. The NumPy unicode *scalars* will inherit
directly from the native Py_UCS2 type for this interpreter. Then, we
just have to implement the necessary UCS4<-->UCS2 conversions to
communicate data between the NumPy array and the scalar type. The only
drawback that I see in this approach is that you will end up with UCS4
types in numpy ndarrays and UCS2 types when getting scalars from them
(however, the user will hardly notice this, IMO). The advantage would
be that NumPy arrays will always be UCS4 regardless of the platform
they are on, making access to their data from C much easier and more
portable (and yes, efficient!).
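
The UCS4<-->UCS2 conversions could look roughly like the sketch below.
It is only an illustration in Python of what would really be C code:
the function names ucs4_to_ucs2 and ucs2_to_ucs4 are made up, and it
assumes that code points above 0xFFFF are represented on the 16-bit
side as surrogate pairs (as in UTF-16).

```python
def ucs4_to_ucs2(code_points):
    """Convert 32-bit code points to 16-bit units (hypothetical helper).

    Code points above 0xFFFF are split into a surrogate pair.
    """
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units


def ucs2_to_ucs4(units):
    """Convert 16-bit units back to 32-bit code points (hypothetical helper).

    A high surrogate followed by a low surrogate is merged; anything
    else, including a lone surrogate, is copied through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


array_data = [0x61, 0xE9, 0x1D11E]       # what the ndarray would hold
scalar_data = ucs4_to_ucs2(array_data)   # what a UCS2 scalar would hold
print([hex(u) for u in scalar_data])
```

Note that the scalar side can be longer than the array side, which is
the UCS4-in-arrays versus UCS2-in-scalars mismatch mentioned above.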

Of course, if you are using a UCS4 platform, then you can choose the
same native Py_UCS4 type for NumPy arrays and scalars and you are done.

Well, probably I've overlooked something, but I really think that this
would be a nice thing to do.

Regards,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"