[Numpy-discussion] Extent of unicode types in numpy

Tue Feb 7 11:27:04 EST 2006

Gerard Vermeulen wrote:

>>While I agree that this solution is more consistent, I must say that
>>I'm not very comfortable with having to deal with two different widths
>>for unicode characters. 
>>
Python itself hands us this difference.  Is it really so different from 
the fact that Python integers are either 32-bit or 64-bit depending on 
the platform?

Perhaps what this is telling us is that we do indeed need another 
data-type for 4-byte unicode.  It's how we solve the problem of 32-bit 
or 64-bit integers (we have a 64-bit integer on all platforms).

Then in NumPy we can support going back and forth between UCS-2 (which 
we can then say is UTF-16) and UCS-4.
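
To illustrate what "going back and forth" means at the level of code 
units, here is a minimal sketch.  It assumes a modern Python 3 
interpreter (where str holds full Unicode regardless of build) and uses 
only the standard codecs; the helper names are made up for this example 
and are not NumPy functions.

```python
import struct

def utf16_units(text):
    """Return the 2-byte (UTF-16) code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def ucs4_units(text):
    """Return the 4-byte (UCS-4 / UTF-32) code units of text as a list of ints."""
    data = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))

def text_from_utf16_units(units):
    """Rebuild the text from a list of UTF-16 code units."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")

# 'a', the euro sign, and a musical G clef (U+1D11E, outside plane 0)
sample = "a\u20ac\U0001d11e"

narrow = utf16_units(sample)   # 4 units: the clef needs a surrogate pair
wide = ucs4_units(sample)      # 3 units: one per character
```

The same three characters take four 2-byte units but only three 4-byte 
units, which is why the two widths behave like different data-types.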

The issue with saving to disk is really one of encoding anyway.  So, if 
PyTables wants to do this correctly, then it should be using a 
particular encoding.

The internal representation of Unicode should not technically matter as 
it's only input and output that is important.
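
As a sketch of the point about encodings: if the bytes written out are 
always in one agreed encoding, the reader does not need to know how wide 
the writer's internal characters were.  The functions below are 
hypothetical helpers for illustration only (they are not PyTables or 
NumPy API), and the choice of UTF-8 with a 4-byte length prefix is an 
assumption of the example.

```python
import io

def save_text(stream, text, encoding="utf-8"):
    """Write text to a binary stream as a length-prefixed encoded payload."""
    payload = text.encode(encoding)
    stream.write(len(payload).to_bytes(4, "little"))  # payload size in bytes
    stream.write(payload)

def load_text(stream, encoding="utf-8"):
    """Read back one length-prefixed payload and decode it."""
    size = int.from_bytes(stream.read(4), "little")
    return stream.read(size).decode(encoding)

# Round trip through an in-memory "file"
buffer = io.BytesIO()
original = "a\u20ac\U0001d11e"
save_text(buffer, original)
buffer.seek(0)
restored = load_text(buffer)
```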

I won't support requiring a UCS-4 build of Python, though.  That's too 
stringent.  Most characters are contained within plane 0 (the Basic 
Multilingual Plane), which UCS-2 covers directly.  For the additional 
characters (code points are only defined up to 0x0010FFFF), surrogate 
pairs can be used.
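
For reference, the surrogate-pair arithmetic defined by UTF-16 can be 
sketched as follows.  The function names are made up for this example; 
only the constants (0x10000, 0xD800, 0xDC00) come from the UTF-16 
definition.

```python
def to_surrogate_pair(code_point):
    """Split a code point above plane 0 into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E (musical G clef) becomes the pair (0xD834, 0xDD1E)
pair = to_surrogate_pair(0x1D11E)
```

Because 0x10FFFF maps to the last possible pair (0xDBFF, 0xDFFF), every 
defined code point fits in at most two 2-byte units.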

I think the best solution is to define separate UCS4 and UCS2 data-types 
and handle conversion between them using the casting functions.   This 
is a bit of work to implement, but not too bad...

>Wouldn't it be possible that numpy takes care of the "surrogate pairs"
>when transferring unicode strings from UCS2-interpreters to UCS4-ndarrays
>and vice-versa?
>
>It would be nice to be able to cast explicitly between UCS2- and UCS4- arrays,
>too.
>
>Requesting users to recompile their Python is a rather brutal solution :-)
>  
>
I agree.  I much prefer an additional data-type since that is, after all, 
what UCS2 and UCS4 are: different data-types.

-Travis