[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant at ee.byu.edu
Mon Feb 6 17:28:19 EST 2006
Tim Hochberg wrote:
>> Right now, the typestring value gives the number of bytes in the
>> type. Thus, "U4" gives dtype("<U8") on my system where
>> sizeof(Py_UNICODE)==2, but on another system it could give
>> dtype("<U16").
>> I know only a little bit about unicode. A full Unicode code point
>> fits in a 4-byte entity, but there are standard variable-width
>> encodings built on 2-byte (UTF-16) and 1-byte (UTF-8) code units.
>>
>> I changed the source so that ("<U8") gets interpreted the same as
>> "U4" (i.e. if you specify an endianness then you are being
>> byte-conscious anyway, so the number is interpreted as a byte count;
>> otherwise the number is interpreted as a character length). This
>> fixes issues on the same platform, but does not fix issues where
>> data is saved out by one Python interpreter and read in by another
>> with a different value of sizeof(Py_UNICODE).
>
>
> This sounds like a mess. I'm not sure what the level of Unicode
> expertise is on this list (I certainly don't add to it), but I'd be
> tempted to raise this issue on PythonDev and see if anyone there has
> any good suggestions.
>
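As a concrete illustration of the platform dependence under discussion, here is a minimal sketch. Note that modern NumPy builds store unicode as 4-byte UCS-4, so "U4" has a 16-byte itemsize; on the 2-byte builds described above the factor would be sizeof(Py_UNICODE)==2. Round-tripping the data through a fixed-width, fixed-endian encoding (UTF-32-LE is an assumed choice here) is one way to sidestep the interpreter-dependent in-memory layout:

```python
import numpy as np

# How wide is a unicode element on this build?
# "U4" means 4 characters; itemsize = characters * bytes-per-character.
dt = np.dtype("U4")
print(dt.itemsize)  # 16 on UCS-4 builds of NumPy

# Sketch: serialize unicode data through a fixed-width, fixed-endian
# encoding (UTF-32-LE, an assumed choice) so the byte stream does not
# depend on sizeof(Py_UNICODE) of the writing interpreter.
arr = np.array(["abcd", "wxyz"], dtype="U4")
raw = b"".join(s.encode("utf-32-le") for s in arr)

# Reader side: decode 16-byte (4-character) chunks back into an array.
restored = np.array(
    [raw[i:i + 16].decode("utf-32-le") for i in range(0, len(raw), 16)],
    dtype="U4",
)
print(restored.tolist())
```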
I'm not a unicode expert, but I have read up on it, so I think I at least
understand the issues involved.
> I'm way out of my depth here, but it really sounds like there needs to
> be one descriptor for each type. Just for example "U" could be 2-byte
> unicode and "V" (assuming it's not taken already) could be 4-byte
> unicode. Then the size for a given descriptor would be constant and
> things would be much less confusing.
>
This is what I'm currently thinking. The question is whether we would
have to define a new basic data-type for 4-byte unicode or whether we
could just handle this on input. Would we also define a 1-byte unicode
data-type, or just let the user deal with that using standard strings
and encoding, as is currently done in Python?
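On the 1-byte question, a hedged sketch of how a user can already handle it today with byte strings plus an explicit encoding, using NumPy's "S" (bytes) dtype; the encoding choice here (UTF-8) and the field width are my assumptions:

```python
import numpy as np

# Sketch: instead of a dedicated 1-byte unicode dtype, store encoded
# byte strings in an "S" (bytes) array and decode on the way out.
texts = ["abc", "def"]

# Encode to UTF-8 (assumed encoding) and store as fixed-width bytes.
arr = np.array([t.encode("utf-8") for t in texts], dtype="S8")

# Decode back to Python unicode strings when needed; NumPy strips the
# trailing NUL padding when elements are read out.
decoded = [b.decode("utf-8") for b in arr]
print(decoded)
```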
-Travis