[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant at ee.byu.edu
Mon Feb 6 17:28:19 EST 2006
Tim Hochberg wrote:
>> Right now, the typestring value gives the number of bytes in the
>> type. Thus, "U4" gives dtype("<U8") on my system where
>> sizeof(Py_UNICODE)==2, but on another system it could give
>> dtype("<U16").
>> I know only a little bit about unicode. A full Unicode code point
>> fits in a 4-byte entity, but there are standard variable-width
>> encodings built on 2-byte (UTF-16) and 1-byte (UTF-8) code units.
>>
>> I changed the source so that ("<U8") gets interpreted the same as
>> "U4" (i.e. if you specify an endianness then you are being
>> byte-conscious anyway, so the number is interpreted as a byte count;
>> otherwise the number is interpreted as a character length). This
>> fixes issues on the same platform, but does not fix issues where
>> data is saved out by one Python interpreter and read in by another
>> with a different value of sizeof(Py_UNICODE).
>
>
> This sounds like a mess. I'm not sure what the level of Unicode
> expertise is on this list (I certainly don't add to it), but I'd be
> tempted to raise this issue on PythonDev and see if anyone there has
> any good suggestions.
>
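As a concrete illustration of the platform dependence under discussion, here is a minimal sketch. Note that modern NumPy builds store unicode as 4-byte UCS-4, so "U4" has a 16-byte itemsize; on the 2-byte builds described above the factor would be sizeof(Py_UNICODE)==2. Round-tripping the data through a fixed-width, fixed-endian encoding (UTF-32-LE is an assumed choice here) is one way to sidestep the interpreter-dependent in-memory layout:

```python
import numpy as np

# How wide is a unicode element on this build?
# "U4" means 4 characters; itemsize = characters * bytes-per-character.
dt = np.dtype("U4")
print(dt.itemsize)  # 16 on UCS-4 builds of NumPy

# Sketch: serialize unicode data through a fixed-width, fixed-endian
# encoding (UTF-32-LE, an assumed choice) so the byte stream does not
# depend on sizeof(Py_UNICODE) of the writing interpreter.
arr = np.array(["abcd", "wxyz"], dtype="U4")
raw = b"".join(s.encode("utf-32-le") for s in arr)

# Reader side: decode 16-byte (4-character) chunks back into an array.
restored = np.array(
    [raw[i:i + 16].decode("utf-32-le") for i in range(0, len(raw), 16)],
    dtype="U4",
)
print(restored.tolist())
```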
I'm not a unicode expert, but I have read up on it, so I think I at least
understand the issues involved.
> I'm way out of my depth here, but it really sounds like there needs to
> be one descriptor for each type. Just for example "U" could be 2-byte
> unicode and "V" (assuming it's not taken already) could be 4-byte
> unicode. Then the size for a given descriptor would be constant and
> things would be much less confusing.
>
This is what I'm currently thinking. The question is whether we would
have to define a new basic data-type for 4-byte unicode or whether we
could just handle this on input. Would we also define a 1-byte unicode
data-type, or just let the user deal with that using standard strings
and encoding, as is currently done in Python?
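On the 1-byte question, a hedged sketch of how a user can already handle it today with byte strings plus an explicit encoding, using NumPy's "S" (bytes) dtype; the encoding choice here (UTF-8) and the field width are my assumptions:

```python
import numpy as np

# Sketch: instead of a dedicated 1-byte unicode dtype, store encoded
# byte strings in an "S" (bytes) array and decode on the way out.
texts = ["abc", "def"]

# Encode to UTF-8 (assumed encoding) and store as fixed-width bytes.
arr = np.array([t.encode("utf-8") for t in texts], dtype="S8")

# Decode back to Python unicode strings when needed; NumPy strips the
# trailing NUL padding when elements are read out.
decoded = [b.decode("utf-8") for b in arr]
print(decoded)
```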
-Travis