[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant.travis at ieee.org
Tue Feb 7 11:27:04 EST 2006
Gerard Vermeulen wrote:
>>While I agree that this solution is more consistent, I must say that
>>I'm not very comfortable with having to deal with two different widths
>>for unicode characters.
>>
Python itself hands us this difference. Is it really so different from
the fact that Python integers are either 32-bit or 64-bit depending on
the platform?
Perhaps what this is telling us is that we do indeed need another
data-type for 4-byte unicode. It's how we solve the problem of 32-bit
or 64-bit integers (we have a 64-bit integer on all platforms).
Then in NumPy we can support going back and forth between UCS-2 (which
we can then say is UTF-16) and UCS-4.
The issue with saving to disk is really one of encoding anyway. So, if
PyTables wants to do this correctly, then it should be using a
particular encoding.
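
To make the encoding point concrete, here is a minimal sketch, assuming a
modern Python 3 where str is a sequence of code points. The helper names
save_text and load_text are hypothetical (they are not PyTables or NumPy
functions), and UTF-8 is only one possible choice of on-disk encoding.

```python
import io


def save_text(stream, text, encoding="utf-8"):
    """Write text as bytes in a fixed, named encoding (hypothetical helper).

    The bytes written depend only on the encoding, not on how the
    interpreter stores the string internally.
    """
    data = text.encode(encoding)
    stream.write(data)
    return len(data)


def load_text(stream, encoding="utf-8"):
    """Read all bytes from the stream and decode them with the same encoding."""
    return stream.read().decode(encoding)


# Round trip through an in-memory "file"; the sample string contains one
# character outside plane 0 (U+1D11E, MUSICAL SYMBOL G CLEF).
buf = io.BytesIO()
original = "na\u00efve \U0001D11E"
save_text(buf, original)
buf.seek(0)
restored = load_text(buf)
```

As long as the writer and the reader agree on the encoding, the width used
in memory on either side does not show up in the file.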
The internal representation of Unicode should not technically matter as
it's only input and output that is important.
I won't support requiring a UCS-4 build of Python, though. That's too
stringent. Most characters in common use are contained in plane 0 (the
Basic Multilingual Plane), which is what UCS-2 covers. For the additional
characters (code points are only defined up to 0x0010FFFF), surrogate
pairs can be used.
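
For reference, the surrogate-pair arithmetic can be sketched as follows.
This is just the standard UTF-16 mapping written out in Python; the
function names to_surrogate_pair and from_surrogate_pair are made up for
illustration and are not part of Python or NumPy.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) becomes the pair 0xD834, 0xDD1E.
pair = to_surrogate_pair(0x1D11E)
```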
I think the best solution is to define separate UCS4 and UCS2 data-types
and handle conversion between them using the casting functions. This
is a bit of work to implement, but not too bad...
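
A rough sketch of what such a conversion has to do, written in plain
Python rather than as actual NumPy casting functions (which did not exist
when this was written). It assumes Python 3, uses lists of integers to
stand in for the two element types, and the function names are
hypothetical.

```python
import struct


def ucs4_to_utf16_units(code_points):
    """Convert a list of UCS-4 code points to a list of 16-bit code units."""
    text = "".join(chr(cp) for cp in code_points)
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def utf16_units_to_ucs4(units):
    """Convert a list of 16-bit code units back to UCS-4 code points."""
    raw = struct.pack("<%dH" % len(units), *units)
    text = raw.decode("utf-16-le")  # surrogate pairs are merged here
    return [ord(ch) for ch in text]


# Two code points become three 16-bit units, because U+1D11E needs a
# surrogate pair in the 2-byte representation.
units = ucs4_to_utf16_units([0x61, 0x1D11E])
back = utf16_units_to_ucs4(units)
```

Note that the number of elements changes across the conversion, so a cast
between the two types has to decide how to handle item length as well as
item width.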
>Wouldn't it be possible that numpy takes care of the "surrogate pairs"
>when transferring unicode strings from UCS2-interpreters to UCS4-ndarrays
>and vice-versa?
>
>It would be nice to be able to cast explicitly between UCS2- and UCS4- arrays,
>too.
>
>Requesting users to recompile their Python is a rather brutal solution :-)
>
>
I agree. I much prefer an additional data-type since that is, after all,
what UCS2 and UCS4 are... different data-types.
-Travis