[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant.travis at ieee.org
Mon Feb 6 23:17:04 EST 2006
> I'm way out of my depth here, but it really sounds like there needs to
> be one descriptor for each type. Just for example "U" could be 2-byte
> unicode and "V" (assuming it's not taken already) could be 4-byte
> unicode. Then the size for a given descriptor would be constant and
> things would be much less confusing.
In current SVN, numpy assumes 'w' is 2-byte unicode and 'W' is 4-byte
unicode in the array interface typestring. Right now these codes
require that the number of bytes be specified explicitly (to satisfy the
array interface requirement). There is still only 1 Unicode data-type
on the platform and it has the size of Python's Py_UNICODE type. The
character 'U' continues to be useful on data-type construction to stand
for a unicode string of a specific character length. Its internal dtype
representation will use 'w' or 'W' depending on how Python was compiled.
This may not solve all issues, but at least it's a bit more consistent
and solves the problem of
dtype(dtype('U8').str) not producing the same datatype.
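A minimal sketch of that round-trip check, assuming a numpy installation that accepts 'U8' on data-type construction. The typestring it prints depends on the numpy version and on how Python was built, so no particular output is claimed here; the only point is that the reconstructed dtype should compare equal to the original.

```python
import numpy as np

# Build a unicode data-type for strings of 8 characters.
original = np.dtype('U8')

# .str is the array-interface typestring numpy reports for this dtype.
typestring = original.str

# Feeding that typestring back into dtype() should give an equal dtype.
roundtrip = np.dtype(typestring)

print(typestring, original.itemsize)
print(roundtrip == original)
```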
It also solves the problem of unicode written out with one compilation
of Python and then read in with another (it won't let you, because only
one of 'w#' or 'W#' is supported on a platform).
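To make the 'w#' / 'W#' convention concrete, here is a rough pure-Python sketch. The helper names (unicode_typestring, check_readable) are hypothetical and are not numpy functions; the sketch also leaves out any byte-order prefix and simply assumes that '#' is the total size in bytes, as described above.

```python
# Hypothetical helpers, NOT numpy API. They only illustrate the convention
# described above: 'w' = 2-byte unicode, 'W' = 4-byte unicode, with the
# size given explicitly in bytes.

CODE_FOR_WIDTH = {2: 'w', 4: 'W'}


def unicode_typestring(nchars, char_width):
    """Return the typestring for nchars characters of char_width bytes each."""
    code = CODE_FOR_WIDTH[char_width]
    return '%s%d' % (code, nchars * char_width)


def check_readable(typestring, char_width):
    """Return the character length, or raise if the width does not match.

    char_width stands in for the size of Py_UNICODE on the reading platform.
    """
    code, nbytes = typestring[0], int(typestring[1:])
    if code != CODE_FOR_WIDTH[char_width]:
        raise TypeError('%r is not supported on a %d-byte unicode build'
                        % (typestring, char_width))
    return nbytes // char_width


# 'U8' written on a 2-byte build versus a 4-byte build.
narrow = unicode_typestring(8, 2)
wide = unicode_typestring(8, 4)
print(narrow, wide)

# Reading data back on the same kind of build recovers the character length.
print(check_readable(narrow, 2))
```

Calling check_readable(narrow, 4) in this sketch raises TypeError, which mirrors the "it won't let you" behaviour: data written on a 2-byte build is rejected on a 4-byte build rather than silently misread.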
-Travis