[Numpy-discussion] Extent of unicode types in numpy

Travis Oliphant oliphant.travis at ieee.org
Mon Feb 6 23:17:04 EST 2006


> I'm way out of my depth here, but it really sounds like there needs to 
> be one descriptor for each type.  Just for example "U" could be 2-byte 
> unicode and "V" (assuming it's not taken already) could be 4-byte 
> unicode. Then the size for a given descriptor would be constant and 
> things would be much less confusing.


In current SVN, numpy assumes 'w' is 2-byte unicode and 'W' is 4-byte 
unicode in the array interface typestring.   Right now these codes 
require that the number of bytes be specified explicitly (to satisfy the 
array interface requirement).   There is still only one Unicode data-type 
on a given platform, and it has the size of Python's Py_UNICODE type.  The 
character 'U' continues to be useful in data-type construction to stand 
for a unicode string of a specific character length.  Its internal dtype 
representation will use 'w' or 'W', depending on how Python was compiled.

This may not solve all issues, but at least it's a bit more consistent 
and solves the problem of

dtype(dtype('U8').str) not producing the same datatype.
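
A minimal sketch of the round trip in question, assuming a recent NumPy
build where the Unicode typestring is spelled '<U8' (under the SVN scheme
above it would instead use 'w' or 'W' with an explicit byte count); the
property being restored is the same:

    import numpy as np

    dt = np.dtype('U8')     # a unicode string of 8 characters
    print(dt.str)           # the array-interface typestring, e.g. '<U8'

    # Rebuilding the dtype from its own typestring should give back
    # an equal dtype -- the round trip that previously failed.
    assert np.dtype(dt.str) == dt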

It also solves the problem of unicode data written out with one 
compilation of Python and then read back in with another (it won't let 
you, because only one of 'w#' or 'W#' is supported on a given platform).
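
To make the size dependence concrete, a small sketch (assuming a current
NumPy build, where each character is stored in 4 bytes; on a narrow
Py_UNICODE build it would be 2, which is exactly the mismatch the
separate 'w#'/'W#' codes guard against):

    import numpy as np

    # Bytes per 'U' character -- sizeof(Py_UNICODE) in the builds
    # discussed here, 4 on current builds.
    per_char = np.dtype('U1').itemsize
    print(per_char)

    # A 'U8' element therefore occupies 8 * per_char bytes, so data
    # written by a 2-byte build cannot silently be read by a 4-byte one.
    assert np.dtype('U8').itemsize == 8 * per_char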

-Travis




