[issue15625] Support u and w codes in memoryview

Stefan Krah report at bugs.python.org
Wed Aug 15 19:25:04 CEST 2012


Stefan Krah added the comment:

Nick's comment in msg167963 got me thinking. Indeed, in NumPy the 'U'
specifier is similar to the struct module's 's' format code, only for
UCS4. So I'm questioning whether the current semantics of 'u' and 'w'
used by array.array were ever intended by the PEP authors:


>>> import numpy
>>> nd = numpy.array(["A", "B"], dtype='U')
>>> nd
array(['A', 'B'],
      dtype='<U1')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00'
>>>
>>> nd = numpy.array(["ABC", "D"], dtype='U')
>>> nd
array(['ABC', 'D'],
      dtype='<U3')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>


Internally, NumPy always stores 'U' as UCS4 (four bytes per code point),
and the data type is a fixed-length string as wide as the longest
initializer element. Shorter elements are padded with NULs, as the
second tostring() output shows.
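
To make the layout concrete, here is a minimal stdlib-only sketch that
reproduces the bytes shown above without NumPy. The helper name
pack_ucs4_fixed is made up for illustration, and the sketch assumes
little-endian UCS4 (matching the '<U3' dtype) and a non-empty input list:

```python
def pack_ucs4_fixed(words):
    """Lay out strings as fixed-width, NUL-padded, little-endian UCS4 records.

    Hypothetical helper for illustration only; it mimics the byte layout
    seen in the NumPy session, it is not NumPy's implementation.
    """
    # Record width in code points: the longest initializer element wins.
    width = max(len(w) for w in words)
    out = bytearray()
    for w in words:
        # Pad each element with NULs up to the common width.
        padded = w + "\x00" * (width - len(w))
        # utf-32-le emits exactly four bytes per code point, no BOM.
        out += padded.encode("utf-32-le")
    return width, bytes(out)


width, raw = pack_ucs4_fixed(["ABC", "D"])
```

With ["ABC", "D"] the width is 3, so every record occupies 12 bytes and
the whole buffer is 24 bytes long.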


NumPy's use of 'U' seems vastly more useful for arrays than the behavior
of array.array:

>>> import array
>>> array.array('u', ['A', 'B'])
array('u', 'AB')
>>> array.array('u', ['ABC', 'D'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: array item must be unicode character


In NumPy, arrays of words are possible; with array.array they are not.

An additional thought: The convention in the struct module is to use
uppercase for unsigned types. So one possibility would be to use
'C', 'U' and 'W', where '3C' would mean the same as '3s', but for
UCS1 instead of bytes.
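
For reference, a short sketch of the existing struct behavior that this
idea builds on. Only 's', 'b' and 'B' are used here; the 'C', 'U' and
'W' codes are just a proposal and do not exist in the struct module, so
they are not exercised:

```python
import struct

# '3s' is a single fixed-length field of three bytes.
packed = struct.pack("3s", b"ABC")

# Input shorter than the field is padded with NULs on the right.
short = struct.pack("3s", b"D")

# The field size comes from the count prefix, not from the data.
size = struct.calcsize("3s")

# Existing case convention: lowercase 'b' is a signed char,
# uppercase 'B' is an unsigned char.
signed = struct.pack("b", -1)
unsigned = struct.pack("B", 255)
```

A '3C' code, if it were added, would presumably behave like '3s' above
while yielding a str of UCS1 characters rather than a bytes object.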

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue15625>
_______________________________________

