[Numpy-discussion] What to do about structured string dtype and string regression?

Ralf Gommers ralf.gommers at gmail.com
Wed Feb 17 05:15:50 EST 2021


On Wed, Feb 17, 2021 at 2:14 AM Stephan Hoyer <shoyer at gmail.com> wrote:

> On Tue, Feb 16, 2021 at 3:13 PM Sebastian Berg <sebastian at sipsolutions.net>
> wrote:
>
>> Hi all,
>>
>> In https://github.com/numpy/numpy/issues/18407 it was reported that
>> there is a regression for `np.array()` and friends in NumPy 1.20 for
>> code such as:
>>
>>     np.array(["1234"], dtype=("U1", 4))
>>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>>     # NumPy 1.19: array(['1', '2', '3', '4'], dtype='<U1')
>>
>>
>> The Basics
>> ----------
>>
>> This happens when you ask for a rare "subarray" dtype. Ways to create
>> one are:
>>
>>     np.dtype(("U1", 4))
>>     np.dtype("(4)U1,")  # (does not have a field, only a subarray)
>>
>> Both give the same subarray dtype: a "U1" dtype with shape (4,).
>> One thing to know about these dtypes is that they cannot be attached to
>> an array:
>>
>>     np.zeros(3, dtype="(4)U1,").dtype == "U1"
>>     np.zeros(3, dtype="(4)U1,").shape == (3, 4)
>>
>> I.e. the shape is moved/added into the array itself (instead of
>> remaining part of the dtype).
>>
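
To make the point above concrete, here is a minimal sketch (assuming only
that NumPy is installed; the variable names are illustrative) showing that
the subarray shape lives on the dtype object but ends up on the array:

```python
import numpy as np

# A subarray dtype: base dtype "U1" with subarray shape (4,).
sub = np.dtype(("U1", 4))
print(sub.base, sub.shape)

# Creating an array with it moves the (4,) into the array shape;
# the array itself only carries the plain base dtype.
arr = np.zeros(3, dtype=sub)
print(arr.dtype == sub.base, arr.shape)
```

So the requested dtype and `arr.dtype` differ, which is why the subarray
dimension has to be handled specially during array creation.
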
>> The Change
>> ----------
>>
>> Now what changed, and why?  When filling subarray dtypes, NumPy
>> normally fills every element with the same input. For the example
>> above, NumPy will in most cases give the 1.20 result because it assigns
>> "1234" to every subarray element individually. Maybe confusingly, this
>> truncates so that only the "1" is actually assigned. We can prove it
>> with a structured dtype (same result in 1.19 and 1.20):
>>
>>     >>> np.array(["1234"], dtype="(4)U1,i")
>>     array([(['1', '1', '1', '1'], 1234)],
>>           dtype=[('f0', '<U1', (4,)), ('f1', '<i4')])
>>
>> Another, weirder case which changed (more obviously for the better) is:
>>
>>     >>> np.array("1234", dtype="(4)U1,")
>>     # NumPy 1.20: array(['1', '1', '1', '1'], dtype='<U1')
>>     # NumPy 1.19: array(['1', '', '', ''], dtype='<U1')
>>
>> And, to point it out, we can have subarrays that are not 1-D:
>>
>>     >>> np.array(["12"], dtype="(2,2)U1,")
>>     array([[['1', '1'],
>>             ['2', '2']]], dtype='<U1')  # NumPy 1.19; in 1.20 every element is '1'
>>
>>
>> The Cause
>> ---------
>>
>> The cause of the 1.19 behaviour is two-fold:
>>
>> 1. The "subarray" part of the dtype is moved into the array after the
>> dimension is found. At this point strings are always considered
>> "scalars".  In most above examples, the new array shape is (1,)+(4,).
>>
>> 2. When filling the new array with values, it now has an _additional_
>> dimension!  Because of this, the string is now suddenly considered a
>> sequence, so it behaves as if it were `list("1234")`, although
>> NumPy would normally never consider a string a sequence.
>>
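
The scalar-versus-sequence distinction in the two points above can be seen
without any subarray dtype. This is only a sketch with a plain "U1" dtype
(variable names are illustrative), not the code path inside NumPy:

```python
import numpy as np

# A string is normally a scalar: one element, truncated to fit "U1".
as_scalar = np.array(["1234"], dtype="U1")
print(as_scalar.shape, as_scalar.tolist())      # (1,) ['1']

# Spelling out the sequence explicitly gives one element per character.
as_sequence = np.array([list("1234")], dtype="U1")
print(as_sequence.shape, as_sequence.tolist())  # (1, 4) [['1', '2', '3', '4']]
```

Code that relied on the 1.19 result can presumably get it unambiguously by
passing `list("1234")` with the plain "U1" dtype, as in the second call.
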
>>
>> The Solution?
>> -------------
>>
>> I honestly don't have one.  We can consider strings as sequences in
>> this weird special case.  That will probably create other weird special
>> cases, but they would be even more hidden (I expect mainly odder things
>> throwing an error).
>>
>> Should we try to document this better in the release notes or can we
>> think of some better (or at least louder) solution?
>>
>
I was honestly surprised there's even such a thing as a "subarray data
type"; I've never seen it used in the wild. Looking at the release notes
you already have,
https://numpy.org/devdocs/release/1.20.0-notes.html#arrays-cannot-be-using-subarray-dtypes,
all I'm thinking is that no one should ever be writing code like that.


> There are way too many unsafe assumptions in this example. It's an edge
> case of an edge case.
>
> I don't think we should be beholden to continuing to support this
> behavior, which was obviously never anticipated. If there was a way to
> raise a warning or error in potentially ambiguous situations like this, I
> would support it.
>
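
As a rough illustration of what "raise an error in ambiguous situations"
could look like on the user side, here is a sketch of a pre-check. The
helper `reject_truncated_strings` is hypothetical (it is not a NumPy API),
and it assumes the base dtype is a fixed-width unicode dtype such as "U1":

```python
import numpy as np

def reject_truncated_strings(obj, base="U1"):
    """Hypothetical check: raise if a string would not fit the base dtype."""
    # Characters the base unicode dtype can hold (assumes a "U<n>" dtype).
    max_chars = np.dtype(base).itemsize // np.dtype("U1").itemsize

    def walk(item):
        if isinstance(item, str):
            if len(item) > max_chars:
                raise ValueError(
                    f"{item!r} has {len(item)} characters but dtype {base!r} "
                    f"holds {max_chars}; pass an explicit sequence such as "
                    f"list({item!r}) if one element per character is intended"
                )
        elif isinstance(item, (list, tuple)):
            for sub in item:
                walk(sub)

    walk(obj)

# The ambiguous input from the report is rejected...
try:
    reject_truncated_strings(["1234"], base="U1")
except ValueError as exc:
    print("rejected:", exc)

# ...while the explicit per-character form passes silently.
reject_truncated_strings([list("1234")], base="U1")
```

Calling such a check before `np.array(...)` would turn the silent
truncation into a loud failure, at the cost of an extra pass over the input.
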

+1

Ralf