[Numpy-discussion] `np.array()`, array-likes, nested sequences and subclasses
Sebastian Berg
sebastian at sipsolutions.net
Thu Jun 18 10:49:35 EDT 2020
Hi all,
tl;dr: `np.array()` is somewhat ill-defined, which also creates issues
for Quantities. In a recent PR I am cementing, and slightly broadening,
its definition. So we have to decide how we wish to handle code such
as the following in the long run:
np.array([array-like, array-like])
---
Traditionally, we have two meanings of "array-like" as understood by
`np.array()` (in the text below I use "array-like" for the second one):
1. Nested sequences of scalars.
2. A single array-like object, meaning an object exposing the buffer
interface, an array subclass, a pandas DataFrame (`__array__()`), etc.
However, the boundaries between these are fuzzy, and have become
fuzzier over the years. The reason is that a NumPy array (like many
array-likes) is also a nested sequence of scalars.
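To make the two meanings concrete, here is a minimal sketch. The
`Wrapper` class is a hypothetical stand-in for an array-like that
exposes its data only through `__array__()`; it is not part of NumPy
or of the PR.

```python
import numpy as np


class Wrapper:
    """Hypothetical array-like: data is reachable only via __array__()."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Accept both keywords so old and new NumPy versions can call this.
        arr = self._data
        if dtype is not None:
            arr = arr.astype(dtype)
        return arr


# Meaning 1: nested sequences of scalars.
nested = np.array([[1, 2], [3, 4]])

# Meaning 2: a single array-like object.
single = np.array(Wrapper([[1, 2], [3, 4]]))

# The fuzzy boundary: an ndarray is also a sequence of its rows.
rows = list(nested)
```

Both calls produce the same 2x2 array, and iterating the result yields
its two rows, which is why the two meanings overlap.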
I defined the current behaviour slightly more clearly in my PR, but in
doing so also subtly broadened it [0]:
1. Any array-like embedded in the nested sequences is converted to a
NumPy array. [1] (An array-like is never interpreted as a sequence.)
2. Any array-like's elements will be elements of the output.
We never enter array-likes recursively (including object arrays).
3. The `subok=True` parameter is implicitly ignored, unless the input
is a single ndarray subclass.
Now to the issues at hand:
* We should make sure those definitions are good. They mainly cement
current behaviour, but if we want to roll back on features,
we should do it now.
* There are some issues around Quantity and masked arrays,
because their "scalars" are (sometimes) 0-D arrays. And they
currently rely on NumPy considering them to be scalars.
This has its own set of long term issues [2].
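As an illustration of the "(sometimes) 0-D" point for masked arrays,
here is a small sketch using only `np.ma`. It shows what indexing
returns today; it does not show the coercion code path itself.

```python
import numpy as np

m = np.ma.array([1.0, 2.0], mask=[False, True])

# An unmasked element comes back as an ordinary NumPy scalar ...
unmasked = m[0]

# ... but a masked element comes back as np.ma.masked, which is a
# 0-D MaskedArray instance rather than a scalar type.
masked = m[1]
```

So the same indexing operation yields either a true scalar or a 0-D
array subclass, depending on the mask.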
For now, I can simply roll the changes to 0-D array behaviour back.
But in the mid-to-long run, we have to make a decision, or perpetually
live with array subclasses being subtly broken:
1. Define Quantity and Masked arrays as wrong. They must use a
special DType, which consistently tells NumPy that the elements
cannot simply be copied by converting the Quantity to an array.
The upside is that it generalizes to N-D.
2. Independently, but partially addressing the Quantity issue, we have
to decide what `np.array()` should actually do. A sequence
containing array-likes is, in most cases, better written using
`np.stack()`, but due to the fuzzy boundaries, code like
`np.array([dataframe, dataframe])` is probably common.
We could try to deprecate it, though.
The downside of deprecation, it seems to me, is that we would have to
reject viewing array-likes as sequences. To me, doing that has its own
set of issues, if only because `np.array([arraylike])` seems perfectly
reasonable, but may be very slow.
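For reference, a short sketch of the `np.stack()` spelling mentioned
above, using plain ndarrays in place of dataframes. For equally shaped
inputs both spellings give the same result, but `np.stack()` states
the intent explicitly and rejects mismatched shapes.

```python
import numpy as np

a = np.arange(3)
b = np.arange(3, 6)

via_array = np.array([a, b])  # relies on np.array() entering the list
via_stack = np.stack([a, b])  # explicit: join along a new first axis

# np.stack() refuses inputs whose shapes differ.
stack_error = None
try:
    np.stack([a, np.arange(2)])
except ValueError as exc:
    stack_error = exc
```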
- Sebastian
[0] It is hard to list exactly how it is broadened, because the
current behaviour has some very subtle corner cases, such as actually
iterating a `memoryview()`, which always does the same thing, but only
works for 1-D memoryviews, and fails for both 0-D and N-D.
[1] There are some subtleties which are not important here, such as
that I do anticipate the possibility of having array-likes which are
considered scalars with respect to a given dtype, such as
`np.array([poly], dtype=Polynomial)` where a poly object itself is an
array-like.
[2] Basically:
np.array([0d_array], dtype=user_dtype)
works, by ending up calling:
res[0] = float(0d_array) # quantity.__float__ is used!
which works nicely for the typical float/int dtype, but is tricky to
get right for general dtypes (e.g. longdouble/clongdouble). This is a
small issue now, but it could become a problem when more user-dtypes
are defined.
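The `__float__` mechanism in [2] can be sketched with a small
stand-in. `FloatLike` is a hypothetical class, not Quantity, and the
example uses a builtin float dtype rather than a user dtype; it only
shows that assigning an unknown object into a float array goes through
`float(obj)`.

```python
import numpy as np

calls = []


class FloatLike:
    """Hypothetical scalar-like object convertible only via __float__."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        calls.append(self.value)  # record that NumPy asked for a float
        return float(self.value)


res = np.array([FloatLike(1.5)], dtype=np.float64)
```

Because the value passes through a Python float, the same route cannot
carry more than double precision, which is the difficulty with
longdouble/clongdouble noted above.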