[Numpy-discussion] `np.array()`, array-likes, nested sequences and subclasses

Thu Jun 18 10:49:35 EDT 2020

Hi all,

tl;dr: `np.array()` is somewhat ill-defined, which also creates issues
for Quantities.  In a recent PR I am cementing, and slightly broadening,
its definition.  So we have to decide how, in the long run, we wish to
handle code such as:

    np.array([array-like, array-like])

---

Traditionally, we have two meanings of "array-like" as understood by
`np.array()` (in the text below I use "array-like" for the second one):

1. Nested sequences of scalars.

2. A single array-like object, meaning a buffer-interface, an array
   subclass, a pandas dataframe (`__array__()`), etc.

However, the boundaries between these are fuzzy, and have become
fuzzier over the years.  The reason is that a NumPy array (and many
array-likes) is also a nested sequence of scalars.
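
To make the two meanings concrete, here is a minimal sketch.  The
`ArrayLike` class is hypothetical and only stands in for any object that
exposes its data through `__array__()`; it is not part of NumPy.

```python
import numpy as np


class ArrayLike:
    """Hypothetical array-like: exposes its data only via __array__()."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # `copy` is accepted for newer NumPy versions but ignored here.
        return self._data if dtype is None else self._data.astype(dtype)


# Meaning 1: a nested sequence of scalars.
from_nested = np.array([[1, 2], [3, 4]])

# Meaning 2: a single array-like object.
from_single = np.array(ArrayLike([[1, 2], [3, 4]]))

# The fuzzy part: an ndarray can itself be walked like a nested sequence.
walked = [[int(x) for x in row] for row in from_nested]
```

Both calls give the same 2-D result, and the array can be iterated row
by row just like the list it was built from.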

I defined the current behaviour slightly more clearly in my PR, but in
doing so also subtly broadened it [0]:

1. Any array-like embedded in the nested sequences is converted to a
   NumPy array. [1] (An array-like is never interpreted as a sequence.)

2. Any array-like's elements will be elements of the output.
   We never enter array-likes recursively (including object arrays).

3. The `subok=True` parameter is implicitly ignored, unless the input
   is a single ndarray subclass.

Now to the issues at hand:

* We should make sure those definitions are good.  They mainly cement
  current behaviour, but if we want to roll back on features,
  we should do it now.

* There are some issues around Quantity and masked arrays,
  because their "scalars" are (sometimes) 0-D arrays.  And they
  currently rely on NumPy considering them to be scalars.
  This has its own set of long term issues [2].

For now, I can simply roll back the changes to 0-D array behaviour.
But in the mid-to-long run, we have to make a decision, or perpetually
live with array subclasses being subtly broken:

1. Define Quantity and masked arrays as wrong.  They must use a
   special DType, which consistently tells NumPy that the elements
   cannot simply be copied by converting the Quantity to an array.
   The upside is that it generalizes to N-D.

2. Independently, but partially addressing the Quantity issue, we have
   to decide what `np.array()` should actually do.  A sequence
   containing array-likes is in most cases better written using
   `np.stack()`, but due to the fuzzy boundaries, code like
   `np.array([dataframe, dataframe])` is probably common.
   We could try to deprecate that, though.
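
For plain arrays of equal shape, the two spellings give the same
result, which is why `np.stack()` is usually the clearer choice.  A
small sketch with made-up data:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = a + 10

# Relies on np.array() treating each array as one element of a sequence.
via_array = np.array([a, b])

# States the intent explicitly: join along a new leading axis.
via_stack = np.stack([a, b])
```

Both results have shape `(2, 2, 3)` and identical contents.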

The downside of deprecation, it seems to me, is that we would have to
reject viewing array-likes as sequences.  To me, doing that has its own
set of issues, if only that `np.array([arraylike])` seems perfectly
reasonable, but may be very slow.

- Sebastian

[0] It is hard to list exactly how it is broadened, because the
current behaviour has very subtle quirks, such as actually
iterating a `memoryview()`, which always does the same thing, but only
works for 1-D memoryviews, and fails for both 0-D and N-D.

[1] There are some subtleties which are not important here, such as that
I do anticipate the possibility of having array-likes which are
considered scalars with respect to a given dtype, such as
`np.array([poly], dtype=Polynomial)` where a poly object itself is an
array-like.

[2] Basically:

    np.array([0d_array], dtype=user_dtype)

works, by ending up calling:

    res[0] = float(0d_array)  # quantity.__float__ is used!

which works nicely for the typical float/int dtype, but is tricky to
get right for general dtypes (e.g. longdouble/clongdouble).  This is a
small issue now, but it could become a problem when more user-dtypes
are defined.
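
The `float()` round trip can be sketched with a plain Python object
instead of a real 0-D Quantity.  `FloatOnly` is a hypothetical stand-in
that only knows how to turn itself into a Python float; it is not how
Quantity is implemented.

```python
import numpy as np


class FloatOnly:
    """Hypothetical scalar-like object that only defines __float__."""

    def __init__(self, value):
        self.value = value
        self.float_calls = 0

    def __float__(self):
        # Count the calls so we can see that NumPy goes through float().
        self.float_calls += 1
        return float(self.value)


obj = FloatOnly(2.5)

# NumPy fills the float64 result by converting the element via float().
res = np.array([obj], dtype=np.float64)
```

Because the value passes through a Python float, anything a float
cannot represent (such as extra longdouble precision) has no way to
reach the result on this path.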