[Numpy-discussion] Separating out the maskna code

Sun May 20 01:08:27 EDT 2012

Wow, Nathaniel.   This looks like a nice piece of tedious work.     I have not reviewed it in detail, but in general I would be very supportive of your plan to commit this to master, make a 1.7 release (without the ReduceWrapper) function and then work on the masked array / ndarray separation plan for 1.8

Of course, first I would want to hear from Mark, to hear his comments about what was removed. 

-Travis

On May 19, 2012, at 3:54 PM, Nathaniel Smith wrote:

> Hi all,
> 
> Since Mark's original missingdata branch made so many changes, I
> figured it would be a useful exercise to figure out what code in
> master is actually related to masked arrays, and which isn't. The
> easiest way seemed to be to delete the new fields, then keep removing
> any code that depended on them until everything built again, which
> gives this:
>  https://github.com/njsmith/numpy/commits/separate-maskna
> 
> Possible uses:
>  - Use the diff as a checklist for going through to change the API
>  - Use the diff as a checklist for adding an experimental-API-only flag
>  - Merge into master, then use as a reference to cherrypick the
> pieces that we want to save (there is some questionable stuff in here
> -- e.g. copy/paste code, hand-wavy half-implementation of "multi-NA"
> support, and PyArray_ReduceWrapper, see below)
>  - Merge into master and then immediately 'git revert' the changes on
> a branch, which would effectively 'move it aside' so we can release
> 1.7 while Mark and Travis to continue hacking on it at their own pace
> 
> This is a huge patch, but I was pretty careful not to cause any
> accidental non-maskna-related regressions. The whole PyArray_Diagonal
> thread actually happened because I noticed that test_maskna had the
> only tests for np.diagonal, so I wanted to write proper ones that
> would be independent of the maskna code. Also I ran the following
> tests:
>  - numpy.test("full")
>  - scipy v0.10.1, test("full")
>  - matplotlib current master
>  - pandas v0.7.3
> and everything looks good.
> 
> The other complicated thing to handle was the new
> PyArray_ReduceWrapper function that was added to the public multiarray
> API. Conceptually, this function has only a glancing relationship to
> masked arrays per se. But, it has its own problems, and I don't think
> it should be exposed in 1.7. Partly because its signature necessarily
> changes depending on whether maskna support exists. Partly because
> it's just kind of ugly (it has 15 arguments, after I removed some[1]).
> But mostly because it gives us two independent generic implementations
> of functions that do array->scalar operations, which seems like
> something we absolutely don't want to commit to supporting
> indefinitely. And the "generalized ufunc" alternative seems a lot more
> promising. So, that branch also has a followup patch that does the
> necessary hacking to get it out of the public API.
> 
> Unfortunately, this patch is dependent on the previous one -- I'm not
> sure how to untangle PyArray_ReduceWrapper while keeping the maskna
> support in, which makes the "global experimental flag" idea for 1.7
> hard to implement (assuming that others agree about
> PyArray_ReduceWrapper being unready for public exposure).
> 
> At this point, it might easiest to just merge this branch to master,
> immediately revert it on a branch for Mark and Travis to work on, and
> then release 1.7.
> 
> Ralf, IIUC merging this and my other outstanding PRs would leave the
> datetime issues on python3/win32 as the only outstanding blocker?
> 
> - N
> 
> [1] http://www-pu.informatik.uni-tuebingen.de/users/klaeren/epigrams.html
> , number 11
> 
> ------------ Commit messages follow for reference/discussion
> 
> The main change is commit 4c16813c23b20:
>    Remove maskna API from ndarray, and all (and only) the code supporting it
> 
>    The original masked-NA-NEP branch contained a large number of changes
>    in addition to the core NA support. For example:
>     - ufunc.__call__ support for where= argument
>     - nditer support for arbitrary masks (in support of where=)
>     - ufunc.reduce support for simultaneous reduction over multiple axes
>     - a new "array assignment API"
>     - ndarray.diagonal() returning a view in all cases
>     - bug-fixes in __array_priority__ handling
>     - datetime test changes
>    etc. There's no consensus yet on what should be done with the
>    maskna-related part of this branch, but the rest is generally useful
>    and uncontroversial, so the goal of this branch is to identify exactly
>    which code changes are involved in maskna support.
> 
>    The basic strategy used to create this patch was:
>     - Remove the new masking-related fields from ndarray, so no arrays
>       are masked
>     - Go through and remove all the code that this makes
>       dead/inaccessible/irrelevant, in a largely mechanical fashion. So
>       for example, if I saw 'if (PyArray_HASMASK(a)) { ... }' then that
>       whole block was obviously just dead code if no arrays have masks,
>       and I removed it. Likewise for function arguments like skipna that
>       are useless if there aren't any NAs to skip.
> 
>    This changed the signature of a number of functions that were newly
>    exposed in the numpy public API. I've removed all such functions from
>    the public API, since releasing them with the NA-less signature in 1.7
>    would create pointless compatibility hassles later if and when we add
>    back the NA-related functionality. Most such functions are removed by
>    this commit; the exception is PyArray_ReduceWrapper, which requires
>    more extensive surgery, and will be handled in followup commits.
> 
>    I also removed the new ndarray.setasflat method. Reason: a comment
>    noted that the only reason this was added was to allow easier testing
>    of one branch of PyArray_CopyAsFlat. That branch is now the main
>    branch, so that isn't an issue. Nonetheless this function is arguably
>    useful, so perhaps it should have remained, but I judged that since
>    numpy's API is already hairier than we would like, it's not a good
>    idea to add extra hair "just in case". (Also AFAICT the test for this
>    method in test_maskna was actually incorrect, as noted here:
>       https://github.com/njsmith/numpyNEP/blob/master/numpyNEP.py
>    so I'm not confident that it ever worked in master, though I haven't
>    had a chance to follow-up on this.)
> 
>    I also removed numpy.count_reduce_items, since without skipna it
>    became trivial.
> 
>    I believe that these are the only exceptions to the "remove dead code"
>    strategy.
> 
> The ReduceWrapper untangling is a001fb29c9:
>    Remove PyArray_ReduceWrapper from public API
> 
>    There are two reasons to want to keep PyArray_ReduceWrapper out of the
>    public multiarray API:
>     - Its signature is likely to change if/when masked arrays are added
>     - It is essentially a wrapper for array->scalar transformations
>       (*not* just reductions as its name implies -- the whole reason it
>       is in multiarray.so in the first place is to support count_nonzero,
>       which is not actually a reduction!). It provides some nice
>       conveniences (like making it easy to apply such functions to
>       multiple axes simultaneously), but, we already have a general
>       mechanism for writing array->scalar transformations -- generalized
>       ufuncs. We do not want to have two independent, redundant
>       implementations of this functionality, one in multiarray and one in
>       umath! So in the long run we should add these nice features to the
>       generalized ufunc machinery. And in the short run, we shouldn't add
>       it to the public API and commit ourselves to supporting it.
> 
>    However, simply removing it from numpy_api.py is not easy, because
>    this code was used in both multiarray and umath. This commit:
>     - Moves ReduceWrapper and supporting code to umath/, and makes
>       appropriate changes (e.g. renaming it to PyUFunc_ReduceWrapper and
>       cleaning up the header files).
>     - Reverts numpy.count_nonzero to its previous implementation, so that
>       it loses the new axis= and keepdims= arguments. This is
>       unfortunate, but this change isn't so urgent that it's worth tying
>       our APIs in knots forever. (Perhaps in the future it can become a
>       generalized ufunc.)
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion