[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

Nathan Goldbaum nathan12343 at gmail.com
Tue Aug 25 12:52:42 EDT 2015


On Tue, Aug 25, 2015 at 5:03 AM, Nathaniel Smith <njs at pobox.com> wrote:

> Hi all,
>
> These are the notes from the NumPy dev meeting held July 7, 2015, at
> the SciPy conference in Austin, presented here so the list can keep up
> with what happens, and so you can give feedback. Please do give
> feedback, none of this is final!
>
> (Also, if anyone who was there notices anything I left out or
> mischaracterized, please speak up -- these are a lot of notes I'm
> trying to gather together, so I could easily have missed something!)
>
> Thanks to Jill Cowan and the rest of the SciPy organizers for donating
> space and organizing logistics for us, and to the Berkeley Institute
> for Data Science for funding travel for Jaime, Nathaniel, and
> Sebastian.
>
>
> Attendees
> =========
>
>   Present in the room for all or part: Daniel Allan, Chris Barker,
>   Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
>   Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
>   pretty sure this list is incomplete)
>
>   Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
>
>
> Formalizing our governance/decision making
> ==========================================
>
>   This was a major focus of discussion. At a high level, the consensus
>   was to steal IPython's governance document ("IPEP 29") and modify it
>   to remove its use of a BDFL as a "backstop" to normal community
> consensus-based decision-making, and replace it with a new
> "backstop" based on Apache-project-style consensus voting amongst the
>   on Apache-project-style consensus voting amongst the core team.
>
>   I'll send out a proper draft of this shortly for further discussion.
>
>
> Development roadmap
> ===================
>
>   General consensus:
>
>   Let's assume NumPy is going to remain important indefinitely, and
>   try to make it better, instead of waiting for something better to
>   come along. (This is unlikely to be wasted effort even if something
>   better does come along, and it's hardly a sure thing that that will
>   happen anyway.)
>
>   Let's focus on evolving numpy as far as we can without major
>   break-the-world changes (no "numpy 2.0", at least in the foreseeable
>   future).
>
>   And, as a target for that evolution, let's change our focus from
>   "NumPy is the library that gives you the np.ndarray object (plus
>   some attached infrastructure)" to "NumPy provides the standard
>   framework for working with arrays and array-like objects in
>   Python".
>
>   This means creating well-defined interfaces between array-like objects /
>   ufunc objects / dtype objects, so that it becomes possible for third
>   parties to add their own and mix-and-match. Right now ufuncs are
>   pretty good at this, but if you want a new array class or dtype then
>   in most cases you pretty much have to modify numpy itself.
>
>   Vision: instead of everyone who wants a new container type having to
>   reimplement all of numpy, Alice can implement an array class using
>   (sparse / distributed / compressed / tiled / gpu / out-of-core /
>   delayed / ...) storage, pass it to code that was written using
>   direct calls to np.* functions, and it just works. (Instead of
>   np.sin being "the way you calculate the sine of an ndarray", it's
>   "the way you calculate the sine of any array-like container
>   object".)
>
>   Vision: Darryl can implement a new dtype for (categorical data /
>   astronomical dates / integers-with-missing-values / ...) without
>   having to touch the numpy core.
>
>   Vision: Chandni can then come along and combine them by doing
>
>   a = alice_array([...], dtype=darryl_dtype)
>
>   and it just works.
>
>   Vision: no-one is tempted to subclass ndarray, because anything you
>   can do with an ndarray subclass you can also easily do by defining
>   your own new class that implements the "array protocol".
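>
>   To make the "array protocol" idea concrete, here's a toy
>   pure-Python sketch. The __numpy_ufunc__ hook is the one under
>   discussion; the UnitArray container and the standalone sin()
>   wrapper are invented stand-ins for how np.sin might dispatch, not
>   anything that exists in numpy today:

```python
import math

class UnitArray:
    """Hypothetical array-like container: a list of values plus a unit."""

    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = unit

    # The dispatch hook a numpy function would call instead of
    # coercing the operand to an ndarray.
    def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
        return UnitArray([ufunc(v) for v in self.values], self.unit)

def sin(x):
    # Stand-in for np.sin: defer to the operand's hook if it has one,
    # otherwise compute directly.
    if hasattr(x, "__numpy_ufunc__"):
        return x.__numpy_ufunc__(math.sin, "__call__", 0, (x,))
    return math.sin(x)

a = sin(UnitArray([0.0, math.pi / 2], "radians"))
print(a.unit, a.values)
```

>   The point is that code written against sin() works unchanged for
>   any container implementing the hook.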
>
>
> Supporting third-party array types
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   Sub-goals:
>   - Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
>     API right there.
>   - Go through the rest of the stuff in numpy, and figure out some
>     story for how to let it handle third-party array classes:
>     - ufunc ALL the things: Some things can be converted directly into
>       (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
>       things could be converted into (g)ufuncs if we extended the
>       (g)ufunc interface a bit (e.g. np.sort, np.matmul).
>     - Some things probably need their own __numpy_ufunc__-like
>       extensions (__numpy_concatenate__?)
>   - Provide tools to make it easier to implement the more complicated
>     parts of an array object (e.g. the bazillion different methods,
>     many of which are ufuncs in disguise, or indexing)
>   - Longer-run interesting research project: __numpy_ufunc__ requires
>     that one or the other object have explicit knowledge of how to
>     handle the other, so to handle binary ufuncs with N array types
>     you need something like N**2 __numpy_ufunc__ code paths. As an
>     alternative, if there were some interface that an object could
>     export that provided the operations nditer needs to efficiently
>     iterate over (chunks of) it, then you would only need N
>     implementations of this interface to handle all N**2 operations.
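>
>     A toy illustration of the N-versus-N**2 point (all the names
>     here -- ChunkedList, iter_chunks, binary_op -- are made up for
>     illustration, not a proposed API):

```python
import operator

class ChunkedList:
    """Hypothetical container that exports chunk-wise iteration."""

    def __init__(self, data, chunksize=2):
        self.data = list(data)
        self.chunksize = chunksize

    def iter_chunks(self):
        # The one method each container type must implement.
        for i in range(0, len(self.data), self.chunksize):
            yield self.data[i:i + self.chunksize]

def binary_op(op, a, b):
    # Generic loop written once against the interface: it never sees
    # the concrete types of a and b, so N container types need only N
    # iter_chunks implementations instead of N**2 pairwise code paths.
    # (Assumes matching chunk boundaries, for simplicity.)
    out = []
    for ca, cb in zip(a.iter_chunks(), b.iter_chunks()):
        out.extend(op(x, y) for x, y in zip(ca, cb))
    return out

print(binary_op(operator.add,
                ChunkedList([1, 2, 3, 4]),
                ChunkedList([10, 20, 30, 40])))  # [11, 22, 33, 44]
```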
>
>   This would solve a lot of problems for projects like:
>   - blosc
>   - dask
>   - distarray
>   - numpy.ma
>   - pandas
>   - scipy.sparse
>   - xray
>
>
> Supporting third-party dtypes
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   We already have something like a C level "dtype
>   protocol". Conceptually, the way you define a new dtype is by
>   defining a new class whose instances have data attributes defining
>   the parameters of the dtype (what fields are in *this* record dtype,
>   how many characters are in *this* string dtype, what units are used
>   for *this* datetime64, etc.), and you define a bunch of methods to
>   do things like convert an object from a Python object to your dtype
>   or vice-versa, to copy an array of your dtype from one place to
>   another, to cast to and from your new dtype, etc. This part is
>   great.
>
>   The problem is, in the current implementation, we don't actually use
>   the Python object system to define these classes / attributes /
>   methods. Instead, all possible dtypes are jammed into a single
>   Python-level class, whose struct has fields for the union of all
>   possible dtype's attributes, and instead of Python-style method
>   slots there's just a big table of function pointers attached to each
>   object.
>
>   So the main proposal is that we keep the basic design, but switch it
>   so that the float64 dtype, the int64 dtype, etc. actually literally
>   are subclasses of np.dtype, each implementing their own fields and
>   Python-style methods.
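>
>   As a sketch of what "dtypes as real subclasses" could feel like at
>   the Python level (the class and method names below are invented
>   for illustration, not the planned API):

```python
class DType:
    """Hypothetical base class: each dtype is an ordinary Python class."""

    def from_pyobject(self, obj):
        raise NotImplementedError

    def to_pyobject(self, raw):
        raise NotImplementedError

class CategoricalDType(DType):
    # Per-instance parameters, like field names on a record dtype or
    # units on a datetime64.
    def __init__(self, categories):
        self.categories = list(categories)

    # Python-style methods replace slots in a table of function pointers.
    def from_pyobject(self, obj):
        return self.categories.index(obj)   # store integer codes

    def to_pyobject(self, raw):
        return self.categories[raw]

colors = CategoricalDType(["red", "green", "blue"])
code = colors.from_pyobject("green")
print(code, colors.to_pyobject(code))  # 1 green
```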
>
>   Some of the pieces involved in doing this:
>
>   - The current dtype methods should be cleaned up -- e.g. 'dot' and
>     'less_than' are both dtype methods, when conceptually they're much
>     more like ufuncs.
>
>   - The ufunc inner-loop interface currently does not get a reference
>     to the dtype object, so the inner loop can't see its attributes,
>     and this is
>     a big obstacle to many interesting dtypes (e.g., it's hard to
>     implement np.equal for categoricals if you don't know what
>     categories each has). So we need to add new arguments to the core
>     ufunc loop signature. (Fortunately this can be done in a
>     backwards-compatible way.)
>
>   - We need to figure out what exactly the dtype methods should be,
>     and add them to the dtype class (possibly with backwards
>     compatibility shims for anyone who is accessing PyArray_ArrFuncs
>     directly).
>
>   - Casting will be possibly the trickiest thing to work out, though
>     the basic idea of using dunder-dispatch-like __cast__ and
>     __rcast__ methods seems workable. (Encouragingly, this is also
>     exactly what dynd does, though unfortunately dynd does not
>     yet support user-defined dtypes even to the extent that numpy
>     does, so there isn't much else we can steal from them.)
>     - We may also want to rethink the casting rules while we're at it,
>       since they have some very weird corners right now (e.g. see
>       [https://github.com/numpy/numpy/issues/6240])
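>
>     The dunder-dispatch idea, by analogy with __add__/__radd__, in a
>     pure-Python sketch (the notes only name __cast__/__rcast__; the
>     toy Int64/Float64 classes and the cast() driver are invented):

```python
class Int64:
    def __cast__(self, value, to_dtype):
        # This dtype only knows how to cast itself up to Float64.
        if isinstance(to_dtype, Float64):
            return float(value)
        return NotImplemented

    def __rcast__(self, value, from_dtype):
        return NotImplemented

class Float64:
    def __cast__(self, value, to_dtype):
        return NotImplemented

    def __rcast__(self, value, from_dtype):
        # Fallback: accept anything float() accepts.
        return float(value)

def cast(value, from_dtype, to_dtype):
    # Ask the source dtype first, then give the target a chance,
    # exactly like binary-operator dispatch.
    result = from_dtype.__cast__(value, to_dtype)
    if result is NotImplemented:
        result = to_dtype.__rcast__(value, from_dtype)
    if result is NotImplemented:
        raise TypeError("no cast path between these dtypes")
    return result

print(cast(3, Int64(), Float64()))  # 3.0
```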
>
>   - We need to migrate the current dtypes over to the new system,
>     which can be done in stages:
>
>     - First stick them all in a single "legacy dtype" class whose
>       methods just dispatch to the PyArray_ArrFuncs per-object "method
>       table"
>
>     - Then move each of them into their own classes
>
>   - We should provide a Python-level wrapper for the protocol, so that
>     you can call dtype methods from Python
>
>   - And vice-versa, it should be possible to subclass dtype at the
>     Python level
>
>   - etc.
>
>   Fortunately, AFAICT pretty much all of this can be done while
>   maintaining backwards compatibility (though we may want to break
>   some obscure cases to avoid expending *too* much effort with weird
>   backcompat contortions that will only help a vanishingly small
>   proportion of the userbase), and a lot of the above changes can be
>   done as semi-independent mini-projects, so there's no need for some
>   branch to go off and spend a year rewriting the world.
>
>   Obviously there are still a lot of details to work out, though. But
>   overall, there was widespread agreement that this is among the biggest
>   pain points for our users (e.g. it's the single main request from
>   pandas), and fixing it is very high priority.
>
>   Some features that would become straightforward to implement
>   (e.g. even in third-party libraries) if this were fixed:
>   - missing value support
>   - physical unit tracking (meters / seconds -> array of velocity;
>     meters + seconds -> error)
>   - better and more diverse datetime representations (e.g. datetimes
>     with attached timezones, or using funky geophysical or
>     astronomical calendars)
>   - categorical data
>   - variable length strings
>   - strings-with-encodings (e.g. latin1)
>   - forward mode automatic differentiation (write a function that
>     computes f(x) where x is an array of float64; pass that function
>     an array with a special dtype and get out both f(x) and f'(x))
>   - probably others I'm forgetting right now
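>
>   For instance, the forward-mode autodiff case would look something
>   like this pure-Python sketch, where the Dual class stands in for
>   an array of a hypothetical dual-number dtype:

```python
class Dual:
    """Dual number carrying (value, derivative) pairs through plain code."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def f(x):
    # Ordinary code written for plain numbers...
    return 3 * x * x + 2 * x + 1

# ...evaluated with a Dual input yields both f(x) and f'(x).
x = Dual(2.0, 1.0)       # seed derivative dx/dx = 1
y = f(x)
print(y.value, y.deriv)  # 17.0 14.0
```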
>
>   I should also note that there was one substantial objection to this
>   plan, from Travis Oliphant (in discussions later in the
>   conference). I'm not confident I understand his objections well
>   enough to reproduce them here, though -- perhaps he'll elaborate.
>
>
> Money
> =====
>
>   There was an extensive discussion on the topic of: "if we had money,
>   what would we do with it?"
>
>   This is partially motivated by the realization that there are a
>   number of sources that we could probably get money from, if we had a
>   good story for what we wanted to do, so it's not just an idle
>   question.
>
>   Points of general agreement:
>
>   - Doing the in-person meeting was a good thing. We should plan to do
>     that again, at least once a year. So one thing to spend money on
>     is travel subsidies to make sure that happens and is productive.
>
>   - While it's tempting to imagine hiring junior people for the more
>     frustrating/boring work like maintaining buildbots, release
>     infrastructure, updating docs, etc., this seems difficult to do
>     realistically with our current resources -- how do we hire for
>     this, who would manage them, etc.?
>
>   - On the other hand, the general feeling was that if we found the
>     money to hire a few more senior people who could take care of
>     themselves more, then that would be good and we could
>     realistically absorb that extra work without totally unbalancing
>     the project.
>
>     - A major open question is how we would recruit someone for a
>       position like this, since apparently all the obvious candidates
>       who are already active on the NumPy team already have other
>       things going on. [For calibration on how hard this can be: NYU
>       has apparently had an open position for a year with the job
>       description of "come work at NYU full-time with a
>       private-industry-competitive-salary on whatever your personal
>       open-source scientific project is" (!) and still is having an
>       extremely difficult time filling it:
>       [http://cds.nyu.edu/research-engineer/]]
>
>     - General consensus, though, was that there isn't much to be done
>       about this except try it and see.
>
>     - (By the way, if you're someone who's reading this and
>       potentially interested in like a postdoc or better working on
>       numpy, then let's talk...)
>
>
> More specific changes to numpy that had general consensus, but don't
> really fit into a high-level roadmap
> ====================================
>
>   - Resolved: we should merge multiarray.so and umath.so into a single
>     extension module, so that they can share utility code without the
>     current awkward contortions.
>
>   - Resolved: we should start hiding new fields in the ufunc and dtype
>     structs as soon as possible going forward. (I.e. they would not be
>     present in the version of the structs that are exposed through the
>     C API, but internally we would use a more detailed struct.)
>     - Mayyyyyybe we should even go ahead and hide the subset of the
>       existing fields that are really internal details that no-one
>       should be using. If we did this without changing anything else
>       then it would preserve ABI (the fields would still be where
>       existing compiled extensions expect them to be, if any such
>       extensions exist) while breaking API (trying to compile such
>       extensions would give a clear error), so would be a smoother
>       ramp if we think we need to eventually break those fields for
>       real. (As discussed above, there are a bunch of fields in the
>       dtype base class that only make sense for specific dtype
>       subclasses, e.g. only record dtypes need a list of field names,
>       but right now all dtypes have one anyway. So it would be nice to
>       remove these from the base class entirely, but that is
>       potentially ABI-breaking.)
>
>   - Resolved: np.array should never return an object array unless
>     explicitly requested (e.g. with dtype=object); it just causes too
>     many surprising problems.
>     - First step: add a deprecation warning
>     - Eventually: make it an error.
>
>   - The matrix class
>     - Resolved: We won't add warnings yet, but we will prominently
>       document that it is deprecated and should be avoided wherever
>       possible.
>       - Stéfan van der Walt volunteers to do this.
>     - We'd all like to deprecate it properly, but the feeling was that
>       the precondition for this is for scipy.sparse to provide sparse
>       "arrays" that don't return np.matrix objects on ordinary
>       operations. Until that happens we can't reasonably tell people
>       that using np.matrix is a bug.
>
>   - Resolved: we should add a similar prominent note to the
>     "subclassing ndarray" documentation, warning people that this is
>     painful and barely works and please don't do it if you have any
>     alternatives.
>
>   - Resolved: we want more, smaller releases -- every 6 months at
>     least, aiming to go even faster (every 4 months?)
>
>   - On the question of using Cython inside numpy core:
>     - Everyone agrees that there are places where this would be an
>       improvement (e.g., Python<->C interfaces, and places "when you
>       want to do computer science", e.g. complicated algorithmic stuff
>       like graph traversals)
>     - Chuck wanted it to be clear though that he doesn't think it
>       would be a good goal to try and rewrite all of numpy in Cython
>       -- there also exist places where Cython ends up being "an uglier
>       version of C". No-one disagreed.
>
>   - Our text reader is apparently not very functional on Python 3, and
>     generally slow and hard to work with.
>     - Resolved: We should extract Pandas's awesome text reader/parser
>       and convert it into its own package, that could then become a
>       new backend for both pandas and numpy.loadtxt.
>     - Jeff thinks this is a great idea
>     - Thomas Caswell volunteers to do the extraction.
>
>   - We should work on improving our tools for evolving the ABI, so
>     that we will eventually be less constrained by decisions made
>     decades ago.
>     - One idea that had a lot of support was to switch from our
>       current append-only C-API to a "sliding window" API based on
>       explicit versions. So a downstream package might say
>
>       #define NUMPY_API_VERSION 4
>
>       and they'd get the functions and behaviour provided in "version
>       4" of the numpy C api. If they wanted to get access to new stuff
>       that was added in version 5, then they'd need to switch that
>       #define, and at the same time clean up any usage of stuff that
>       was removed or changed in version 5. And to provide a smooth
>       migration path, one version of numpy would support multiple
>       versions at once, gradually deprecating and dropping old
>       versions.
>
>     - If anyone wants to help bring pip up to scratch WRT tracking ABI
>       dependencies (e.g., 'pip install numpy==<version with new ABI>'
>       -> triggers rebuild of scipy against the new ABI), then that
>       would be an extremely useful thing.
>
>
> Policies that should be documented
> ==================================
>
>   ...together with some notes about what the contents of the document
>   should be:
>
>
> How we manage bugs in the bug tracker.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   - Github "milestones" should *only* be assigned to release-blocker
>     bugs (which mostly means "regression from the last release").
>
>     In particular, if you're tempted to push a bug forward to the next
>     release... then it's clearly not a blocker, so don't set it to the
>     next release's milestone, just remove the milestone entirely.
>
>     (Obvious exception to this: deprecation followup bugs where we
>     decide that we want to keep the deprecation around a bit longer
>     are a case where a bug actually does switch from being a blocker
>     for release 1.x to being a blocker for release 1.(x+1).)
>
>   - Don't hesitate to close an issue if there's no way forward --
>     e.g. a PR where the author has disappeared. Just post a link to
>     this policy and close, with a polite note that we need to keep our
>     tracker useful as a todo list, but they're welcome to re-open if
>     things change.
>
>
> Deprecations and breakage policy:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   - How long do we need to keep DeprecationWarnings around before we
>     break things? This is tricky because on the one hand an aggressive
>     (short) deprecation period lets us deliver new features and
>     important cleanups more quickly, but on the other hand a
>     too-aggressive deprecation period is difficult for our more
>     conservative downstream users.
>
>     - Idea that had the most support: pick a somewhat-aggressive
>       warning period as our default, and make a rule that if someone
>       asks for an extension during the beta cycle for the release that
>       removes it, then we put it back for another release or two worth
>       of grace period. (While also possibly upgrading the warning to
>       be more visible during the grace period.) This gives us
>       deprecation periods that are more adaptive on a case-by-case
>       basis.
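>
>       Mechanically, "upgrading the warning to be more visible" could
>       just mean switching warning categories, since Python hides
>       DeprecationWarning from end users by default but shows
>       FutureWarning. (The function names below are made up; this is
>       only a sketch of the mechanism.)

```python
import warnings

def old_function_default():
    # Normal deprecation period: quiet for end users by default,
    # visible to developers running test suites.
    warnings.warn("old_function is deprecated; use new_function",
                  DeprecationWarning, stacklevel=2)

def old_function_grace_period():
    # After a downstream request for an extension: keep the feature,
    # but make the warning visible by default.
    warnings.warn("old_function will be removed soon; use new_function",
                  FutureWarning, stacklevel=2)
```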
>
>   - Lament: it would be really nice if we could get more people to
>     test our beta releases, because in practice right now 1.x.0 ends
>     up being where we actually discover all the bugs, and 1.x.1 is
>     where it actually becomes usable. Which sucks, and makes it
>     difficult to have a solid policy about what counts as a
>     regression, etc. Is there anything we can do about this?
>

Just a note in here - have you all thought about running the test suites
for downstream projects as part of the numpy test suite?

Thanks so much for the summary - lots of interesting ideas in here!


>
>   - ABI breakage: we distinguish between an ABI break that breaks
>     everything (e.g., "import scipy" segfaults), versus an ABI break
>     that breaks an occasional rare case (e.g., only apps that poke
>     around in some obscure corner of some struct are affected).
>
>     - The "break-the-world" type remains off-limit for now: the pain
>       is still too large (conda helps, but there are lots of people
>       who don't use conda!), and there aren't really any compelling
>       improvements that this would enable anyway.
>
>     - For the "break-0.1%-of-users" type, it is *not* ruled out by
>       fiat, though we remain conservative: we should treat it like
>       other API breaks in principle, and do a careful case-by-case
>       analysis of the details of the situation, taking into account
>       what kind of code would be broken, how common these cases are,
>       how important the benefits are, whether there are any specific
>       mitigation strategies we can use, etc. -- with this process of
>       course taking into account that a segfault is nastier than a
>       Python exception.
>
>
> Other points that were discussed
> ================================
>
>   - There was inconclusive discussion of what we should do with dot()
>     in the places where it disagrees with the PEP 465 matmul semantics
>     (specifically this is when both arguments have ndim >= 3, or one
>     argument has ndim == 0).
>     - The concern is that the current behavior is not very useful, and
>       as far as we can tell no-one is using it; but, as people get
>       used to the more-useful PEP 465 behavior, they will increasingly
>       try to use it on the assumption that np.dot will work the same
>       way, and this will create pain for lots of people. So Nathaniel
>       argued that we should start at least issuing a visible warning
>       when people invoke the corner-case behavior.
>     - But OTOH, np.dot is such a core piece of infrastructure, and
>       there's such a large landscape of code out there using numpy
>       that we can't see, that others were reasonably wary of making
>       any change.
>     - For now: document prominently, but no change in behavior.
>
>
> Links to raw notes
> ==================
>
>   Main page:
>   [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
>
>   Notes from the meeting proper:
>   [
> https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing
> ]
>
>   Slides from the followup BoF:
>   [
> https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp
> ]
>
>   Notes from the followup BoF:
>   [
> https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit
> ]
>
> -n
>
> --
> Nathaniel J. Smith -- http://vorpus.org
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

