[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015
Nathan Goldbaum
nathan12343 at gmail.com
Tue Aug 25 12:52:42 EDT 2015
On Tue, Aug 25, 2015 at 5:03 AM, Nathaniel Smith <njs at pobox.com> wrote:
> Hi all,
>
> These are the notes from the NumPy dev meeting held July 7, 2015, at
> the SciPy conference in Austin, presented here so the list can keep up
> with what happens, and so you can give feedback. Please do give
> feedback, none of this is final!
>
> (Also, if anyone who was there notices anything I left out or
> mischaracterized, please speak up -- these are a lot of notes I'm
> trying to gather together, so I could easily have missed something!)
>
> Thanks to Jill Cowan and the rest of the SciPy organizers for donating
> space and organizing logistics for us, and to the Berkeley Institute
> for Data Science for funding travel for Jaime, Nathaniel, and
> Sebastian.
>
>
> Attendees
> =========
>
> Present in the room for all or part: Daniel Allan, Chris Barker,
> Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
> Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
> pretty sure this list is incomplete)
>
> Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
>
>
> Formalizing our governance/decision making
> ==========================================
>
> This was a major focus of discussion. At a high level, the consensus
> was to steal IPython's governance document ("IPEP 29") and modify it
> to remove its use of a BDFL as a "backstop" to normal community
> consensus-based decision, and replace it with a new "backstop" based
> on Apache-project-style consensus voting amongst the core team.
>
> I'll send out a proper draft of this shortly for further discussion.
>
>
> Development roadmap
> ===================
>
> General consensus:
>
> Let's assume NumPy is going to remain important indefinitely, and
> try to make it better, instead of waiting for something better to
> come along. (This is unlikely to be wasted effort even if something
> better does come along, and it's hardly a sure thing that that will
> happen anyway.)
>
> Let's focus on evolving numpy as far as we can without major
> break-the-world changes (no "numpy 2.0", at least in the foreseeable
> future).
>
> And, as a target for that evolution, let's change our focus from
> "NumPy is the library that gives you the np.ndarray object (plus
> some attached infrastructure)" to "NumPy provides the standard
> framework for working with arrays and array-like objects in
> Python".
>
> This means creating defined interfaces between array-like objects /
> ufunc objects / dtype objects, so that it becomes possible for third
> parties to add their own and mix-and-match. Right now ufuncs are
> pretty good at this, but if you want a new array class or dtype then
> in most cases you pretty much have to modify numpy itself.
>
> Vision: instead of everyone who wants a new container type having to
> reimplement all of numpy, Alice can implement an array class using
> (sparse / distributed / compressed / tiled / gpu / out-of-core /
> delayed / ...) storage, pass it to code that was written using
> direct calls to np.* functions, and it just works. (Instead of
> np.sin being "the way you calculate the sine of an ndarray", it's
> "the way you calculate the sine of any array-like container
> object".)
>
> Vision: Darryl can implement a new dtype for (categorical data /
> astronomical dates / integers-with-missing-values / ...) without
> having to touch the numpy core.
>
> Vision: Chandni can then come along and combine them by doing
>
> a = alice_array([...], dtype=darryl_dtype)
>
> and it just works.
>
> Vision: no-one is tempted to subclass ndarray, because anything you
> can do with an ndarray subclass you can also easily do by defining
> your own new class that implements the "array protocol".
>
>
> Supporting third-party array types
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Sub-goals:
> - Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
> API right there.
> - Go through the rest of the stuff in numpy, and figure out some
> story for how to let it handle third-party array classes:
> - ufunc ALL the things: Some things can be converted directly into
> (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
> things could be converted into (g)ufuncs if we extended the
> (g)ufunc interface a bit (e.g. np.sort, np.matmul).
> - Some things probably need their own __numpy_ufunc__-like
> extensions (__numpy_concatenate__?)
> - Provide tools to make it easier to implement the more complicated
> parts of an array object (e.g. the bazillion different methods,
> many of which are ufuncs in disguise, or indexing)
> - Longer-run interesting research project: __numpy_ufunc__ requires
> that one or the other object have explicit knowledge of how to
> handle the other, so to handle binary ufuncs with N array types
> you need something like N**2 __numpy_ufunc__ code paths. As an
> alternative, if there were some interface that an object could
> export that provided the operations nditer needs to efficiently
> iterate over (chunks of) it, then you would only need N
> implementations of this interface to handle all N**2 operations.
>
> This would solve a lot of problems for projects like:
> - blosc
> - dask
> - distarray
> - numpy.ma
> - pandas
> - scipy.sparse
> - xray
>
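For illustration, a minimal sketch of the kind of third-party container
the __numpy_ufunc__ sub-goal is meant to enable. It assumes a NumPy
recent enough to ship the protocol under the name it was eventually
released with, __array_ufunc__ (NumPy 1.13 and later). The LoggedArray
class and its attributes are hypothetical and exist only for this
example.

```python
import numpy as np


class LoggedArray:
    """Hypothetical array-like container that is not an ndarray subclass."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.calls = []  # names of the ufuncs that produced this container

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.sin(x) are handled in this sketch;
        # reductions, accumulations, and out= are declined.
        if method != "__call__" or "out" in kwargs:
            return NotImplemented
        # Unwrap any LoggedArray inputs, run the ufunc on the raw data,
        # and wrap the result again.
        raw = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        result = LoggedArray(ufunc(*raw, **kwargs))
        result.calls = self.calls + [ufunc.__name__]
        return result


a = LoggedArray([0.0, np.pi / 2])
b = np.add(np.sin(a), 1.0)  # plain np.* calls dispatch to LoggedArray
print(type(b).__name__, b.data, b.calls)
```

Note that the unwrapping step only knows about LoggedArray itself. To
mix it with a second container type, one of the two classes would need
explicit knowledge of the other, which is the N**2 problem described in
the last sub-goal.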
>
> Supporting third-party dtypes
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We already have something like a C level "dtype
> protocol". Conceptually, the way you define a new dtype is by
> defining a new class whose instances have data attributes defining
> the parameters of the dtype (what fields are in *this* record dtype,
> how many characters are in *this* string dtype, what units are used
> for *this* datetime64, etc.), and you define a bunch of methods to
> do things like convert an object from a Python object to your dtype
> or vice-versa, to copy an array of your dtype from one place to
> another, to cast to and from your new dtype, etc. This part is
> great.
>
> The problem is, in the current implementation, we don't actually use
> the Python object system to define these classes / attributes /
> methods. Instead, all possible dtypes are jammed into a single
> Python-level class, whose struct has fields for the union of all
> possible dtypes' attributes, and instead of Python-style method
> slots there's just a big table of function pointers attached to each
> object.
>
> So the main proposal is that we keep the basic design, but switch it
> so that the float64 dtype, the int64 dtype, etc. actually literally
> are subclasses of np.dtype, each implementing their own fields and
> Python-style methods.
>
> Some of the pieces involved in doing this:
>
> - The current dtype methods should be cleaned up -- e.g. 'dot' and
> 'less_than' are both dtype methods, when conceptually they're much
> more like ufuncs.
>
> - The ufunc inner-loop interface currently does not get a reference
> to the dtype object, so inner loops can't see its attributes. This
> is a big obstacle to many interesting dtypes (e.g., it's hard to
> implement np.equal for categoricals if you don't know what
> categories each has). So we need to add new arguments to the core
> ufunc loop signature. (Fortunately this can be done in a
> backwards-compatible way.)
>
> - We need to figure out what exactly the dtype methods should be,
> and add them to the dtype class (possibly with backwards
> compatibility shims for anyone who is accessing PyArray_ArrFuncs
> directly).
>
> - Casting will be possibly the trickiest thing to work out, though
> the basic idea of using dunder-dispatch-like __cast__ and
> __rcast__ methods seems workable. (Encouragingly, this is
> exactly what dynd also does, though unfortunately dynd does not
> yet support user-defined dtypes even to the extent that numpy
> does, so there isn't much else we can steal from them.)
> - We may also want to rethink the casting rules while we're at it,
> since they have some very weird corners right now (e.g. see
> [https://github.com/numpy/numpy/issues/6240])
>
> - We need to migrate the current dtypes over to the new system,
> which can be done in stages:
>
> - First stick them all in a single "legacy dtype" class whose
> methods just dispatch to the PyArray_ArrFuncs per-object "method
> table"
>
> - Then move each of them into their own classes
>
> - We should provide a Python-level wrapper for the protocol, so that
> you can call dtype methods from Python
>
> - And vice-versa, it should be possible to subclass dtype at the
> Python level
>
> - etc.
>
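A rough pure-Python sketch of the "dtypes as subclasses, with
__cast__ / __rcast__ dispatch" idea from the list above. Nothing here
is NumPy API: the DType base class, the two subclasses, and the cast()
helper are all hypothetical, and the lookup order (source dtype first,
then target dtype) is an assumption modelled on how Python's binary
operators fall back from __add__ to __radd__.

```python
class DType:
    """Hypothetical base class; each concrete dtype is a subclass."""

    def __cast__(self, value, target):
        # Convert `value` from this dtype to `target`, or decline.
        return NotImplemented

    def __rcast__(self, value, source):
        # Convert `value` from `source` into this dtype, or decline.
        return NotImplemented


class Float64DType(DType):
    pass  # knows nothing about categoricals


class CategoricalDType(DType):
    def __init__(self, categories):
        self.categories = list(categories)  # per-instance dtype parameter

    def __cast__(self, code, target):
        # A categorical knows how to turn one of its codes into a float.
        if isinstance(target, Float64DType):
            return float(self.categories[code])
        return NotImplemented

    def __rcast__(self, value, source):
        # ...and how to turn a float back into a code.
        if isinstance(source, Float64DType):
            return self.categories.index(value)
        return NotImplemented


def cast(value, source, target):
    """Ask the source dtype first, then fall back to the target dtype."""
    result = source.__cast__(value, target)
    if result is NotImplemented:
        result = target.__rcast__(value, source)
    if result is NotImplemented:
        raise TypeError("no cast between these dtypes")
    return result


f64 = Float64DType()
cat = CategoricalDType([1.5, 2.5, 4.0])
print(cast(2, cat, f64))    # categorical code -> float
print(cast(2.5, f64, cat))  # float -> categorical code
```

The point of the two-sided dispatch is that Float64DType never has to
be modified: a new dtype can supply casts in both directions itself.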
> Fortunately, AFAICT pretty much all of this can be done while
> maintaining backwards compatibility (though we may want to break
> some obscure cases to avoid expending *too* much effort with weird
> backcompat contortions that will only help a vanishingly small
> proportion of the userbase), and a lot of the above changes can be
> done as semi-independent mini-projects, so there's no need for some
> branch to go off and spend a year rewriting the world.
>
> Obviously there are still a lot of details to work out, though. But
> overall, there was widespread agreement that this is one of the top
> pain points for our users (e.g. it's the single main request from
> pandas), and fixing it is very high priority.
>
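To make the categorical example above concrete, here is a small sketch
of why a comparison loop needs access to the dtype's parameters. The
"codes plus category table" layout and the two helper functions are
illustrative assumptions, not how NumPy or pandas store categoricals
internally.

```python
import numpy as np

# Two categorical arrays stored as integer codes plus a category table.
cats_a, codes_a = ["red", "green", "blue"], np.array([0, 1, 2, 0])
cats_b, codes_b = ["blue", "green", "red"], np.array([2, 1, 0, 0])


def equal_codes_only(x, y):
    # What an inner loop can do when it sees only the raw buffers.
    return x == y


def equal_with_categories(x, x_cats, y, y_cats):
    # What it can do once it is also handed the dtype parameters:
    # translate y's codes into x's code space, then compare.
    lookup = {label: code for code, label in enumerate(x_cats)}
    remap = np.array([lookup.get(label, -1) for label in y_cats])
    return x == remap[y]


print(equal_codes_only(codes_a, codes_b))
print(equal_with_categories(codes_a, cats_a, codes_b, cats_b))
```

The first array spells out red, green, blue, red and the second red,
green, blue, blue, so only the last element differs. The codes-only
comparison gets this wrong because the two category tables are ordered
differently.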
> Some features that would become straightforward to implement
> (e.g. even in third-party libraries) if this were fixed:
> - missing value support
> - physical unit tracking (meters / seconds -> array of velocity;
> meters + seconds -> error)
> - better and more diverse datetime representations (e.g. datetimes
> with attached timezones, or using funky geophysical or
> astronomical calendars)
> - categorical data
> - variable length strings
> - strings-with-encodings (e.g. latin1)
> - forward mode automatic differentiation (write a function that
> computes f(x) where x is an array of float64; pass that function
> an array with a special dtype and get out both f(x) and f'(x))
> - probably others I'm forgetting right now
>
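As one worked example from the list above, forward-mode automatic
differentiation can be imitated today with a wrapper class. This is
only a sketch: the Dual class and _lift helper are hypothetical, and a
real custom dtype would store the value/derivative pairs inside the
array rather than wrapping two separate arrays.

```python
import numpy as np


class Dual:
    """Hypothetical value/derivative pair standing in for a custom dtype."""

    def __init__(self, value, deriv):
        self.value = np.asarray(value, dtype=float)
        self.deriv = np.asarray(deriv, dtype=float)

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    # Treat plain numbers as constants (derivative zero).
    return x if isinstance(x, Dual) else Dual(x, 0.0)


def f(x):
    # Written with no knowledge of Dual.
    return x * x + 3 * x


x = np.array([0.0, 1.0, 2.0])
out = f(Dual(x, np.ones_like(x)))  # seed with dx/dx = 1
print(out.value, out.deriv)
```

Here out.value holds f(x) = x**2 + 3x and out.deriv holds
f'(x) = 2x + 3, both computed in a single pass through f.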
> I should also note that there was one substantial objection to this
> plan, from Travis Oliphant (in discussions later in the
> conference). I'm not confident I understand his objections well
> enough to reproduce them here, though -- perhaps he'll elaborate.
>
>
> Money
> =====
>
> There was an extensive discussion on the topic of: "if we had money,
> what would we do with it?"
>
> This is partially motivated by the realization that there are a
> number of sources that we could probably get money from, if we had a
> good story for what we wanted to do, so it's not just an idle
> question.
>
> Points of general agreement:
>
> - Doing the in-person meeting was a good thing. We should plan to do
> that again, at least once a year. So one thing to spend money on
> is travel subsidies to make sure that happens and is productive.
>
> - While it's tempting to imagine hiring junior people for the more
> frustrating/boring work like maintaining buildbots, release
> infrastructure, updating docs, etc., this seems difficult to do
> realistically with our current resources -- how do we hire for
> this, who would manage them, etc.?
>
> - On the other hand, the general feeling was that if we found the
> money to hire a few more senior people who could take care of
> themselves more, then that would be good and we could
> realistically absorb that extra work without totally unbalancing
> the project.
>
> - A major open question is how we would recruit someone for a
> position like this, since apparently all the obvious candidates
> who are already active on the NumPy team already have other
> things going on. [For calibration on how hard this can be: NYU
> has apparently had an open position for a year with the job
> description of "come work at NYU full-time with a
> private-industry-competitive-salary on whatever your personal
> open-source scientific project is" (!) and still is having an
> extremely difficult time filling it:
> [http://cds.nyu.edu/research-engineer/]]
>
> - General consensus, though, was that there isn't much to be done
> about this except try it and see.
>
> - (By the way, if you're someone who's reading this and
> potentially interested in like a postdoc or better working on
> numpy, then let's talk...)
>
>
> More specific changes to numpy that had general consensus, but don't
> really fit into a high-level roadmap
> ====================================================================
>
> - Resolved: we should merge multiarray.so and umath.so into a single
> extension module, so that they can share utility code without the
> current awkward contortions.
>
> - Resolved: we should start hiding new fields in the ufunc and dtype
> structs as soon as possible going forward. (I.e. they would not be
> present in the version of the structs that is exposed through the
> C API, but internally we would use a more detailed struct.)
> - Mayyyyyybe we should even go ahead and hide the subset of the
> existing fields that are really internal details that no-one
> should be using. If we did this without changing anything else
> then it would preserve ABI (the fields would still be where
> existing compiled extensions expect them to be, if any such
> extensions exist) while breaking API (trying to compile such
> extensions would give a clear error), so would be a smoother
> ramp if we think we need to eventually break those fields for
> real. (As discussed above, there are a bunch of fields in the
> dtype base class that only make sense for specific dtype
> subclasses, e.g. only record dtypes need a list of field names,
> but right now all dtypes have one anyway. So it would be nice to
> remove these from the base class entirely, but that is
> potentially ABI-breaking.)
>
> - Resolved: np.array should never return an object array unless
> explicitly requested (e.g. with dtype=object); it just causes too
> many surprising problems.
> - First step: add a deprecation warning
> - Eventually: make it an error.
>
> - The matrix class
> - Resolved: We won't add warnings yet, but we will prominently
> document that it is deprecated and should be avoided wherever
> possible.
> - Stéfan van der Walt volunteers to do this.
> - We'd all like to deprecate it properly, but the feeling was that
> the precondition for this is for scipy.sparse to provide sparse
> "arrays" that don't return np.matrix objects on ordinary
> operations. Until that happens we can't reasonably tell people
> that using np.matrix is a bug.
>
> - Resolved: we should add a similar prominent note to the
> "subclassing ndarray" documentation, warning people that this is
> painful and barely works and please don't do it if you have any
> alternatives.
>
> - Resolved: we want more, smaller releases -- every 6 months at
> least, aiming to go even faster (every 4 months?)
>
> - On the question of using Cython inside numpy core:
> - Everyone agrees that there are places where this would be an
> improvement (e.g., Python<->C interfaces, and places "when you
> want to do computer science", e.g. complicated algorithmic stuff
> like graph traversals)
> - Chuck wanted it to be clear though that he doesn't think it
> would be a good goal to try and rewrite all of numpy in Cython
> -- there also exist places where Cython ends up being "an uglier
> version of C". No-one disagreed.
>
> - Our text reader is apparently not very functional on Python 3, and
> generally slow and hard to work with.
> - Resolved: We should extract Pandas's awesome text reader/parser
> and convert it into its own package, that could then become a
> new backend for both pandas and numpy.loadtxt.
> - Jeff thinks this is a great idea
> - Thomas Caswell volunteers to do the extraction.
>
> - We should work on improving our tools for evolving the ABI, so
> that we will eventually be less constrained by decisions made
> decades ago.
> - One idea that had a lot of support was to switch from our
> current append-only C-API to a "sliding window" API based on
> explicit versions. So a downstream package might say
>
> #define NUMPY_API_VERSION 4
>
> and they'd get the functions and behaviour provided in "version
> 4" of the numpy C api. If they wanted to get access to new stuff
> that was added in version 5, then they'd need to switch that
> #define, and at the same time clean up any usage of stuff that
> was removed or changed in version 5. And to provide a smooth
> migration path, one version of numpy would support multiple
> versions at once, gradually deprecating and dropping old
> versions.
>
> - If anyone wants to help bring pip up to scratch WRT tracking ABI
> dependencies (e.g., 'pip install numpy==<version with new ABI>'
> -> triggers rebuild of scipy against the new ABI), then that
> would be an extremely useful thing.
>
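On the resolution above that np.array should never return an object
array unless explicitly asked: a small sketch of what the end state
might feel like, written as a wrapper around today's np.array. The
strict_array name and its error message are hypothetical, and choosing
TypeError is an assumption; the notes do not say which exception would
be raised.

```python
import numpy as np


def strict_array(obj, dtype=None):
    """Hypothetical wrapper: refuse object arrays unless asked for."""
    arr = np.array(obj, dtype=dtype)
    if arr.dtype == object and dtype is None:
        raise TypeError("input would become an object array; "
                        "pass dtype=object to request one explicitly")
    return arr


print(strict_array([1, 2, 3]).dtype)           # ordinary numeric array
print(strict_array([1, None], dtype=object))   # explicit request is fine
try:
    strict_array([1, None])                    # silent fallback refused
except TypeError as exc:
    print("refused:", exc)
```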
>
> Policies that should be documented
> ==================================
>
> ...together with some notes about what the contents of the document
> should be:
>
>
> How we manage bugs in the bug tracker.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> - Github "milestones" should *only* be assigned to release-blocker
> bugs (which mostly means "regression from the last release").
>
> In particular, if you're tempted to push a bug forward to the next
> release... then it's clearly not a blocker, so don't set it to the
> next release's milestone, just remove the milestone entirely.
>
> (Obvious exception to this: deprecation followup bugs where we
> decide that we want to keep the deprecation around a bit longer
> are a case where a bug actually does switch from being a blocker
> to release 1.x to being a blocker for release 1.(x+1).)
>
> - Don't hesitate to close an issue if there's no way forward --
> e.g. a PR where the author has disappeared. Just post a link to
> this policy and close, with a polite note that we need to keep our
> tracker useful as a todo list, but they're welcome to re-open if
> things change.
>
>
> Deprecations and breakage policy:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> - How long do we need to keep DeprecationWarnings around before we
> break things? This is tricky because on the one hand an aggressive
> (short) deprecation period lets us deliver new features and
> important cleanups more quickly, but on the other hand a
> too-aggressive deprecation period is difficult for our more
> conservative downstream users.
>
> - Idea that had the most support: pick a somewhat-aggressive
> warning period as our default, and make a rule that if someone
> asks for an extension during the beta cycle for the release that
> removes it, then we put it back for another release or two worth
> of grace period. (While also possibly upgrading the warning to
> be more visible during the grace period.) This gives us
> deprecation periods that are more adaptive on a case-by-case
> basis.
>
> - Lament: it would be really nice if we could get more people to
> test our beta releases, because in practice right now 1.x.0 ends
> up being where we actually discover all the bugs, and 1.x.1 is
> where it actually becomes usable. Which sucks, and makes it
> difficult to have a solid policy about what counts as a
> regression, etc. Is there anything we can do about this?
>
Just a note in here - have you all thought about running the test suites
for downstream projects as part of the numpy test suite?
Thanks so much for the summary - lots of interesting ideas in here!
>
> - ABI breakage: we distinguish between an ABI break that breaks
> everything (e.g., "import scipy" segfaults), versus an ABI break
> that breaks an occasional rare case (e.g., only apps that poke
> around in some obscure corner of some struct are affected).
>
> - The "break-the-world" type remains off-limits for now: the pain
> is still too large (conda helps, but there are lots of people
> who don't use conda!), and there aren't really any compelling
> improvements that this would enable anyway.
>
> - For the "break-0.1%-of-users" type, it is *not* ruled out by
> fiat, though we remain conservative: we should treat it like
> other API breaks in principle, and do a careful case-by-case
> analysis of the details of the situation, taking into account
> what kind of code would be broken, how common these cases are,
> how important the benefits are, whether there are any specific
> mitigation strategies we can use, etc. -- with this process of
> course taking into account that a segfault is nastier than a
> Python exception.
>
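For readers less familiar with the mechanics behind the deprecation
policy above, a short sketch using only the standard library's warnings
module. The old_name / new_name functions are hypothetical. The second
half shows one way a downstream project can find out early what will
break, by promoting the warning to an error in its own test runs.

```python
import warnings


def new_name(x):
    return x + 1


def old_name(x):
    """Hypothetical function scheduled for removal."""
    warnings.warn("old_name is deprecated; use new_name instead",
                  DeprecationWarning, stacklevel=2)
    return new_name(x)


# During the warning period the call still works...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert old_name(1) == 2
print([str(w.message) for w in caught])

# ...and a downstream test suite can opt in to seeing the eventual
# breakage early by promoting the warning to an error.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        old_name(1)
    except DeprecationWarning as exc:
        print("would break after removal:", exc)
```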
>
> Other points that were discussed
> ================================
>
> - There was inconclusive discussion of what we should do with dot()
> in the places where it disagrees with the PEP 465 matmul semantics
> (specifically this is when both arguments have ndim >= 3, or one
> argument has ndim == 0).
> - The concern is that the current behavior is not very useful, and
> as far as we can tell no-one is using it; but, as people get
> used to the more-useful PEP 465 behavior, they will increasingly
> try to use it on the assumption that np.dot will work the same
> way, and this will create pain for lots of people. So Nathaniel
> argued that we should start at least issuing a visible warning
> when people invoke the corner-case behavior.
> - But OTOH, np.dot is such a core piece of infrastructure, and
> there's such a large landscape of code out there using numpy
> that we can't see, that others were reasonably wary of making
> any change.
> - For now: document prominently, but no change in behavior.
>
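The two corner cases mentioned above can be seen directly. This sketch
assumes a NumPy that provides np.matmul (1.10 or later); the array
shapes are arbitrary.

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# np.dot contracts the last axis of a with the second-to-last axis of b
# and keeps every other axis, so the leading axes are not paired up.
print(np.dot(a, b).shape)      # (2, 3, 2, 5)

# np.matmul (PEP 465's @) treats the leading axes as a batch dimension.
print(np.matmul(a, b).shape)   # (2, 3, 5)

# ndim == 0: np.dot multiplies by the scalar, np.matmul rejects it.
print(np.dot(2.0, a).shape)    # (2, 3, 4)
try:
    np.matmul(2.0, a)
except (TypeError, ValueError) as exc:
    print("matmul rejects scalars:", type(exc).__name__)
```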
>
> Links to raw notes
> ==================
>
> Main page:
> [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
>
> Notes from the meeting proper:
> [https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing]
>
> Slides from the followup BoF:
> [https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp]
>
> Notes from the followup BoF:
> [https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit]
>
> -n
>
> --
> Nathaniel J. Smith -- http://vorpus.org