[SciPy-Dev] NumPy/SciPy participation in GSoC 2013

Ralf Gommers ralf.gommers at gmail.com
Tue Apr 2 14:02:13 EDT 2013


On Tue, Apr 2, 2013 at 5:24 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Mon, Apr 1, 2013 at 12:58 PM, Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> > On Tue, Mar 26, 2013 at 12:27 AM, Ralf Gommers <ralf.gommers at gmail.com>
> > wrote:
> >>
> >>
> >>
> >>
> >> On Thu, Mar 21, 2013 at 10:20 PM, Ralf Gommers <ralf.gommers at gmail.com>
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> It is the time of the year for Google Summer of Code applications. If we
> >>> want to participate with Numpy and/or Scipy, we need two things: enough
> >>> mentors and ideas for projects. If we get those, we'll apply under the PSF
> >>> umbrella. They've outlined the timeline they're working by and guidelines at
> >>> http://pyfound.blogspot.nl/2013/03/get-ready-for-google-summer-of-code.html.
> >>>
> >>> We should be able to come up with some interesting project ideas I'd
> >>> think, let's put those at
> >>> http://projects.scipy.org/scipy/wiki/SummerofCodeIdeas. Preferably with
> >>> enough detail to be understandable for people new to the projects and a
> >>> proposed mentor.
> >>>
> >>> We need at least 3 people willing to mentor a student. Ideally we'd have
> >>> enough mentors this week, so we can apply to the PSF on time. If you're
> >>> willing to be a mentor, please send me the following: name, email address,
> >>> phone nr, and what you're interested in mentoring. If you have time
> >>> constraints and have doubts about being able to be a primary mentor, being a
> >>> backup mentor would also be helpful.
> >>
> >>
> >> So far we've only got one primary mentor (thanks Chuck!), most core devs
> >> do not seem to have the bandwidth this year. If there are other people
> >> interested in mentoring please let me know. If not, then it looks like we're
> >> not participating this year.
> >
> >
> > Hi all, an update on GSoC'13. We do have enough mentoring power after all;
> > NumPy/SciPy is now registered as a participating project on the PSF page:
> > http://wiki.python.org/moin/SummerOfCode/2013
> >
> > Prospective students: please have a look at
> > http://wiki.python.org/moin/SummerOfCode/Expectations and at
> > http://projects.scipy.org/scipy/wiki/SummerofCodeIdeas. In particular note
> > that we require you to make one pull request to NumPy/SciPy which has to be
> > merged *before* the application deadline (May 3). So please start thinking
> > about that, and start a discussion on your project idea on this list.
>
> It doesn't look like I have the appropriate mojo to edit that page, but:
>
> - The NA thing at the bottom should just be deleted, dropping a
> student into that would be cruel...
>
> - Some new entries, perhaps someone could add?
>
> ---------------
>
> == Performance parity between numpy arrays and Python scalars ==
>
> Small numpy arrays are very similar to Python scalars -- but numpy
> incurs a fair amount of extra overhead for simple operations. For
> large arrays this doesn't matter, but for code that manipulates a lot
> of small pieces of data, it can be a serious bottleneck. For example:
> {{{
> In [1]: x = 1.0
>
> In [2]: numpy_x = np.asarray(x)
>
> In [3]: timeit x + x
> 10000000 loops, best of 3: 61 ns per loop
>
> In [4]: timeit numpy_x + numpy_x
> 1000000 loops, best of 3: 1.66 us per loop
> }}}
>
> This project would involve profiling simple operations like the above,
> determining where the bottlenecks are, and devising improved
> algorithms to solve them, with the goal of getting the numpy time as
> close as possible to the Python time. Not only would this make all
> numpy-using code faster, but it would pave the way for future
> simplifications in numpy's core, which currently has a lot of
> duplicate code that attempts to work around these slow paths instead
> of fixing them properly.
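
A minimal sketch of how the comparison in the quoted IPython session could be reproduced as a plain script, assuming NumPy is installed. The helper name per_call_ns and the loop counts are illustrative choices, not part of the proposal, and the numbers will differ from the 61 ns / 1.66 us quoted above depending on machine and NumPy version.

```python
import timeit

import numpy as np


def per_call_ns(stmt, namespace, number=200_000, repeat=3):
    """Best-of-`repeat` time per evaluation of `stmt`, in nanoseconds.

    Hypothetical helper for this sketch; it mirrors what IPython's
    `timeit` magic reports ("best of 3").
    """
    timings = timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat)
    return min(timings) / number * 1e9


x = 1.0                  # plain Python float
numpy_x = np.asarray(x)  # zero-dimensional float64 array holding the same value

scalar_ns = per_call_ns("x + x", {"x": x})
array_ns = per_call_ns("numpy_x + numpy_x", {"numpy_x": numpy_x})

print(f"Python float: {scalar_ns:8.1f} ns per loop")
print(f"0-d ndarray:  {array_ns:8.1f} ns per loop")
print(f"ratio:        {array_ns / scalar_ns:8.1f}x")
```

The ratio printed on the last line is the gap the project would try to close; a profiler would then be needed to attribute it to specific code paths.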
>
> Some possible concrete changes:
> 1. numpy's "ufunc loop lookup code" (which is used to determine, e.g.,
> whether to use the integer or floating-point versions of "+") is slow
> and inefficient.
> 2. Checking for floating point errors is very slow; we can and should
> do it less often.
> 3. When allocating the return value, the "+" for Python floats calls
> malloc() only once; numpy calls it twice (once for the array object
> itself, and a second time for the array data). Stashing both objects
> within a single allocation would be more efficient.
> 4. ...see what profiling says! We know 61 ns is possible.
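
As a rough illustration of item 1, the set of inner loops a ufunc can choose from is visible from Python through its `types` attribute. This sketch only shows what the lookup has to decide between; it does not reproduce the C lookup code itself, and the variable names are illustrative.

```python
import numpy as np

# Each entry is an "inputs->output" signature written with dtype character
# codes, e.g. "dd->d" is the float64 loop and "ff->f" the float32 loop.
loops = np.add.types
print(loops)

# The loop that gets selected determines the result dtype.
int_result = np.add(np.int32(1), np.int32(2))
float_result = np.add(np.float64(1), np.float64(2))
print(int_result.dtype)    # int32
print(float_result.dtype)  # float64

# For mixed inputs, a common type has to be found before a loop is chosen.
mixed = np.result_type(np.int32, np.float64)
print(mixed)               # float64
```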
>
> == Pythonic dtypes ==
>
> A numpy "dtype" is an object that knows how to work with different
> sorts of values, represented as fixed-length packed binary values. For
> example, the int32 dtype knows how to convert the Python object '-1'
> to the four-byte buffer 0xff 0xff 0xff 0xff.
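
The int32 example above can be checked directly, assuming NumPy is installed. The variable names are illustrative.

```python
import numpy as np

# Store the Python object -1 using the int32 dtype.
value = np.array(-1, dtype=np.int32)

# The packed representation is four bytes, all 0xff (two's complement),
# regardless of byte order.
raw = value.tobytes()
print(raw)  # b'\xff\xff\xff\xff'

# The same dtype also knows how to turn the buffer back into a value.
back = np.frombuffer(raw, dtype=np.int32)[0]
print(back)  # -1
```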
>
> Conceptually, dtype objects are arranged into a nice type hierarchy:
> http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png
>
> But implementation-wise, dtypes don't use the Python class system at
> all. There's just a single Python class (numpy.dtype), and all dtypes
> are instances of it. (This is because when numpy was first designed,
> they only expected there to be maybe 20 dtype objects total.) This
> turns out to cause a number of problems -- you can't define new dtypes
> from Python, only from C; you can't use isinstance to compare dtypes
> (you have to use a hacky numpy-specific API instead); different dtypes
> can't easily contain state (instead, the single dtype class has
> gradually sprouted new fields as new dtypes turned out to need them);
> etc. Basically we've been reinventing the Python class system, poorly.
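
A small sketch of the isinstance limitation described above. It assumes that np.issubdtype is the kind of numpy-specific API the message alludes to; the message does not name it. NumPy releases later than this thread have changed dtype internals, so only behaviour that is stable across versions is shown.

```python
import numpy as np

int_dt = np.dtype("int32")
float_dt = np.dtype("float64")

# Every dtype object passes this check, so it cannot tell dtypes apart.
both_are_dtypes = isinstance(int_dt, np.dtype) and isinstance(float_dt, np.dtype)
print(both_are_dtypes)  # True

# The conceptual hierarchy is queried through a numpy function instead
# (assumption: this is the "numpy-specific API" meant in the text).
print(np.issubdtype(int_dt, np.integer))     # True
print(np.issubdtype(float_dt, np.integer))   # False
print(np.issubdtype(float_dt, np.floating))  # True
print(np.issubdtype(int_dt, np.number))      # True
```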
>
> The goal for this project is to turn dtype classes into regular Python
> classes with a proper type hierarchy and using the standard Python
> mechanisms.
>
> Longer term goals (at least the first of which is probably achievable
> within the SoC timeline):
> 1. Allow for defining new dtypes using pure Python.
> 2. There are a bunch of special cases in the ufunc code for handling
> strings and record arrays; we should make the appropriate extensions
> to the dtype API so that they can become regular dtypes.
> 3. A proper categorical data dtype. (This is trivial once the above is
> done.)
> 4. NA dtypes
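
To make the idea concrete, here is a pure-Python sketch of what a class-based dtype hierarchy might look like. Every class and method name below is hypothetical and uses only the standard library's struct module; it is not NumPy's API and not a design taken from the proposal. It only illustrates ordinary subclassing, isinstance checks, and per-dtype state (here, the levels of a categorical type).

```python
import struct


class DType:
    """Hypothetical base class: one Python class per kind of dtype."""

    fmt = None  # struct format string, set by concrete subclasses

    @property
    def itemsize(self):
        return struct.calcsize(self.fmt)

    def pack(self, value):
        """Convert a Python object to its fixed-length binary form."""
        return struct.pack(self.fmt, value)

    def unpack(self, buf):
        """Convert the binary form back to a Python object."""
        return struct.unpack(self.fmt, buf)[0]


class Number(DType):
    pass


class Integer(Number):
    pass


class Floating(Number):
    pass


class Int32(Integer):
    fmt = "<i"  # little-endian 4-byte signed integer


class Float64(Floating):
    fmt = "<d"  # little-endian 8-byte float


class Categorical(DType):
    """Keeps its own state (the level names) on the instance."""

    fmt = "<b"  # store the level index in one signed byte

    def __init__(self, levels):
        self.levels = tuple(levels)

    def pack(self, value):
        return super().pack(self.levels.index(value))

    def unpack(self, buf):
        return self.levels[super().unpack(buf)]


print(Int32().pack(-1))                  # b'\xff\xff\xff\xff'
print(isinstance(Int32(), Integer))      # True
print(isinstance(Float64(), Integer))    # False

size = Categorical(["low", "high"])
print(size.unpack(size.pack("high")))    # high
```

With this shape, the hierarchy check is plain isinstance, and a new dtype such as the categorical one is just another subclass written in Python.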
>
> ------------------------------
>

Added those. Maybe one of the Trac admins can fix your edit rights.

>
> I'm sort of tempted to propose "sparse ndarrays" as a project, but I
> think that's too ambitious and would just end in tears...


That indeed sounds way too ambitious. Besides, there's already the idea of
fixing up scipy.sparse to be much more consistent with ndarrays. That will
yield some of the same benefits and should be a lot easier to do.

Ralf


> I guess the only way to allow for incremental deliverables would be to
> develop it as a separate package, with the incremental pieces being
> changes to numpy core that made it possible to hook in such a deep
> change (e.g., new ufunc looping structure) from out-of-package code.