[SciPy-Dev] NumPy/SciPy participation in GSoC 2013

Nathaniel Smith njs at pobox.com
Tue Apr 2 11:24:25 EDT 2013


On Mon, Apr 1, 2013 at 12:58 PM, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> On Tue, Mar 26, 2013 at 12:27 AM, Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>>
>>
>>
>>
>> On Thu, Mar 21, 2013 at 10:20 PM, Ralf Gommers <ralf.gommers at gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> It is the time of the year for Google Summer of Code applications. If we
>>> want to participate with Numpy and/or Scipy, we need two things: enough
>>> mentors and ideas for projects. If we get those, we'll apply under the PSF
>>> umbrella. They've outlined the timeline they're working by and guidelines at
>>> http://pyfound.blogspot.nl/2013/03/get-ready-for-google-summer-of-code.html.
>>>
>>> We should be able to come up with some interesting project ideas I'd
>>> think, let's put those at
>>> http://projects.scipy.org/scipy/wiki/SummerofCodeIdeas. Preferably with
>>> enough detail to be understandable for people new to the projects and a
>>> proposed mentor.
>>>
>>> We need at least 3 people willing to mentor a student. Ideally we'd have
>>> enough mentors this week, so we can apply to the PSF on time. If you're
>>> willing to be a mentor, please send me the following: name, email address,
>>> phone nr, and what you're interested in mentoring. If you have time
constraints and have doubts about being able to be a primary mentor, being a
>>> backup mentor would also be helpful.
>>
>>
>> So far we've only got one primary mentor (thanks Chuck!), most core devs
>> do not seem to have the bandwidth this year. If there are other people
>> interested in mentoring please let me know. If not, then it looks like we're
>> not participating this year.
>
>
> Hi all, an update on GSoC'13. We do have enough mentoring power after all;
> NumPy/SciPy is now registered as a participating project on the PSF page:
> http://wiki.python.org/moin/SummerOfCode/2013
>
> Prospective students: please have a look at
> http://wiki.python.org/moin/SummerOfCode/Expectations and at
> http://projects.scipy.org/scipy/wiki/SummerofCodeIdeas. In particular note
> that we require you to make one pull request to NumPy/SciPy which has to be
> merged *before* the application deadline (May 3). So please start thinking
> about that, and start a discussion on your project idea on this list.

It doesn't look like I have the appropriate mojo to edit that page, but:

- The NA thing at the bottom should just be deleted; dropping a
student into that would be cruel...

- Here are some new entries; perhaps someone could add them?

---------------

== Performance parity between numpy arrays and Python scalars ==

Small numpy arrays are very similar to Python scalars -- but numpy
incurs a fair amount of extra overhead for simple operations. For
large arrays this doesn't matter, but for code that manipulates a lot
of small pieces of data, it can be a serious bottleneck. For example:
{{{
In [1]: x = 1.0

In [2]: numpy_x = np.asarray(x)

In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop

In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop
}}}

This project would involve profiling simple operations like the above,
determining where the bottlenecks are, and devising improved
algorithms to solve them, with the goal of getting the numpy time as
close as possible to the Python time. Not only would this make all
numpy-using code faster, but it would pave the way for future
simplifications in numpy's core, which currently has a lot of
duplicate code that attempts to work around these slow paths instead
of fixing them properly.

Some possible concrete changes:
1. numpy's "ufunc loop lookup code" (which is used to determine, e.g.,
whether to use the integer or floating-point versions of "+") is slow
and inefficient.
2. Checking for floating point errors is very slow; we can and should
do it less often.
3. When allocating the return value, the "+" for Python floats calls
malloc() only once; numpy calls it twice (once for the array object
itself, and a second time for the array data). Stashing both objects
within a single allocation would be more efficient.
4. ...see what profiling says! We know 61 ns is possible (see the
rough timing sketch below).
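
As a rough, illustrative sketch of a starting point (exact numbers will
vary by machine and numpy version), the per-call overhead can be made
visible by timing the same operation across array sizes:
{{{
import timeit

# Sketch: the overhead of "a + a" is roughly constant per call, so it
# dominates for tiny arrays and becomes negligible for large ones.
for n in (1, 10, 100, 10000):
    setup = 'import numpy as np; a = np.ones(%d)' % n
    total = timeit.timeit('a + a', setup=setup, number=100000)
    print('n=%-6d %.2f us per loop' % (n, total * 10.0))
}}}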

== Pythonic dtypes ==

A numpy "dtype" is an object that knows how to work with different
sorts of values, represented as fixed-length packed binary values. For
example, the int32 dtype knows how to convert the Python object '-1'
to the four-byte buffer 0xff 0xff 0xff 0xff.
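
For concreteness (a minimal illustration; the exact repr depends on the
Python version), the packed bytes can be inspected directly:
{{{
In [1]: import numpy as np

In [2]: np.array(-1, dtype=np.int32).tostring()  # raw buffer produced by the int32 dtype
Out[2]: '\xff\xff\xff\xff'
}}}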

Conceptually, dtype objects are arranged into a nice type hierarchy:
http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png

But implementation-wise, dtypes don't use the Python class system at
all. There's just a single Python class (numpy.dtype), and all dtypes
are instances of it. (This is because when numpy was first designed,
its authors expected there to be only maybe 20 dtype objects in total.) This
turns out to cause a number of problems -- you can't define new dtypes
from Python, only from C; you can't use isinstance to compare dtypes
(you have to use a hacky numpy-specific API instead); different dtypes
can't easily contain state (instead, the single dtype class has
gradually sprouted new fields as new dtypes turned out to need them);
etc. Basically we've been reinventing the Python class system, poorly.
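
To make that concrete, here's a small illustrative sketch (the exact
type repr depends on the numpy version):
{{{
In [1]: import numpy as np

In [2]: type(np.dtype('float64'))  # every dtype is an instance of this single class
Out[2]: <type 'numpy.dtype'>

In [3]: isinstance(np.dtype('float64'), np.floating)  # isinstance can't see the conceptual hierarchy...
Out[3]: False

In [4]: np.issubdtype(np.float64, np.floating)  # ...so a numpy-specific API is needed instead
Out[4]: True
}}}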

The goal of this project is to turn dtypes into regular Python
classes with a proper type hierarchy, using the standard Python
mechanisms.

Longer term goals (at least the first of which is probably achievable
within the SoC timeline):
1. Allow for defining new dtypes using pure Python.
2. There are a bunch of special cases in the ufunc code for handling
strings and record arrays; we should make the appropriate extensions
to the dtype API so that they can become regular dtypes.
3. A proper categorical data dtype. (This is trivial once the above is done.)
4. NA dtypes

------------------------------

I'm sort of tempted to propose "sparse ndarrays" as a project, but I
think that's too ambitious and would just end in tears... I guess the
only way to allow for incremental deliverables would be to develop it
as a separate package, with the incremental pieces being changes to
numpy core that made it possible to hook in such a deep change (e.g.,
new ufunc looping structure) from out-of-package code.

-n


