[SciPy-dev] Ideas for scipy.sparse?

Brian Granger ellisonbg.net at gmail.com
Tue Apr 15 12:37:55 EDT 2008


>  This looks really exciting. I looked at the project, and it seems like
>  you are using MPI as the basic infrastructure - but I also see
>  references to GASNet. My first question looking at this was, why not
>  just use Global Arrays for the basic array infrastructure ? It has a
>  bunch of stuff that would let you get quite far quickly:

Yes, we are using MPI.  The GASNet stuff was just playing around.  I
looked at global arrays as well, but (for now) have stuck with MPI for
the following reasons:

1) There are really stable, easy-to-install Python bindings for MPI
(mpi4py.scipy.org) - a minimal example is sketched below.

2) MPI is more widely available and easier to install.  OS X ships with
an MPI implementation and there are Linux packages for it.  This makes
life easier for our users, who need to install it themselves.

3) Everybody else is doing it

4) I tried installing GA on my Mac and the build failed.  I am more
than willing to debug an install process, but my users won't be.

5) There are not any good Python bindings for GA.  I know Robert
Harrison had some Python bindings, but I can't find them and they
probably are not being maintained.

That said, I would very much like to be able to use the GA/GASNet stuff
in the future.
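
To give a flavor of what those bindings look like (the sketch mentioned
under point 1), here is about the smallest useful mpi4py program.  It
assumes only a working MPI install plus mpi4py and numpy; it is not
taken from the IPython1 code base.

# sum_demo.py - minimal mpi4py sketch.  Each rank owns a block of a
# conceptually distributed array; the block sums are reduced to rank 0.
# Run with, e.g.:  mpiexec -n 4 python sum_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Local block of a length size*100 "distributed" array.
local = np.arange(rank * 100, (rank + 1) * 100, dtype='d')

# Sum the local pieces and collect the result on rank 0.
total = comm.reduce(local.sum(), op=MPI.SUM, root=0)
if rank == 0:
    print("global sum: %f" % total)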

>  http://www.emsl.pnl.gov/docs/global/
>
>  I believe, looking at the project organization, that you plan to add
>  fft, sparse arrays, the works.

Ideally yes.

>  I must mention that I work with Star-P (www.interactivesupercomputing.com
>  ), and we have plugged in our parallel runtime to overload parts of
>  numpy and scipy. We have sparse support for M (matlab), but not for
>  python yet.

Were you at PyCon 2007?  That explains your interest in scipy.sparse.
I guess it would be very good for you if the stuff in scipy.sparse
moved into numpy?  Having that code base settle down would probably
help as well.  What are your plans?

>  I am guessing that your framework allows one to run several instances
>  of python scripts as well on each node ? That way, one could solve
>  several ODEs across nodes. You'd need some dynamic scheduling for
>  workload balancing, perhaps.

Our "framework" is IPython1:

http://ipython.scipy.org/moin/IPython1

Yes, IPython1 enables Python scripts to be distributed across a set of
nodes.  We have both static and dynamic load balancing.  Compute nodes
can be started with MPI, but that is optional.  This allows compute
nodes to come and go at any time, and the dynamic load balancer has
fault tolerance built in.

>  Have you thought about integrating this with the stackless python
>  project, for a nice programming model that abstracts away from MPI and
>  allows high performance parallel programs to be written entirely in
>  python ?

Yes, but I am really hesitant to require people to install a custom
version of Python.  As I am sure you know, if users can't easily
install the tools you are building (especially if they are
parallel/distributed), it is a no go.  That kills stackless for us,
even though I really like the ideas behind stackless.  Another option
is greenlets.

> > Here is the main issue:  I am running into a need for sparse arrays.
>  > There are two places I am running into this:
>  >
>  > 1) I want to implement sparse distributed arrays and need sparse local
>  > arrays for this.
>  >
>  > 2) There are other places in the implementation where sparse arrays
>  > are needed.
>  >
>
> > Obviously, my first though was scipy.sparse.  I am _really_ excited
>  > about the massive improvements that have been happening in this area
>  > recently.  Here are the problems I am running into:
>  Not having followed the numpy/scipy debate for too long, I must say
>  that I was a little surprised to find sparse array support in scipy
>  rather than numpy.

Yep, the work that people have done on scipy.sparse (especially
Nathan) is fantastic.  But, I do think the basic sparse arrays should
be in numpy.

> > 1) I need N-dimensional sparse arrays.  Some of the storage formats in
>  > scipy.sparse (dok, coo, maybe lil) could be generalized to
>  > N-dimensions, but some work would have to be done.
>   From the discussion, it seemed to me that these formats are written
>  in python itself. In general, I like the idea of writing data
>  structures in the same language, and as much as possible in
>  higher-level languages.
>
>  Some people pointed out that this tends to be slow when called in for
>  loops. I couldn't figure out what the cause of this was. Is it that
>  for loops in python are generally slow, or is it that indexing
>  individual elements in sparse data structures is slow or are python's
>  data structures slow, or some combination of all three ?

Compared to loops in C, Python loops are quite slow - especially if
you are calling Python functions at each iteration.
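
To make the N-d generalization concrete, here is a throwaway sketch of
a dict-of-keys array with tuple keys.  It is illustrative only - not
the scipy.sparse dok implementation, and the class name is made up:

# ndsparse.py - toy N-dimensional dict-of-keys sparse array
# (illustration only; not scipy.sparse.dok_matrix).
class NDSparse(object):
    def __init__(self, shape, default=0.0):
        self.shape = tuple(shape)
        self.default = default
        self.data = {}                     # maps index tuples -> values

    def __getitem__(self, index):
        return self.data.get(index, self.default)

    def __setitem__(self, index, value):
        if len(index) != len(self.shape):
            raise IndexError("expected %d indices" % len(self.shape))
        if value == self.default:
            self.data.pop(index, None)     # don't store explicit zeros
        else:
            self.data[index] = value

# A 3-d sparse array with two stored entries.
a = NDSparse((100, 100, 100))
a[3, 4, 5] = 1.5
a[99, 0, 1] = -2.0
print(a[3, 4, 5])   # 1.5
print(a[0, 0, 0])   # 0.0

Every access here is a Python method call plus a dict lookup, which is
exactly the per-element overhead you pay when you drive such a
structure from a tight for loop.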

>  I can say something about Matlab's sparse matrices, which I am
>  familiar with. It's not that Matlab's for loops are the cause of
>  slowness (they may be slow), but that indexing single elements in a
>  sparse matrix is inherently slow. Every instance of A(i, j) has to
>  perform a binary search. Insertion is even more expensive because it
>  requires a memmove for CSC/CSR formats. Thus you almost never work
>  with for loops and single-element indexing for sparse matrices in
>  Matlab. That said, one could design a data structure that partially
>  solves some of these problems.

True, there is the algorithmic complexity of sparse array indexing,
but that is on top of the overhead that Python (and likely Matlab)
has.

>  I thought initially that dok may solve this for some kinds of sparse
>  matrix problems, but it seems that it's too slow for large problems.
>  Perhaps a little benchmark should be set up.

I think the dok implementation in scipy.sparse could be optimized, but
I don't have time to pursue this right now.
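
As a rough starting point for such a benchmark, something along these
lines would do (sizes are arbitrary; duplicate indices get summed in
the coo path but overwritten in the dok path, which is close enough for
a first timing):

# Rough benchmark sketch: element-wise assignment into dok_matrix versus
# building the same matrix in one shot from coordinate (COO) data.
import time
import numpy as np
from scipy import sparse

n, nnz = 10000, 100000
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
vals = np.random.rand(nnz)

t0 = time.time()
A = sparse.dok_matrix((n, n))
for i, j, v in zip(rows, cols, vals):
    A[i, j] = v                          # one Python-level call per element
t1 = time.time()

B = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
t2 = time.time()

print("dok element-wise: %.3f s" % (t1 - t0))
print("coo bulk build:   %.3f s" % (t2 - t1))

For large nnz the bulk coo/csr construction path should win by a wide
margin, since the dok path pays Python call overhead on every element.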

>  Brian,
>
>  For N-d are you thinking of something along the lines of Tammy Kolda's
>  tensor toolbox for Matlab ? That could be accomplished with just
>  distributed dense vectors - I'm not sure what the performance would be
>  like. It's the same as using a distributed coo format.
>
>  http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
>  http://www.models.kvl.dk/source/nwaytoolbox/index.asp

I will have a look at this.

>  <snip>
>
> >
>  > 3) That we begin to move their implementation to using Cython (as an
>  > aside, cython does play very well with templated C++ code).  This
>  > could provide a much nicer way of tying into the c++ code than using
>  > swig.
>  Not being intimately familiar with cython or swig, I did take a look
>  at the cython homepage and it sounded like a good fit. It does seem
>  that some people have strong objections to it.  Support for templated
>  C++ code would be nice to have though !

While it is sort of a hack, you can handle templates in cython:

http://wiki.cython.org/WrappingCPlusPlus

Even though this is a hack and it takes a while to get used to, it
works really well and people are starting to use it.  The big
advantage of this approach is that the generated C++ code under the
hood is _really_ fast.  There is almost no overhead that cython adds.
This is much better than swig, which adds too many layers of cruft and
forces calls to go through extra Python-level wrappers.
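
One variant of that approach, beyond the name-string tricks on the wiki
page, is to write a tiny extern "C" shim around the single template
instantiation you care about and wrap the shim from a cdef class.  The
sketch below is Cython, and everything in it is hypothetical: a
SparseVec<double> class exposed through a spvec_shim.h header with
spvecd_* functions.

# spvec.pyx - hypothetical sketch of wrapping one instantiation of a
# templated C++ class (SparseVec<double>).  spvec_shim.h is assumed to
# expose a small extern "C" shim:
#
#   void  *spvecd_new(int n);
#   void   spvecd_del(void *v);
#   void   spvecd_set(void *v, int i, double x);
#   double spvecd_get(void *v, int i);

cdef extern from "spvec_shim.h":
    void *spvecd_new(int n)
    void spvecd_del(void *v)
    void spvecd_set(void *v, int i, double x)
    double spvecd_get(void *v, int i)

cdef class SparseVecD:
    """Thin Python wrapper around a SparseVec<double> instance."""
    cdef void *ptr

    def __cinit__(self, int n):
        self.ptr = spvecd_new(n)

    def __dealloc__(self):
        if self.ptr != NULL:
            spvecd_del(self.ptr)

    def __setitem__(self, int i, double x):
        spvecd_set(self.ptr, i, x)

    def __getitem__(self, int i):
        return spvecd_get(self.ptr, i)

The nice part is that __dealloc__ guarantees the underlying C++ object
is freed when the Python wrapper goes away, and every call crosses
straight into C with essentially no wrapper overhead.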

In terms of opposition to cython, there is very little on the whole.
The overall pattern I am seeing is that _tons_ of people are beginning
to use it (mpi4py, ipython, numpy/scipy, nipy, sage, etc.).

Cheers,

>  -viral


