[SciPy-dev] Ideas for scipy.sparse?

Brian Granger ellisonbg.net at gmail.com
Sat Apr 12 11:37:31 EDT 2008


>  It seems to me that, in the interests of efficiency, it will not often
>  be a good idea to allocate data to processors on an element-by-element
>  basis; instead one will often want to allocate blocks of elements to
>  each processor. This will be very inefficiently represented by
>  ordinary sparse matrices, which take no advantage of ranges or
>  higher-dimensional blocks of nonzero entries. What's more, an opaque
>  sparse matrix library would be very frustrating in this context, since
>  you want to distinguish unavailable entries from entries that are
>  actually zero.

You are absolutely correct.  But the optimal partitioning is
determined by the particular application.  We are trying to write a
distributed array library that is application neutral.  Thus, the
partitioning algorithms are essentially arbitrary, and we have to
design the rest of our library to work with these arbitrary
distributions.  We provide block, cyclic and block-cyclic
distributions by default, but the user can write their own
partitioners with very few constraints.  The only constraint we have
is that the data distributions are Cartesian products along the
different axes.

Of course, we select reasonable defaults for the user - the default
is a block distribution along the first axis.
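
To make this concrete, here is a rough sketch of what the three
default 1-d distributions look like as index maps.  The function
names are purely illustrative, not our actual API:

    import numpy as np

    def block_indices(n, nprocs, rank):
        # global indices owned by `rank` under a block distribution
        size = -(-n // nprocs)  # ceil(n / nprocs)
        return np.arange(rank * size, min((rank + 1) * size, n))

    def cyclic_indices(n, nprocs, rank):
        # global indices owned by `rank` under a cyclic distribution
        return np.arange(rank, n, nprocs)

    def block_cyclic_indices(n, nprocs, rank, blocksize):
        # global indices owned by `rank` under a block-cyclic
        # distribution
        idx = np.arange(n)
        return idx[(idx // blocksize) % nprocs == rank]

    # Because distributions are Cartesian products along the axes, a
    # 2-d distribution is just one of these per axis, e.g. block along
    # axis 0 (our default) and cyclic along axis 1:
    #   local = global_array[np.ix_(block_indices(m, p0, r0),
    #                               cyclic_indices(n, p1, r1))]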

>  It seems like what you need is some sort of proxy object that keeps
>  track of some chunks of an array and serves them up as requested -
>  possibly even as views of underlying numpy arrays - if available, and
>  calls some network message-passing code if they're not available.

Yes, the way we are designing our interface, when a particular
processor needs certain global elements, it has to prefetch them
before actually using them - this is very much like the proxying you
are suggesting.  We will have block-oriented prefetchers that fetch
blocks and store them as local dense numpy arrays, but we will also
have single-element prefetchers - and for those, sparse is the only
way to go.
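
As a rough illustration (again, not our actual API - fetch_remote
here stands in for whatever message-passing call retrieves an element
from its owner), a single-element prefetcher backed by a sparse local
cache might look like this:

    import scipy.sparse as sp

    class ElementPrefetcher:
        def __init__(self, shape, fetch_remote):
            # dok_matrix is cheap for incremental single-element
            # inserts
            self._cache = sp.dok_matrix(shape)
            # a separate set of fetched coordinates lets us
            # distinguish "fetched and equal to zero" from "never
            # fetched" - the distinction you raised above
            self._have = set()
            self._fetch = fetch_remote

        def prefetch(self, coords):
            # pull the listed (i, j) entries into the local cache
            for ij in coords:
                if ij not in self._have:
                    self._cache[ij] = self._fetch(ij)
                    self._have.add(ij)

        def __getitem__(self, ij):
            if ij not in self._have:
                raise KeyError("element %r was not prefetched" % (ij,))
            return self._cache[ij]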

The situation is similar to using numpy arrays - there are slow and
fast ways of using them.  It is up to the user to choose how to use
numpy arrays for their particular problem.  But numpy arrays still
work as expected in either case (even if you write inefficient code).

>  How do existing distributed-array toolkits handle the problem?

They provide horribly complex interfaces and make way too many
assumptions (only vectors and matrices, only certain distributions) to
be useful to your average scientist :)

Cheers,

Brian

>  Anne


