[SciPy-dev] slicing vs. advanced selection -- can be there smth in the middle? ; -)

Mon Jan 14 11:55:11 EST 2008

On 14/01/2008, Yaroslav Halchenko <lists at onerussian.com> wrote:

> At the moment, there are 2 possibilities to select a subregion of an
> array
>
> 1. slicing -- memory-efficient -- no copying is done, it is just a
> view over the original data (thus even .flags.owndata=False).  Needs to
> be done by specifying Slice instance(s) (explicitly or not) in the
> index, ie
> b=a[ 1:4, 2:5 ]
>
> 2. advanced selection where either a list of indexes is given or a mask
> c=a[ [1,2,3], [2,3,4] ]
>
> in this case the data gets copied

The first answer is that numpy cannot do what you want. Every numpy
array is a view onto a single block of memory, with data elements
spaced evenly along it in each dimension (a fixed stride per axis).
This is built into the C-level indexing throughout numpy. An arbitrary
list of indices has no such regular spacing, which is why fancy
indexing *must* copy. Thus there's no way to do what you want and get
a proper numpy array out. But read on...
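
To make the view/copy distinction concrete, here is a small sketch (the
array shape and values are made up for illustration) that checks which
results share memory with the original:

```python
import numpy as np

a = np.arange(48).reshape(6, 8)   # illustrative data, not from pymvpa

b = a[1:4, 2:5]                   # basic slicing: a view
c = a[[1, 2, 3], [2, 3, 4]]       # fancy indexing: a copy of 3 elements

# A slice reuses a's buffer at an offset and keeps a's strides.
slice_shares = np.shares_memory(a, b)
slice_strides_match = (b.strides == a.strides)

# Fancy indexing gathers the elements into a fresh buffer.
fancy_shares = np.shares_memory(a, c)

# Writing through the view changes a; writing to the copy does not.
b[0, 0] = -1
c[0] = -2
```

After this runs, a[1, 2] holds the value written through b, and the
write to c has no effect on a.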

If you're willing to be a little awkward, you can also make lists:

d = [ a[i,:] for i in [2,3,7] ]

Here the data does not get copied either: each a[i,:] is a view of one
row. Unfortunately, you lose the numpy features on the outer indexing;
also note that lists introduce an overhead of at least four or eight
bytes (one pointer) per list element, so you almost certainly do not
want to use a list-of-lists.
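
A quick sketch of the list-of-views idea, with an arbitrary small array
standing in for your data. It also shows the catch: as soon as you want
one array back, you have to stack the rows, and that copies.

```python
import numpy as np

a = np.zeros((10, 4))          # illustrative data
rows = [2, 3, 7]

# Each list element is a 1-D view of one row of a; no data is copied.
d = [a[i, :] for i in rows]
all_views = all(np.shares_memory(a, r) for r in d)

# Writes go through to the original array.
d[0][:] = 1.0                  # modifies a[2, :]

# Getting a single 2-D array back requires stacking, which copies.
stacked = np.vstack(d)
stacked_shares = np.shares_memory(a, stacked)
```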

> In the application we are developing (pymvpa)  we are dealing with
> relatively large arrays of data (primarily 2D while processing), where
> first dimension corresponds to different samples, 2nd -- to different
> features.
>
> The problem is that we often need to sample from an array. For
> instance to check cross-validation on N-1 fold we are to generate N
> 'views' over original array. In each such "view" 1 sample (row) is not present
> while training, and it is used as a sample to test against later on.
> so at the end instead of N data records, in current implementation we
> end up with N*(N-1) records (if we are to keep those views for further
> analysis).
>
> But that is not only the case with the 1st dimension -- we have to do
> similarly 'evil' selection of the features, which again leads to quite
> a big waste of memory.
>
> Thus I wondered, is there any facility which could help us out (maybe
> by sacrificing reasonable computation cost) and have really a view on
> top of an array. We don't really need a sparse representation -- we are
> selecting a set of rows and columns, so every column (and similar across
> rows) for a given 'view' uses the same steps/increments between
> the elements.

If I understand you correctly, your selections tend to be
"all-but-one" selections, though maybe in both dimensions. In this
case, you can get arrays that are two contiguous parts:

v = (a[:n],a[n+1:])

These can be views. Of course indexing them is more annoying, but here
you are trading convenience for memory savings. If you like, you can
concatenate these, producing contiguous copies, before processing, and
then discard the copies.
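
Roughly, for the leave-one-out case along the first dimension, this
could look like the sketch below. The helper name leave_one_out is my
own invention for illustration, not anything in numpy or pymvpa.

```python
import numpy as np

def leave_one_out(a, n):
    """Hypothetical helper: split a around row n without copying.

    Returns ((rows before n, rows after n), row n); all are views of a.
    """
    return (a[:n], a[n + 1:]), a[n]

a = np.arange(20.0).reshape(5, 4)     # illustrative data
(head, tail), test_row = leave_one_out(a, 2)

parts_are_views = np.shares_memory(a, head) and np.shares_memory(a, tail)

# Only when a single array is really needed: make a temporary
# contiguous copy, use it, and let it be garbage-collected afterwards.
train = np.concatenate((head, tail))
```

Keeping N such pairs of views costs almost nothing; only the one
concatenated copy you are currently working on takes real memory.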

Alternatively, if your need is simply to keep the selections around
for later analysis, remember that selection is a fast process, so you
can keep only the selection indices:

b = a[l1, l2]
analyze(b)
keep((l1,l2))

Or even whatever was used to generate them - the number of the omitted
row, or even a seed used to seed a random number generator to select
random rows.
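
As a sketch of the "keep only what generated the selection" approach:
the functions pick and select and the mean() "analysis" below are
placeholders I made up. One thing to watch: indexing with two plain
lists, a[l1, l2], pairs the indices element by element; if you want all
chosen rows crossed with all chosen columns, np.ix_ builds that index.

```python
import numpy as np

def pick(n_total, k, seed):
    """Hypothetical helper: k distinct sorted indices, reproducible from seed."""
    rng = np.random.RandomState(seed)
    return np.sort(rng.permutation(n_total)[:k])

def select(a, row_seed, col_seed, n_rows, n_cols):
    """Rebuild the same rows-by-columns selection from two seeds."""
    rows = pick(a.shape[0], n_rows, row_seed)
    cols = pick(a.shape[1], n_cols, col_seed)
    return a[np.ix_(rows, cols)]      # a copy, made on demand

a = np.arange(60.0).reshape(10, 6)    # illustrative data

b = select(a, row_seed=1, col_seed=2, n_rows=4, n_cols=3)
result = b.mean()                     # placeholder for the real analysis
del b                                 # keep only the seeds and the result

# Later, the identical selection can be regenerated from the seeds alone.
b_again = select(a, row_seed=1, col_seed=2, n_rows=4, n_cols=3)
```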

Good luck,
Anne