[SciPy-User] Scipy views and slicing: Can I get a view-slice from only certain elements of an array?

Fri Oct 29 16:15:45 EDT 2010

Hi!

My question is on slicing and views.  I'd like to be able to create a
view of an array from some subset of indices. This can *almost* be
done using array slices as follows:

scores = scipy.ones((10,1))
subset = scores[:5]      # changes to subset will reflect in scores
                         # (reference to the same data)
subset[0] = 3
subset /= subset.sum()   # renormalize subset, updating scores as well

I can also slice with a step, and the reference ("view") to the
original array is still intact...

scores = scipy.ones((10,1))
subset = scores[:6:2]    # elements 0, 2, 4
subset[0] = 3
subset /= subset.sum()   # both subset and scores are updated, though subset
                         # is not a contiguous slice of scores

What I can't do is create a view with arbitrary indices:

scores = scipy.ones((10,1))
subset = scores[[1,5,7]]   # not a reference!
subset[0] = 3
subset /= subset.sum()  # does not update scores!
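
To make the difference between the three cases concrete, here is a minimal sketch. It assumes NumPy is imported directly as np (the scipy.ones used above was, as far as I know, just a re-export of numpy.ones). np.shares_memory reports whether two arrays overlap in memory, and the last part shows that assigning *through* the index array does write into scores, even though the indexed result itself is a copy. The name idx is only for illustration.

```python
import numpy as np

scores = np.ones((10, 1))

basic = scores[:5]           # plain slice
strided = scores[:6:2]       # slice with a step: elements 0, 2, 4
fancy = scores[[1, 5, 7]]    # integer-list index

# Slices share memory with scores; the integer-list result is a copy.
print(np.shares_memory(scores, basic))
print(np.shares_memory(scores, strided))
print(np.shares_memory(scores, fancy))

# Writing through the index array updates scores in place.
idx = np.array([1, 5, 7])
scores[idx[0]] = 3
scores[idx] /= scores[idx].sum()   # renormalize only elements 1, 5, 7
print(scores.ravel())
```

This is not a view in the strict sense (scores[idx] on the right-hand side still makes a temporary copy), but the assignment form does update the original array.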

Is there a way to do this?  I've also tried:
subset = [scores[1:2], scores[5:6], scores[7:8]]  # these are references, but
                                                  # the container is a list,
                                                  # not an array, and the
                                                  # syntax is annoying
subset = scipy.array([scores[1:2], scores[5:6], scores[7:8]])  # no longer a reference...
subset = scipy.array([scores[1:2], scores[5:6], scores[7:8]],
                     copy=True)  # also not a reference...

Any thoughts?

The data I'm working on is millions of short high-throughput
sequencing reads, each of which may have 2-100+ possible genomic
alignments.  Each alignment falls within a particular genomic bin
(~150 bases) but also has a probability associated with the alignment
(so the sum over all alignments for each read will be 1).  I need to
update all the alignments in a particular bin (from many different
reads) and then (once all bins are updated) renormalize all the
alignments for each read.  My current strategy is to have a single 1D
array with all the probabilities, plus two lists of indices into that
large array: one list stores the indices that fall within each
genomic bin, and the other stores the indices associated with each
read.  This is working fine, but the memory requirements are a bit
high (1.5GB) and it's a bit slow, since there are millions of reads,
meaning lots and lots of slices from that large array.  I wonder if I
could replace the indices in each list with a view of the original
array; it seems that would save me a bit of memory and would make the
slicing faster.
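
For illustration, here is a rough sketch of that kind of layout on toy data. Note that it swaps the two lists of index lists for two integer label arrays (one bin label and one read label per alignment), which is a different technique from the one described above. All names and numbers (probs, read_id, bin_id, the doubling update) are made up for the example, and I have not measured whether this helps with memory or speed on real data.

```python
import numpy as np

# Toy data: 3 reads with 2, 3, and 2 candidate alignments (hypothetical values).
probs = np.array([0.5, 0.5, 0.2, 0.3, 0.5, 0.9, 0.1])
read_id = np.array([0, 0, 1, 1, 1, 2, 2])   # which read owns each alignment
bin_id = np.array([0, 1, 0, 2, 1, 2, 0])    # which genomic bin each alignment hits

# Update every alignment in bin 0, whatever read it belongs to.
# (Doubling is only a placeholder for the real update rule.)
probs[bin_id == 0] *= 2.0

# Renormalize all reads at once: sum the probabilities per read,
# then divide each alignment by its own read's total.
read_totals = np.bincount(read_id, weights=probs)
probs /= read_totals[read_id]

print(np.bincount(read_id, weights=probs))   # per-read sums after renormalizing
```

The per-read renormalization here is one vectorized pass over the array instead of one slice per read.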

Thanks for your help!

--
Jake Biesinger
Graduate Student
Xie Lab, UC Irvine
(949) 231-7587