[Numpy-discussion] In-place fancy selection

Charles R Harris charlesr.harris at gmail.com
Thu Mar 1 15:58:32 EST 2007


On 3/1/07, Francesc Altet <faltet at carabos.com> wrote:
>
> El dj 01 de 03 del 2007 a les 13:26 -0700, en/na Charles R Harris va
> escriure:
> >
> >
> > On 3/1/07, Francesc Altet <faltet at carabos.com> wrote:
> >         Hi,
> >
> >         I don't think there is a solution for this, but perhaps
> >         anybody may
> >         offer some idea. Given:
> >
> >         In [79]:a=numpy.arange(9,-1,-1)
> >         In [80]:b=numpy.arange(10)
> >         In [81]:numpy.random.shuffle(b)
> >         In [82]:b
> >         Out[82]:array([2, 6, 3, 5, 4, 9, 0, 8, 7, 1])
> >         In [83]:a=a[b]
> >         In [84]:a
> >         Out[84]:array([7, 3, 6, 4, 5, 0, 9, 1, 2, 8])
> >
> >         is there a way to make the step 83 without having to keep 3
> >         arrays
> >         in-memory at the same time? This is, some way of doing fancy
> >         indexing,
> >         but changing the elements *inplace*. The idea is to keep
> >         memory
> >         requeriments as low as possible when a and b are large arrays.
> >
> >         Thanks!
> >
> > I think that would be tough because of overlap between the two sides.
> > The permutation could be factored into cycles which would mostly avoid
> > that, but that is more theoretical than practical here. What is it you
> > are trying to do?
>
> Yeah, the problem is the overlap. Well, what I'm trying to do is, given
> two arrays on-disk (say, block and block_idx), sort one of them, and
> then, re-order the other following with the same order than the first
> one. My best approach until now is:
>
>             block = tmp_sorted[nslice]   # read block from disk
>             sblock_idx = block.argsort()
>             block.sort()
>             # do things with block...
>             del block  # get rid of block
>             block_idx = tmp_indices[nslice]  # read bock_idx from disk
>             indices[nslice] = block_idx[sblock_idx]
>
> but the last line will take 3 times the memory that takes block_idx
> alone. My goal would be that the algorithm above would take only 2 times
> the memory of block_idx, but I don't think this is going to be possible.


Argsort sorta does what you want, it is an indirect sort but the other array
is a list of integers and at present, it doesn't exchange the key array. If
you want a special case sort it wouldn't be too difficult to modify it to do
what you want. If you want to operate on small pieces disk to disk, an
indirect version of merge sort might be the way to go.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070301/c7f7f965/attachment.html>


More information about the NumPy-Discussion mailing list