[Numpy-discussion] indexing, searchsorting, ...

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Jan 25 17:47:47 EST 2010


On Mon, Jan 25, 2010 at 5:16 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Mon, Jan 25, 2010 at 1:38 PM, Jan Strube <curiousjan at gmail.com> wrote:
>> Dear List,
>>
>> I'm trying to speed up a piece of code that selects a subsample based on some criteria:
>> Setup:
>> I have two samples, raw and cut. Cut is a pure subset of raw, all elements in cut are also in raw, and cut is derived from raw by applying some cuts.
>> Now I would like to select a random subsample of raw and find out how many are also in cut. In other words, some of those random events pass the cuts, others don't.
>> So in principle I have
>>
>> randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
>> random_that_pass1 = [r for r in raw[randomSample] if r in cut]
>>
>> This is fine (I hope), but slow.
>
> You could construct raw2 and cut2 where each element placed in cut2 is
> removed from raw2:
>
> idx = np.random.rand(len(raw)) > 0.5  # for example; mask must be as long as raw
> raw2 = raw[~idx]
> cut2 = raw[idx]
>
> If you concatenate raw2 and cut2 you get raw (but reordered):
>
> raw3 = np.concatenate((raw2, cut2), axis=0)
>
> Any element in the subsample with an index of len(raw2) or greater is
> in cut. That makes counting fast.
>
> There is a setup cost. So I guess it all depends on how many
> subsamples you need from one cut.
>
> Not sure any of this works, just an idea.

np.in1d or np.intersect1d from arraysetops should also work. Both are
pure Python, but well constructed and tested for performance.
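
A minimal sketch of the membership-test idea, replacing the per-element
"r in cut" loop with one vectorized call. The data here are hypothetical
stand-ins (unique integer event ids), and the sketch assumes a recent NumPy:
it uses np.isin, the newer spelling of the in1d test, and the Generator API
in place of np.random.random_integers. It only covers 1-D arrays of scalars.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in data: raw holds unique event ids, cut is a subset.
raw = np.arange(1000)
cut = raw[raw % 3 == 0]

sample_size = 200
# Random indices into raw, drawn with replacement (upper bound is exclusive).
random_sample = rng.integers(0, len(raw), size=sample_size)
sampled = raw[random_sample]

# Boolean mask: True where the sampled element also appears in cut.
passes = np.isin(sampled, cut)
random_that_pass = sampled[passes]
n_pass = int(passes.sum())
```

Here n_pass counts how many of the sampled events pass the cuts, and
random_that_pass holds those events in the order they were sampled.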

Josef
