Fast lookup of bulky "table"

Edmondo Giovannozzi edmondo.giovannozzi at gmail.com
Tue Jan 17 05:42:02 EST 2023


Il giorno martedì 17 gennaio 2023 alle 00:18:04 UTC+1 Dino ha scritto:
> On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote: 
> > 
> > As a comparison with numpy. Given the following lines: 
> > 
> > import numpy as np 
> > a = np.random.randn(400,100_000) 
> > ia = np.argsort(a[0,:]) 
> > a_elem = a[56, ia[0]] 
> > 
> > I have just taken an element randomly in a numeric table of 400x100000 elements 
> > To find it with numpy: 
> > 
> > %timeit isel = a == a_elem 
> > 35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
> > 
> > And 
> > %timeit a[isel] 
> > 9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
> > 
> > As data are not ordered it is searching it one by one but at C level. 
> > Of course it depends on a lot of thing...
> thank you for this. It's probably my lack of experience with Numpy, 
> but... can you explain what is going on here in more detail? 
> 
> Thank you 
> 
> Dino

Sorry, 
I was just creating an array of 400x100000 elements and filling it with random numbers:

  a = np.random.randn(400,100_000) 

Then I pick one element randomly. It is just a silly sort on one row, after which I take an element from another row, but it doesn't matter: I'm just taking a random element. I could have used other ways to get it, but this was the first that came to mind (a simpler alternative is sketched after the code below).

 ia = np.argsort(a[0,:]) 
 a_elem = a[56, ia[0]] 
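
(As an aside, a simpler way to pick a random element is to draw its indices directly; the following is just a sketch of that alternative, using np.random.randint, and is not what I timed above.)

  import numpy as np

  a = np.random.randn(400, 100_000)
  # draw a random row and column index instead of going through argsort
  i = np.random.randint(0, a.shape[0])
  j = np.random.randint(0, a.shape[1])
  a_elem = a[i, j]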

Then I search for that element in the whole matrix a (of course I know where it is, but I want to test the speed of a linear search done at the C level):

%timeit isel = a == a_elem 

Actually isel is a boolean array that is True where a[i,j] == a_elem and False where a[i,j] != a_elem. It may find more than one element but, of course, in our case it will find only the element that we selected at the beginning. So it gives the speed of a linear search plus the time needed to allocate the boolean array. The search is over the whole matrix of 40 million elements, not just over one of its rows of 100k elements.
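
If you actually need the position of the match rather than the mask itself, the mask can be turned into indices, for example with np.nonzero (just a sketch, reusing the a and a_elem defined above):

  isel = a == a_elem
  # np.nonzero returns the (row, column) indices where the mask is True;
  # assuming the value is unique, this recovers 56 and ia[0].
  rows, cols = np.nonzero(isel)
  print(rows[0], cols[0])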

Searching a single row (which, I should say, I chose to be contiguous) is much faster:

%timeit isel = a[56,:] == a_elem
26 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
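
In case you are not using IPython, the same measurements can be reproduced with the standard timeit module; this is only a rough, self-contained sketch (the numbers above come from %timeit in IPython):

  import timeit
  import numpy as np

  a = np.random.randn(400, 100_000)
  ia = np.argsort(a[0, :])
  a_elem = a[56, ia[0]]

  # linear search over the whole 400x100_000 matrix, at C speed
  t_full = timeit.timeit(lambda: a == a_elem, number=10) / 10
  # linear search over a single contiguous row of 100_000 elements
  t_row = timeit.timeit(lambda: a[56, :] == a_elem, number=1000) / 1000
  print(f"full matrix: {t_full * 1e3:.1f} ms   single row: {t_row * 1e6:.1f} us")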

The matrix holds double precision numbers, i.e. 8 bytes each; I haven't tested it on strings of characters.

This was meant to be an estimate of the speed one can get by going to the C level.
You lose, of course, the possibility to have a relational database, you need to keep everything in memory, etc...

A package that implements tables based on numpy is pandas: https://pandas.pydata.org/
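
For example, a lookup on a pandas DataFrame built on top of a numpy array could look like this (just a sketch with reduced sizes, not necessarily how you would organise your data):

  import numpy as np
  import pandas as pd

  # a DataFrame backed by a numpy array (smaller than above, just for the example)
  df = pd.DataFrame(np.random.randn(400, 1_000))
  a_elem = df.iat[56, 0]

  # same boolean-mask idea as the numpy search above
  rows, cols = np.nonzero((df == a_elem).to_numpy())
  print(rows, cols)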

I hope that it can be useful.



