Fast lookup of bulky "table"

Tue Jan 17 10:39:14 EST 2023

Thanks a lot, Edmondo. Or better... Grazie mille.

On 1/17/2023 5:42 AM, Edmondo Giovannozzi wrote:
> 
> Sorry,
> I was just creating an array of 400x100000 elements that I fill with random numbers:
> 
>    a = np.random.randn(400,100_000)
> 
> Then I pick one element more or less at random: I do a throwaway sort on one row and take the corresponding element from another row. How I pick it doesn't matter, I just want some element. There are other ways to get one, but this was the first that came to mind.
> 
>   ia = np.argsort(a[0,:])
>   a_elem = a[56, ia[0]]
> 
> Then I look for that element in the whole matrix a (of course I know where it is, but I want to test the speed of a linear search done at the C level):
> 
> %timeit isel = a == a_elem
> 
> Actually isel is a boolean array that is True where a[i,j] == a_elem and False where a[i,j] != a_elem. It may find more than one element but, of course, in our case it will find only the element that we selected at the beginning. So the timing gives the speed of a linear search plus the time needed to allocate the boolean array. The search is over the whole matrix of 40 million elements, not just over one of its rows of 100k elements.
> 
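A minimal sketch of the mask-based search described above, for anyone who wants to try it. The array is much smaller than the 400 x 100,000 one in the message so that it runs quickly. The seed, the shape, the row number, and the use of np.argwhere to turn the mask into indices are assumptions added for illustration, not part of the original test.

```python
import numpy as np

# Smaller than the array in the message; shape and seed are arbitrary.
rng = np.random.default_rng(0)
a = rng.standard_normal((40, 10_000))

# Pick an element whose position we know: sort row 0, then take the
# element of row 5 that sits in the column of row 0's smallest value.
ia = np.argsort(a[0, :])
row, col = 5, int(ia[0])
a_elem = a[row, col]

# Element-wise comparison: a full linear scan done in compiled code,
# which also allocates a boolean array with the same shape as a.
isel = a == a_elem

# Turn the boolean mask into (row, column) index pairs.
hits = np.argwhere(isel)
print(hits)
```

The comparison always visits every element, so its cost does not depend on where (or whether) the value occurs.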
> On a single row (which, I should say, I have chosen to be contiguous in memory) it is much faster.
> 
> %timeit isel = a[56,:] == a_elem
> 26 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 
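%timeit is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The array size, the chosen element, and the number of repetitions below are illustrative assumptions, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Illustrative sizes only; the message used a 400 x 100_000 array.
rng = np.random.default_rng(0)
a = rng.standard_normal((40, 10_000))
a_elem = a[5, 123]

n_runs = 20  # arbitrary number of repetitions

# Average time per call for a scan of the whole array.
t_full = timeit.timeit(lambda: a == a_elem, number=n_runs) / n_runs

# Average time per call for a scan of one contiguous row.
t_row = timeit.timeit(lambda: a[5, :] == a_elem, number=n_runs) / n_runs

print(f"full array: {t_full * 1e6:.1f} us per call")
print(f"single row: {t_row * 1e6:.1f} us per call")
```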
> The matrix holds double-precision numbers, that is, 8 bytes each. I haven't tested it on character strings.
> 
> This was meant as an estimate of the speed that one can get by going down to the C level.
> Of course you lose the possibility of having a relational database, you need to have everything in memory, etc.
> 
> A package that implements tables based on numpy is pandas: https://pandas.pydata.org/
> 
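As a hedged illustration of the pandas suggestion, the sketch below builds a small table and looks up one row two ways: with a boolean mask (the same linear scan as above) and through an index. The column names "key" and "value", the table size, and the looked-up key are hypothetical choices, not taken from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical table: an integer key column and a float value column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": np.arange(1_000),
    "value": rng.standard_normal(1_000),
})

# Linear scan with a boolean mask, the same idea as a == a_elem.
scan = df[df["key"] == 417]

# Lookup through an index built on the key column.
indexed = df.set_index("key")
hit = indexed.loc[417]

print(scan)
print(hit["value"])
```

Building the index costs something up front, so it tends to pay off only when the same table is queried many times.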
> I hope that it can be useful.
> 
>