Fast lookup of bulky "table"

Edmondo Giovannozzi edmondo.giovannozzi at gmail.com
Mon Jan 16 13:18:57 EST 2023


On Sunday, January 15, 2023 at 05:26:50 UTC+1, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now 
> that the point is made - I need to make it a little more performant (to 
> be honest, chances are that someone else will pick up from where I left 
> off, and implement the same service from scratch in a different language 
> (GoLang? .Net? Java?) but I am digressing). 
> 
> Anyway, my Flask service initializes by loading a big "table" of 100k 
> rows and 40 columns or so (memory footprint: order of 300 MB) and then 
> accepts queries through a REST endpoint. Columns are strings, enums, and 
> numbers. Once initialized, the table is read only. The endpoint will 
> parse the query and match it against column values (equality, 
> inequality, greater than, etc.). Finally, it will return a (JSON) list of 
> all rows that satisfy all conditions in the query. 
> 
> As you can imagine, this is not very performant in its current form, but 
> performance was not the point of the PoC - at least initially. 
> 
> Before I deliver the PoC to a more experienced software architect who 
> will look at my code, though, I wouldn't mind looking a bit less lame 
> and doing something about performance in my own code first, possibly by 
> bringing the average time for queries down from where it is now (order 
> of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on 
> average. 
> 
> To be honest, I was already able to bring the time down to a handful of 
> microseconds thanks to a rudimentary cache that associates the 
> "signature" of a query with its result and serves that result the next 
> time the same query is received, but this may not be good enough: 1) 
> queries might be many and very different from one another each time, 
> AND 2) I am not sure the server will have a ton of RAM if/when this 
> thing - or whatever is derived from it - is placed into production. 
> 
> How can I make my queries generally more performant, ideally also in 
> the case of a query that has not been seen before? 
> 
> Here's what I have been considering: 
> 
> 1. Making my cache more "modular", i.e. caching the results of certain 
> (wide) queries. When a complex query comes in, I may be able to restrict 
> my search to a subset of the rows (as determined by a previously cached 
> partial query). This should keep the memory footprint under control. 
> 
> 2. Load my data into a numpy.array and use numpy.array operations to 
> slice and dice my data. 
> 
> 3. Load my data into sqlite3 and use SELECT statements to query my 
> table. I have never used sqlite, plus there's some extra complexity, as 
> comparing certain columns requires custom logic, but I wonder if this 
> architecture would also work well when dealing with a 300 MB database. 
> 
> 4. Other ideas? 
> 
> Hopefully I made sense. Thank you for your attention 
> 
> Dino
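
Regarding option 3, for what it's worth, an in-memory sqlite3 table of that size is easy to try; here is a minimal sketch (the table and column names are invented for illustration, and the real rows would of course come from your own data source):

import sqlite3

# Load the rows into an in-memory SQLite table once at startup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("foo", "tools", 9.99), ("bar", "toys", 4.50), ("baz", "tools", 12.00)],
)
# An index on a frequently filtered column helps a lot.
conn.execute("CREATE INDEX idx_category ON items (category)")

# Mixed equality / inequality conditions in a single SELECT:
rows = conn.execute(
    "SELECT * FROM items WHERE category = ? AND price > ?",
    ("tools", 5.0),
).fetchall()

Custom comparison logic for particular columns can be registered with conn.create_function() and then used inside the WHERE clause.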

As a comparison with numpy, given the following lines:

import numpy as np
a = np.random.randn(400, 100_000)   # random numeric table, 400 x 100_000
ia = np.argsort(a[0, :])
a_elem = a[56, ia[0]]               # pick one element to search for

I have just taken an element at random from a numeric table of 400 x 100,000 elements.
To find it with numpy:

%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And then:
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As the data are not ordered, it is searching element by element, but at C level.
Of course it depends on a lot of things... 
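
For the multi-column queries in your original question, the same boolean-mask idea extends naturally; a small sketch along those lines (the shape, column indices and thresholds are made up for illustration):

import numpy as np

# Hypothetical stand-in for the real table: 100_000 rows x 40 numeric columns.
table = np.random.randn(100_000, 40)

# Each comparison yields a boolean mask; combining them with & stays at C speed.
mask = (table[:, 3] > 0.5) & (table[:, 7] <= 1.0) & (table[:, 12] != 0.0)

matching_rows = table[mask]              # all rows satisfying every condition
matching_indices = np.nonzero(mask)[0]   # or just their row indices

String and enum columns would not fit in a single float array, so in practice you would keep one array per column (or use a structured array) and build the per-column masks in the same way.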



