Fast lookup of bulky "table"

Dino dino at no.spam.ar
Sat Jan 14 23:26:27 EST 2023


Hello, I have built a PoC service in Python/Flask for work, and - now 
that the point is made - I need to make it a little more performant. (To 
be honest, chances are that someone else will pick up from where I left 
off and implement the same service from scratch in a different language 
- GoLang? .NET? Java? - but I am digressing.)

Anyway, my Flask service initializes by loading a big "table" of 100k 
rows and 40 columns or so (memory footprint: on the order of 300 MB) and 
then accepts queries through a REST endpoint. Columns are strings, 
enums, and numbers. Once initialized, the table is read-only. The 
endpoint parses the query and matches it against column values 
(equality, inequality, greater than, etc.). Finally, it returns a JSON 
list of all rows that satisfy all conditions in the query.
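
To give a rough idea of the current shape of the endpoint, here is a 
simplified sketch, assuming the query arrives as JSON in a POST; 
load_table() and parse_conditions() are just stand-ins for my actual 
loading/parsing code, and the operator names are made up:

    import operator
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    TABLE = load_table()   # stand-in; list of dicts, one per row, loaded once
    OPS = {"eq": operator.eq, "ne": operator.ne,
           "gt": operator.gt, "lt": operator.lt,
           "ge": operator.ge, "le": operator.le}

    @app.route("/query", methods=["POST"])
    def query():
        # conditions end up as e.g. [("price", "gt", 100), ("status", "eq", "open")]
        conditions = parse_conditions(request.get_json())   # stand-in parser
        matches = [row for row in TABLE
                   if all(OPS[op](row[col], value)
                          for col, op, value in conditions)]
        return jsonify(matches)

In other words, every request is a full linear scan over the 100k rows.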

As you can imagine, this is not very performant in its current form, but 
performance was not the point of the PoC - at least initially.

Before I hand the PoC over to a more experienced software architect who 
will look at my code, though, I wouldn't mind looking a bit less lame 
and doing something about performance in my own code first, possibly by 
bringing the average query time down from where it is now (on the order 
of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds.

To be honest, I was already able to bring the time down to a handful of 
microseconds thanks to a rudimentary cache that associates the 
"signature" of a query with its result and serves that result the next 
time the same query is received, but this may not be good enough: 1) the 
queries might be many and very different from one another each time, AND 
2) I am not sure the server will have a ton of RAM if/when this thing - 
or whatever is derived from it - is placed into production.
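
Conceptually the cache is nothing more than this (sketch; run_query is a 
made-up name for the full scan shown above):

    _cache = {}

    def signature(conditions):
        # canonical, hashable form of a query: a sorted tuple of conditions
        return tuple(sorted(conditions))

    def cached_query(conditions):
        key = signature(conditions)
        if key not in _cache:
            _cache[key] = run_query(conditions)   # run_query = the full scan
        return _cache[key]

(A bounded variant, e.g. functools.lru_cache(maxsize=...) around a 
function that takes the signature tuple, would at least cap the RAM the 
cache can consume, but it would not help with cache misses.)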

How can I make my queries more performant in general, ideally also in 
the case of a query that has not been seen before?

Here's what I have been considering:

1. Making my cache more "modular", i.e. caching the results of certain 
(wide) queries. When a complex query comes in, I may be able to restrict 
my search to a subset of the rows (as determined by a previously cached 
partial query). This should keep the memory footprint under control. 
(See the first sketch after this list.)

2. Loading my data into numpy arrays and using numpy operations to 
slice and dice it. (Second sketch below.)

3. Loading my data into SQLite and using SELECT statements to query my 
table. I have never used SQLite, plus there's some extra complexity 
because comparing certain columns requires custom logic, but I wonder 
whether this architecture would also work well when dealing with a 
300 MB database. (Third sketch below.)

4. Other ideas?
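
To make 1. a bit more concrete, what I have in mind is roughly this: 
cache the set of matching row indices per individual (wide) condition 
and intersect those sets when a multi-condition query comes in. TABLE 
and OPS are the same stand-ins as in the first sketch above; sets of 
integer indices should be much cheaper to keep around than full result 
lists:

    _index_cache = {}

    def rows_matching(cond):
        # cond is a single condition, e.g. ("status", "eq", "open")
        if cond not in _index_cache:
            col, op, value = cond
            _index_cache[cond] = {i for i, row in enumerate(TABLE)
                                  if OPS[op](row[col], value)}
        return _index_cache[cond]

    def query_partial_cache(conditions):
        # assumes at least one condition in the query
        candidates = set.intersection(*(rows_matching(c) for c in conditions))
        return [TABLE[i] for i in sorted(candidates)]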
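
For 2., the idea would be one numpy array per column plus boolean masks, 
along these lines (toy data just to make the sketch runnable; the real 
arrays would be built once at startup):

    import numpy as np

    columns = {                      # one 1-D array per column
        "price":  np.array([10.0, 250.0, 99.5]),
        "status": np.array(["open", "closed", "open"]),
    }
    n_rows = 3

    def query_numpy(conditions):
        mask = np.ones(n_rows, dtype=bool)
        for col, op, value in conditions:
            arr = columns[col]
            if op == "eq":
                mask &= (arr == value)
            elif op == "gt":
                mask &= (arr > value)
            elif op == "lt":
                mask &= (arr < value)
            # ...remaining operators, including the custom comparisons
        return np.nonzero(mask)[0]   # indices of the matching rows

    print(query_numpy([("status", "eq", "open"), ("price", "lt", 50.0)]))  # [0]

Each query then becomes a handful of vectorized passes over 100k-element 
arrays, which I would expect to be sub-millisecond for the simple 
operators; the columns needing custom comparison logic are the ones that 
would not vectorize as easily.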
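
For 3., an in-memory SQLite database would look roughly like this (again 
with toy data), with indexes doing the heavy lifting:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (price REAL, status TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [(10.0, "open"), (250.0, "closed"), (99.5, "open")])
    conn.execute("CREATE INDEX idx_status ON t(status)")

    rows = conn.execute(
        "SELECT rowid, price, status FROM t WHERE status = ? AND price < ?",
        ("open", 50.0)).fetchall()
    print(rows)   # [(1, 10.0, 'open')]

From what I read, a 300 MB database is well within SQLite's comfort 
zone, and the custom comparison logic could be plugged in via 
conn.create_function() and used in the WHERE clause, although such 
comparisons would not benefit from indexes.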

Hopefully I made sense. Thank you for your attention.

Dino
