Fast lookup of bulky "table"

Thomas Passin list1 at tompassin.net
Sun Jan 15 10:38:22 EST 2023


On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
> On 2023-01-14 23:26:27 -0500, Dino wrote:
>> Hello, I have built a PoC service in Python Flask for my work, and - now
>> that the point is made - I need to make it a little more performant (to be
>> honest, chances are that someone else will pick up from where I left off,
>> and implement the same service from scratch in a different language (GoLang?
>> .Net? Java?) but I am digressing).
>>
>> Anyway, my Flask service initializes by loading a big "table" of 100k rows
>> and 40 columns or so (memory footprint: order of 300 Mb)
> 
> 300 MB is large enough that you should at least consider putting that
> into a database (Sqlite is probably simplest. Personally I would go with
> PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
> outlier).
> 
> The main reason for putting it into a database is the ability to use
> indexes, so you don't have to scan all 100 k rows for each query.

I have an (inherited) server program that uses about 30 MB of data in a 
MySQL database and services queries received over the network.  It too 
had performance problems; adding indexes and smarter joins helped, but 
not enough.

I changed the program so that at startup it imports much of the data 
into Python dictionaries that are structured to support the kinds of 
queries that need the help.  Response time to queries dropped 
dramatically.  Some kinds of queries needed still more help, so I built 
auxiliary sets of (usually highly pre-processed) data in ordinary files, 
and those too get imported into dictionaries during startup.
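The startup step looks roughly like this (just a sketch - the table and 
column names are invented, and I use Sqlite here only to keep the 
example short):

import sqlite3
from collections import defaultdict

def load_lookup_tables(db_path="service.db"):
    """Build in-memory dictionaries shaped for the queries we serve."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    by_account = {}                 # exactly one row per account_id
    by_region = defaultdict(list)   # many rows per region
    for row in conn.execute("SELECT * FROM customers"):
        rec = dict(row)
        by_account[rec["account_id"]] = rec
        by_region[rec["region"]].append(rec)
    conn.close()
    return by_account, by_region

# A query handler then does an O(1) dictionary lookup instead of
# scanning the table:
#     record = by_account.get(requested_id)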

Note that these dictionaries do not always match the table structures. 
Some of them change the structure to make queries easier to process. You 
may be able to do that with Python code, or by creating SQL views in the 
database and importing directly from the views (database views take 
almost no database memory).
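For example (again a sketch with invented names), a view can do the 
reshaping inside the database, and the Python side just loads the view 
straight into a dictionary:

import sqlite3

conn = sqlite3.connect("service.db")

# A view that pre-joins and aggregates the data for one kind of query.
# It is just a stored SELECT, so it takes almost no space in the database.
conn.execute("""
    CREATE VIEW IF NOT EXISTS order_totals AS
    SELECT c.account_id, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.account_id = c.account_id
    GROUP BY c.account_id
""")

# Import directly from the view, keyed the way the queries need it.
totals = {account_id: total for account_id, total
          in conn.execute("SELECT account_id, total FROM order_totals")}
conn.close()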

The drawback is that all that data is now stored in memory while the 
program is running.  In my case, many hundreds of MB.  But if that would 
be too much memory for you - you would need to prototype it to know - 
you should let the database engine do the work.  It is more highly 
optimized and efficient for searches than your code could ever be.  But 
there will be a price to pay: denormalizing the database table design. 
That means including redundant data, organized to match the kinds of 
queries that will be made.  No more 3rd normal form!  Your SQL queries 
will need to be written to take advantage of the new structure.  The 
cost is that the database will be larger, and also that the redundancies 
will make it much harder to update the data correctly.  Fortunately you 
do not need to update it during normal operation (my program's data was 
also static, like yours).
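A sketch of what a denormalized, query-shaped table plus a matching 
index might look like (the schema is invented, and again Sqlite is used 
only to keep the example short):

import sqlite3

conn = sqlite3.connect("service.db")

# Denormalized copy of the data, shaped for one query pattern: the
# customer's name and region are duplicated onto every order row so the
# query needs no join at all.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders_wide AS
    SELECT o.order_id, o.amount, c.account_id, c.name, c.region
    FROM orders o
    JOIN customers c ON c.account_id = o.account_id;

    -- Index matched to the query that filters by region and account.
    CREATE INDEX IF NOT EXISTS idx_orders_wide_region_account
        ON orders_wide (region, account_id);
""")
conn.commit()
conn.close()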

PostgreSQL would probably be a better choice than Sqlite for this, since 
it offers more on the server side - for example, foreign keys are always 
enforced (Sqlite enforces them only if you turn them on per connection), 
and you can define functions in the database itself rather than 
registering them from the application.




