Fast lookup of bulky "table"

Dino dino at no.spam.ar
Sun Jan 15 08:27:29 EST 2023


Thank you, Peter. Yes, setting up my own indexes is more or less the 
idea of the modular cache that I was considering. Seeing others think in 
the same direction makes it look more viable.
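
A minimal sketch of the "own indexes" idea, assuming the table is a list of dicts. The column names (`country`, `category`) and the sample rows are made up for illustration and are not the real data. One dict per frequently filtered column maps each value to the set of row positions, and a lookup intersects those sets instead of scanning every row:

```python
from collections import defaultdict

# Hypothetical rows standing in for the real 100k x 40 table.
rows = [
    {"id": 0, "country": "AR", "category": "a"},
    {"id": 1, "country": "US", "category": "b"},
    {"id": 2, "country": "AR", "category": "b"},
    {"id": 3, "country": "AR", "category": "a"},
]

def build_index(rows, column):
    """Map each distinct value of `column` to the set of row positions holding it."""
    index = defaultdict(set)
    for pos, row in enumerate(rows):
        index[row[column]].add(pos)
    return index

# Build the indexes once, at service start-up.
indexes = {col: build_index(rows, col) for col in ("country", "category")}

def lookup(**criteria):
    """Return the rows matching every column=value criterion, via set intersection."""
    sets = [indexes[col].get(value, set()) for col, value in criteria.items()]
    if not sets:
        return list(rows)
    positions = set.intersection(*sets)
    return [rows[pos] for pos in sorted(positions)]

matches = lookup(country="AR", category="a")
```

The trade-off is extra memory for each indexed column, and the indexes have to be rebuilt or updated whenever the table changes.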

About Scalene, thank you for the pointer. I'll do some research.

Do you have any idea how fast a SELECT query would be against a 
100k-row / 300 MB SQLite db?
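
One way to get a first answer is to measure it directly. The sketch below uses only the standard-library `sqlite3` module. It fills an in-memory database with 100k synthetic rows and times the same SELECT before and after creating an index. The schema, the column names and the `time_query` helper are assumptions for illustration. The rows are also far smaller than the real 40-column ones, so the numbers are only a rough indication:

```python
import sqlite3
import time

# In-memory database with a made-up schema; a real test should use the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, country TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO items (id, country, payload) VALUES (?, ?, ?)",
    ((i, f"C{i % 200}", "x" * 100) for i in range(100_000)),
)
conn.commit()

def time_query(conn, sql, params, repeat=20):
    """Return the average wall-clock seconds per execution of the query."""
    start = time.perf_counter()
    for _ in range(repeat):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeat

sql = "SELECT id FROM items WHERE country = ?"

# No index yet: SQLite has to scan all 100k rows.
without_index = time_query(conn, sql, ("C42",))

# With an index on the filtered column, the same query becomes an index search.
conn.execute("CREATE INDEX idx_items_country ON items (country)")
with_index = time_query(conn, sql, ("C42",))

print(f"full scan: {without_index * 1e3:.3f} ms, indexed: {with_index * 1e3:.3f} ms")
```

Prefixing the query with `EXPLAIN QUERY PLAN` shows whether SQLite actually uses the index for a given WHERE clause.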

Dino

On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
> On 2023-01-14 23:26:27 -0500, Dino wrote:
>> Hello, I have built a PoC service in Python Flask for my work, and - now
>> that the point is made - I need to make it a little more performant (to be
>> honest, chances are that someone else will pick up from where I left off,
>> and implement the same service from scratch in a different language (GoLang?
>> .Net? Java?) but I am digressing).
>>
>> Anyway, my Flask service initializes by loading a big "table" of 100k rows
>> and 40 columns or so (memory footprint: on the order of 300 MB)
> 
> 300 MB is large enough that you should at least consider putting that
> into a database (SQLite is probably simplest. Personally I would go with
> PostgreSQL because I'm most familiar with it and SQLite is a bit of an
> outlier).
> 
> The main reason for putting it into a database is the ability to use
> indexes, so you don't have to scan all 100 k rows for each query.
> 
> You may be able to do that for your Python data structures, too: Can you
> set up dicts which map to subsets you need often?
> 
> There are some specialized in-memory bitmap implementations which can be
> used for filtering. I've used
> [Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
> past (mostly in Perl).
> These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
> the most popular. I see several packages on PyPI - but I haven't used
> any of them yet, so no recommendation from me.
> 
> Numpy might also help. You will still have linear scans, but it is more
> compact and many of the searches can probably be done in C and not in
> Python.
> 
>> As you can imagine, this is not very performant in its current form, but
>> performance was not the point of the PoC - at least initially.
> 
> For performance optimization it is very important to actually measure
> performance, and a good profiler helps very much in identifying hot
> spots. Unfortunately until recently Python was a bit deficient in this
> area, but [Scalene](https://pypi.org/project/scalene/) looks promising.
> 
>          hp
> 


