Fast lookup of bulky "table"

Thomas Passin list1 at tompassin.net
Sun Jan 15 14:58:12 EST 2023


On 1/15/2023 2:39 PM, Peter J. Holzer wrote:
> On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
>> On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
>>> On 2023-01-14 23:26:27 -0500, Dino wrote:
>>>> Anyway, my Flask service initializes by loading a big "table" of 100k rows
>>>> and 40 columns or so (memory footprint: order of 300 Mb)
>>>
>>> 300 MB is large enough that you should at least consider putting that
>>> into a database (Sqlite is probably simplest. Personally I would go with
>>> PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
>>> outlier).
>>>
>>> The main reason for putting it into a database is the ability to use
>>> indexes, so you don't have to scan all 100 k rows for each query.
>>
>> I have an (inherited) server program that uses about 30 MB of data in a
>> MySQL database. It services queries received over the network. It too had
>> performance problems, to which adding indexes and smarter joins helped but
>> not enough.
>>
>> I changed the program so that at startup it imports much of the data into
>> Python dictionaries that are structured to support the kinds of queries that
>> need the help.  Response time to queries dropped dramatically.
> 
> This is to be expected: Firstly, because you don't have disk accesses
> any more, secondly because you don't have network latency any more and
> thirdly, because you structured the data to fit the queries.

Of course: that's exactly why I made those changes.  The tradeoff is 
using more memory for your program, sometimes a lot more.
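
To give a sketch of the kind of restructuring I mean (the table, column
names, and values below are made up for illustration, not the actual
server data): instead of letting the database scan or join at query
time, the rows get pre-grouped into dictionaries keyed by the fields
the incoming queries filter on, so each query becomes a single dict
lookup.

from collections import defaultdict

# Hypothetical rows as they might come back from the MySQL table:
# (id, customer, region, amount)
rows = [
    (1, "acme", "east", 120.0),
    (2, "acme", "west", 75.5),
    (3, "zenith", "east", 200.0),
]

# Build indexes shaped for the queries the server actually receives.
by_id = {}                       # exact lookup by primary key
by_customer = defaultdict(list)  # "all rows for customer X"
by_region = defaultdict(list)    # "all rows in region Y"

for row in rows:
    rid, customer, region, amount = row
    by_id[rid] = row
    by_customer[customer].append(row)
    by_region[region].append(row)

# Each query is now a dict access instead of a scan or join.
print(by_customer["acme"])   # both "acme" rows
print(by_id[3])              # (3, 'zenith', 'east', 200.0)

Every extra index is another set of references to the rows, which is
why the memory footprint can grow well past the size of the original
table.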

> The thing to keep in mind is that the relational database model was
> invented to have a uniform and simple way to model all data, and that
> RDBMSs are designed to handle all workloads (from a single tiny table to
> thousands of tables with hundreds of terabytes) reasonably well. For any
> given application you can always find a more efficient solution than
> using an RDBMS. Sometimes it's simple (just load all the data into a
> dict and serve from there), sometimes it's a major research project.
> The nice thing about RDBMSs isn't that they are the optimal solution for
> anything but that they are a "good enough" solution for a large class of
> problems.

Often the solution is careful (and not very normalized) table design 
that supports your queries. In the case I'm discussing, it was easier 
for me to make Python do the work, and I could afford the memory load.  
In other cases, you have to put the work in on the database side.  For 
slow queries, disk latency and I/O are often not the limiting factors, 
but you have to do the testing to make sure.
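
As a made-up illustration of that database-side work (the table and
column names are hypothetical, and whether an index helps depends on
the actual data and query mix), in SQLite it is often just a matter of
adding an index and checking the plan:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
    [("acme", "east", 120.0), ("acme", "west", 75.5),
     ("zenith", "east", 200.0)])

# Without an index this query scans the whole table.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?",
    ("acme",)).fetchall())

# An index on the filtered column turns it into a B-tree lookup.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?",
    ("acme",)).fetchall())

And then measure: only timing the real queries tells you whether the
index, or something else entirely, was the bottleneck.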


