memory consumption

Cameron Simpson cs at cskk.id.au
Tue Mar 30 22:45:07 EDT 2021


Since everyone is talking about vague OS memory use and not at all about 
working set size of Python objects, let me ...

On 29Mar2021 03:12, Alexey <zen.supagood at gmail.com> wrote:
>I'm experiencing problems with memory consumption.
>
>I have a class which is doing ETL job. What`s happening inside:
> - fetching existing objects from DB via SQLAchemy

Do you need to? Or do you only need to fetch their ids? Or do you only 
need to fetch a subset of the objects?

It is easy to accidentally suck in way too many db session entity 
objects, or at any rate, more than you need to.
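
For example, if all you really need later is to know which rows already
exist, you can fetch just the primary keys instead of whole ORM
instances. A rough sketch (Widget is a made-up model name):

    # fetch only the ids, not full ORM objects
    existing_ids = {row.id for row in session.query(Widget.id)}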

> - iterate over raw data

Can you prescan the data to determine which objects you care about, 
reducing the number of objects you need to obtain?

> - create new/update existing objects

Depending on what you're doing, you may not need to "create new/update 
existing objects" at all. You could collate the changes and do an UPSERT (the 
incantation varies a little depending on the SQL dialect behind 
SQLAlchemy).
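
As a sketch of what I mean, assuming PostgreSQL and a made-up Widget
model with columns id, name and value; the rows are plain dicts you
collate while iterating the raw data:

    from sqlalchemy.dialects.postgresql import insert

    def upsert_batch(session, rows):
        # rows: list of dicts like {"id": ..., "name": ..., "value": ...}
        stmt = insert(Widget).values(rows)
        stmt = stmt.on_conflict_do_update(
            index_elements=["id"],
            set_={
                "name": stmt.excluded.name,
                "value": stmt.excluded.value,
            },
        )
        session.execute(stmt)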

> - commit changes

Do you discard the SQLAlchemy session after this? Otherwise it may lurk 
and hold onto the objects. Commit doesn't forget the objects.
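
If you keep a long lived session, something like this after the commit
releases its grip on the objects (a sketch, not your actual code):

    session.commit()
    session.expunge_all()   # detach everything from the identity map
    session.close()         # give back the connection and remaining state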

For my current client we have a script to import historic data from a 
legacy system. It has many of the issues you're dealing with: the naive 
(ORM) way consumes gobs of memory, and can be very slow too (updating 
objects in an ad hoc manner tends to issue individual UPDATE SQL 
commands, which is very latency laden).

I wrote a generic batch UPSERT function which took an accrued list of 
changes and prepared a PostgreSQL INSERT...ON CONFLICT statement. The 
main script hands it the accrued updates and it runs them in batches 
(which lets us do progress reporting). Orders of magnitude faster, _and_ 
it does not 
require storing the db objects.
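
The driving loop is roughly like this, reusing the hypothetical
upsert_batch() from the sketch above; "pending" is the accrued list of
row dicts:

    BATCH_SIZE = 10000

    for start in range(0, len(pending), BATCH_SIZE):
        batch = pending[start:start + BATCH_SIZE]
        upsert_batch(session, batch)
        # progress reporting comes almost for free
        print("upserted", start + len(batch), "of", len(pending), "rows")
    session.commit()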

On the subject of "fetching existing objects from DB via SQLAlchemy": you 
may not need to do that, either. Can you identify _which_ objects are of 
interest? Associated with the same script I've got a batch_select 
function: it takes an iterable of object ids and collects the matching 
objects in batches; before that we were effectively scanning the whole 
db because we had an arbitrary scattering of relevant object ids from 
the raw data.

It basically collects the ids into batches and runs a SELECT...WHERE id 
in (batch-of-ids) for each batch. It's really fast considering, and it 
also scales _way_ down when the set of arbitrary ids is small.
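
The shape of it is roughly this (again with the made-up Widget model):

    def batch_select(session, ids, batch_size=1000):
        # ids: any iterable of primary key values
        ids = list(ids)
        for start in range(0, len(ids), batch_size):
            batch = ids[start:start + batch_size]
            for obj in session.query(Widget).filter(Widget.id.in_(batch)):
                yield obj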

I'm happy to walk through the mechanics of these with you; the code at 
this end uses Django's ORM (the project dictated that), but I'd prefer 
SQLAlchemy anyway.

>Before processing data I create internal cache(dictionary) and store all existing objects in it.
>Every 10000 items I do bulk insert and flush. At the end I run commit command.

Yah. I suspect the session data are not being released. Also, SQLAlchemy 
may be caching sessions or something across runs, since this is a celery 
worker which survives from one task to the next.

You could try explicitly creating a new SQLAlchemy session around your 
task.
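
Something like this, as a sketch; "engine" and do_the_etl() are
stand-ins for your own setup and task body:

    from sqlalchemy.orm import sessionmaker

    Session = sessionmaker(bind=engine)

    def run_task(raw_data):
        session = Session()
        try:
            do_the_etl(session, raw_data)
            session.commit()
        finally:
            # make sure nothing lingers in the worker between tasks
            session.close()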

>Problem. Before executing, my interpreter process weighs ~100Mb, after first run memory increases up to 500Mb
>and after second run it weighs 1Gb. If I will continue to run this class, memory wont increase, so I think
>it's not a memory leak, but rather Python wont release allocated memory back to OS. Maybe I'm wrong.

I don't know enough about when Python releases memory back to the OS. 
But reducing the task's memory footprint will help regardless.

Cheers,
Cameron Simpson <cs at cskk.id.au>
