memory consumption

Sam python at net153.net
Thu Apr 1 11:38:29 EDT 2021


On 3/29/21 5:12 AM, Alexey wrote:
> Hello everyone!
> I'm experiencing problems with memory consumption.
> 
> I have a class which does an ETL job. What's happening inside:
>   - fetching existing objects from the DB via SQLAlchemy
>   - iterating over the raw data
>   - creating new / updating existing objects
>   - committing the changes
> 
> Before processing the data I create an internal cache (a dictionary) and store all existing objects in it.
> Every 10,000 items I do a bulk insert and flush. At the end I run a commit.
> 
> The problem: before executing, my interpreter process weighs ~100 MB; after the first run memory
> increases to about 500 MB, and after the second run it weighs 1 GB. If I keep running this class,
> memory won't increase, so I think it's not a memory leak, but rather that Python won't release
> allocated memory back to the OS. Maybe I'm wrong.
> 
> What I tried after executing:
>   - gc.collect()
>   - created snapshots with tracemalloc and searched for garbage, diff =
>     snapshot_before_run - snapshot_after_run
>   - searched with the "objgraph" library for references to the internal
>     cache (the dictionary containing elements from the DB)
>   - cleared the cache (dictionary)
>   - db.session.expire_all()
> 
> This class runs as a periodic Celery task, so once each worker has executed it at least two times,
> all the Celery workers need 1 GB of RAM each. Before Celery there was a cron script and this class
> was executed via an API call, and the problem was the same. So no matter how I run it, the
> interpreter consumes 1 GB of RAM after two runs.
> 
> I see a few solutions to this problem:
> 1. Execute this class in a separate process. But I got errors when the same SQLAlchemy connection
> was shared between different processes.
> 2. Restart the Celery worker after executing this task by throwing an exception.
> 3. Use a separate queue for such tasks, but then the worker would stay idle most of the time.
> All of this looks like a workaround. Do I have any other options?
> 
> I'm using:
> Python - 3.6.13
> Celery - 4.1.0
> Flask-RESTful - 0.3.6
> Flask-SQLAlchemy - 2.3.2
> 
> Thanks in advance!
> 
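
Side note on the tracemalloc step you mention: the usual way to diff 
two snapshots is Snapshot.compare_to() rather than subtracting them by 
hand, in case that part was getting in the way. A minimal sketch:

    import tracemalloc

    tracemalloc.start()
    snapshot_before = tracemalloc.take_snapshot()

    # ... run the ETL task here ...

    snapshot_after = tracemalloc.take_snapshot()
    top_diffs = snapshot_after.compare_to(snapshot_before, "lineno")
    for stat in top_diffs[:10]:
        print(stat)   # ten biggest allocation deltas by source line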

I had the (mis)pleasure of dealing with a multi-terabyte PostgreSQL 
instance many years ago, and figuring out why random scripts were 
eating up system memory became quite common.

All of our "ETL" scripts were written in Perl, Java, or Python, but the 
results were always the same: if a process grew to using 1 GB of memory 
(as in your case), it never "released" it back to the OS. What this 
basically means is that your script at some point really did use/need 
1 GB of memory. That becomes the "high watermark", and in most cases 
usage will stay at that level. If you think about it, it makes sense. 
Your Python program went through the trouble of requesting memory from 
the OS; handing it straight back would be pointless, because if it 
needed 1 GB in the past it will probably need 1 GB in the future, and 
you would only waste time on extra syscalls. Even the glibc 
documentation notes that calling free() does not necessarily mean the 
memory is returned to the OS.

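If you want to confirm that it really is this high-watermark effect and 
not a live leak, you can compare the process's current RSS against its 
lifetime peak from inside the task. A rough sketch (Linux-specific: it 
reads /proc, and ru_maxrss is in kB on Linux but bytes on macOS):

    import resource

    def memory_report():
        """Print current vs. peak resident memory for this process."""
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        current_kb = 0
        with open("/proc/self/status") as f:   # Linux only
            for line in f:
                if line.startswith("VmRSS:"):
                    current_kb = int(line.split()[1])
                    break
        print("current RSS: %d MB, peak RSS: %d MB"
              % (current_kb // 1024, peak_kb // 1024))

If the current RSS stays pinned near the peak after your runs but stops 
growing, that is the high watermark at work, not a leak.
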
There are basically two things you can try. First, work in smaller 
batches: 10,000 is a lot, try a few hundred (see the sketch just below).
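
Roughly what I mean, assuming a Flask-SQLAlchemy db.session and a 
made-up Item model (both names are hypothetical, the pattern is the 
point): flush in small batches and clear the session as you go, so the 
identity map and your cache never hold the whole dataset at once.

    BATCH_SIZE = 100   # much smaller than 10,000

    def load_rows(rows):
        batch = []
        for raw in rows:
            batch.append(Item(**raw))          # hypothetical model
            if len(batch) >= BATCH_SIZE:
                db.session.bulk_save_objects(batch)
                db.session.flush()
                batch.clear()
                # Drop what the session is tracking so the identity
                # map does not grow with the whole dataset.
                db.session.expunge_all()
        if batch:
            db.session.bulk_save_objects(batch)
        db.session.commit()
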
Second, as you hinted, try moving the work to a separate process. The 
simple way to do this is to move away from modules that use threads and 
instead run each task in a short-lived child process created with 
fork(); when the child exits, all of its memory goes back to the OS.
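
A rough sketch of that, assuming your ETL is callable as run_etl() (a 
hypothetical name for your existing class's entry point) and the 
default fork start method on Linux. The child must not reuse database 
connections inherited from the parent -- that is almost certainly where 
your shared-connection errors came from -- so dispose of the engine's 
pool before touching the database:

    import multiprocessing as mp

    def _etl_child():
        # Connections inherited across fork() must not be reused; make
        # the child start its connection pool from scratch.
        db.engine.dispose()
        run_etl()          # hypothetical: your existing ETL entry point

    def run_etl_in_child():
        """Run the ETL in a short-lived child so its memory is returned
        to the OS when the child exits."""
        proc = mp.Process(target=_etl_child)
        proc.start()
        proc.join()
        if proc.exitcode != 0:
            raise RuntimeError("ETL child exited with code %d"
                               % proc.exitcode)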

Regards,




