Pickle caching objects?

José María Mateos chema at rinzewind.org
Sat Nov 30 17:05:45 EST 2019


Hi,

I just asked this question on the IRC channel but didn't manage to get a 
full answer, though some people replied with suggestions that broadened 
the question a bit.

I have a program in which a function reads some pickle files, performs 
some operations on them, and then returns. The pickle objects I am 
reading all have the same structure: a single list with two elements, 
where the first is a long list and the second is a numpy object.
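Just to make the structure concrete, each file could have been produced 
with something along these lines (the names, sizes and path here are made 
up for illustration; only the two-element structure matches the real files):

---
import pickle
import numpy as np

# Illustrative only: a long Python list plus a numpy array, wrapped in a
# single two-element list, which is the structure described above.
long_list = list(range(1000000))
array = np.random.rand(1000, 1000)

with open('/tmp/test/example.pk', 'wb') as f:
    pickle.dump([long_list, array], f)
---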

I found out that, after calling that function, the memory taken by the 
Python process (monitored using htop -- the whole thing runs on Python 
3.6 on Ubuntu 16.04, a pretty standard conda installation with a few 
packages added via `conda install`) increases in proportion to the size 
of the pickle object being read. My intuition is that this memory should 
be freed once the function returns.
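In isolation, the effect looks roughly like this (a minimal sketch; the 
path is hypothetical and I am only describing the pattern I see, not 
exact numbers):

---
import os
import pickle
import psutil

proc = psutil.Process(os.getpid())

def load_and_discard(filepath):
    # Load the pickle and drop the result immediately.
    with open(filepath, 'rb') as f:
        pickle.load(f)

print("RSS before:", proc.memory_info().rss)
load_and_discard('/tmp/test/example.pk')  # hypothetical path
print("RSS after:", proc.memory_info().rss)
# The "after" figure grows roughly with the size of the pickle file and
# does not come back down, even though the loaded object is no longer
# referenced anywhere.
---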

Does pickle keep a cache of objects in memory after they have been 
returned? I thought that could be the answer, but then someone suggested 
measuring the time it takes to load the objects. This is the script I 
wrote to test that; nothing(filepath) just loads the pickle file, does 
nothing with the result, and returns how long the load took.

---
import glob
import pickle
import timeit
import os
import psutil

def nothing(filepath):
    """Load a pickle file, discard the result, and return the elapsed time."""
    start = timeit.default_timer()
    with open(filepath, 'rb') as f:
        _ = pickle.load(f)
    return timeit.default_timer() - start

if __name__ == "__main__":

    filelist = glob.glob('/tmp/test/*.pk')

    for i, filepath in enumerate(filelist):
        print("Size of file {}: {}".format(i, os.path.getsize(filepath)))
        # Load each file twice: if the second call were much faster than the
        # first, that would point to some caching of the unpickled objects.
        print("First call:", nothing(filepath))
        print("Second call:", nothing(filepath))
        # Resident set size of this process after both loads.
        print("Memory usage:", psutil.Process(os.getpid()).memory_info().rss)
        print()
---

This is the output from the second time the script was run, to avoid any 
effects of potential I/O caching:

---
Size of file 0: 11280531
First call: 0.1466723980847746
Second call: 0.10044755204580724
Memory usage: 49418240

Size of file 1: 8955825
First call: 0.07904054620303214
Second call: 0.07996074995025992
Memory usage: 49831936

Size of file 2: 43727266
First call: 0.37741047400049865
Second call: 0.38176894187927246
Memory usage: 49758208

Size of file 3: 31122090
First call: 0.271301960805431
Second call: 0.27462846506386995
Memory usage: 49991680

Size of file 4: 634456686
First call: 5.526095286011696
Second call: 5.558765463065356
Memory usage: 539324416

Size of file 5: 3349952658
First call: 29.50982437795028
Second call: 29.461691531119868
Memory usage: 3443597312

Size of file 6: 9384929
First call: 0.0826977719552815
Second call: 0.08362263604067266
Memory usage: 3443597312

Size of file 7: 422137
First call: 0.0057482069823890924
Second call: 0.005949910031631589
Memory usage: 3443597312

Size of file 8: 409458799
First call: 3.562588643981144
Second call: 3.6001368327997625
Memory usage: 3441451008

Size of file 9: 44843816
First call: 0.39132978999987245
Second call: 0.398518088972196
Memory usage: 3441451008
---

Notice that memory usage increases noticeably, especially after files 4 
and 5, the biggest ones, and doesn't come down as I would expect it to. 
But the load times are the same on the first and second calls for every 
file, so I think I can rule out any pickle caching mechanism.
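As an extra sanity check, one can keep a weak reference to part of the 
unpickled data and confirm it really gets collected (sketch only: the 
path is hypothetical, and it assumes the second element is a numpy array, 
which supports weak references, whereas plain lists do not):

---
import gc
import pickle
import weakref

with open('/tmp/test/example.pk', 'rb') as f:  # hypothetical path
    data = pickle.load(f)

# Weak reference to the numpy array (the second element); the top-level
# list itself cannot be weak-referenced.
probe = weakref.ref(data[1])

del data
gc.collect()

# None here means the array was garbage collected, i.e. pickle is not
# holding on to it even though the RSS reported by psutil/htop stays high.
print(probe())
---

If the weak reference comes back dead, the objects themselves are gone, 
so whatever keeps the RSS high would have to sit below Python's object 
layer.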

So I guess my question now is: can anyone give me any pointers as to why 
this is happening? Any help is appreciated.

Thanks,

-- 
José María (Chema) Mateos || https://rinzewind.org/

