cPickle.load vs. file.read+cPickle.loads on large binary files

andrea.gavana at gmail.com
Tue Nov 17 09:20:03 EST 2015


Hi Peter,

On Tuesday, November 17, 2015 at 3:14:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
> 
> > Hello List,
> > 
> >      I am working with relatively humongous binary files (created via
> >      cPickle), and I stumbled across some unexpected (for me) performance
> >      differences between two approaches I use to load those files:
> > 
> > 1. Simply use cPickle.load(fid)
> > 
> > 2. Read the file as binary using file.read() and then use cPickle.loads on
> > the resulting output
> > 
> > In the snippet below, the MakePickle function is a dummy function that
> > generates a relatively big binary file with cPickle (WARNING: around 3 GB)
> > in the current directory. I am using NumPy arrays to make the file big but
> > my original data structure is much more complicated, and things like HDF5
> > or databases are currently not an option - I'd like to stay with pickles.
> > 
> > The ReadPickle function simply uses cPickle.load(fid) on the opened binary
> > file, and on my PC it takes about 2.3 seconds (approach 1).
> > 
> > The ReadPlusLoads function reads the file using file.read() and then uses
> > cPickle.loads on the resulting output (approach 2). On my PC, the
> > file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5
> > seconds.
> > 
> > What baffles me is the time it takes to read the file using file.read():
> > is there any way to slurp it all in one go (somehow) into a string ready
> > for cPickle.loads without that much of an overhead?
> > 
> > Note that all of this has been done on Windows 7 64bit with Python 2.7
> > 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
> > 
> > Thank you in advance for all suggestions :-) .
> > 
> > Andrea.
> > 
> > if __name__ == '__main__':
> >     ReadPickle()
> >     ReadPlusLoads()
> 
> Do you get roughly the same times when you execute ReadPlusLoads() before 
> ReadPickle()?


Thank you for your answer. I do get similar timings when I swap the two functions: file.read() still takes about 15 seconds, and cPickle.load(fid) takes 2.4 seconds (more or less as before).

I thought the order of operations might be the issue, but apparently it is not.
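
For anyone who wants to repeat the comparison, here is a minimal sketch of the two approaches with the order swapped (read-then-loads first, then load). It is not the original snippet: the function names make_pickle, read_pickle, read_plus_loads and compare are hypothetical stand-ins for MakePickle, ReadPickle and ReadPlusLoads, the test data is a small list instead of the ~3 GB of NumPy arrays, and the import falls back to pickle where cPickle is not available. The timings it prints on a small file will not reproduce the 15 second figure.

```python
import os
import tempfile
import time

try:
    import cPickle as pickle  # Python 2, as used in this thread
except ImportError:
    import pickle  # fallback (assumption: running on Python 3)


def make_pickle(path, n_items=100000):
    """Write a small stand-in pickle (hypothetical; the real file is ~3 GB)."""
    data = {"values": list(range(n_items))}
    with open(path, "wb") as fid:
        pickle.dump(data, fid, pickle.HIGHEST_PROTOCOL)
    return data


def read_pickle(path):
    """Approach 1: unpickle directly from the open file object."""
    start = time.time()
    with open(path, "rb") as fid:
        obj = pickle.load(fid)
    return obj, time.time() - start


def read_plus_loads(path):
    """Approach 2: read the whole file into memory, then unpickle the bytes."""
    start = time.time()
    with open(path, "rb") as fid:
        raw = fid.read()
    read_time = time.time() - start

    start = time.time()
    obj = pickle.loads(raw)
    loads_time = time.time() - start
    return obj, read_time, loads_time


def compare(path):
    """Run approach 2 before approach 1, i.e. the swapped order."""
    obj2, t_read, t_loads = read_plus_loads(path)
    obj1, t_load = read_pickle(path)
    return obj1, obj2, t_load, t_read, t_loads


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sketch.pkl")
    make_pickle(path)
    _, _, t_load, t_read, t_loads = compare(path)
    print("pickle.load:  %.3f s" % t_load)
    print("file.read:    %.3f s" % t_read)
    print("pickle.loads: %.3f s" % t_loads)
```

Timing file.read() separately from the loads call is what lets the read cost be told apart from the unpickling cost.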

Andrea.



