cPickle.load vs. file.read+cPickle.loads on large binary files
andrea.gavana at gmail.com
Tue Nov 17 08:26:49 EST 2015
Hello List,
I am working with relatively humongous binary files (created via cPickle), and I stumbled across some unexpected (for me) performance differences between two approaches I use to load those files:
1. Simply use cPickle.load(fid)
2. Read the file as binary using file.read() and then use cPickle.loads on the resulting output
In the snippet below, the MakePickle function is a dummy function that generates a relatively big binary file with cPickle (WARNING: around 3 GB) in the current directory. I am using NumPy arrays to make the file big but my original data structure is much more complicated, and things like HDF5 or databases are currently not an option - I'd like to stay with pickles.
The ReadPickle function simply uses cPickle.load(fid) on the opened binary file, and on my PC it takes about 2.3 seconds (approach 1).
The ReadPlusLoads function reads the file using file.read() and then uses cPickle.loads on the resulting string (approach 2). On my PC, the file.read() step takes 15 seconds (!!!) and cPickle.loads only 1.5 seconds.
What baffles me is the time it takes to read the file using file.read(): is there any way to slurp it all in one go (somehow) into a string ready for cPickle.loads without that much of an overhead?
Note that all of this has been done on Windows 7 64bit with Python 2.7 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
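One variation that could be timed against a single file.read() is to read the file in fixed-size chunks and join them into one string for cPickle.loads. The sketch below is only an illustration of that idea: the helper name read_in_chunks and the 64 MB default chunk size are hypothetical choices, not something measured on the 3 GB file, and it is not established here that this is any faster than a single read. The pickle fallback import is there only so the snippet also runs on Python 3.

```python
from __future__ import print_function

import os
import tempfile
import time

try:
    import cPickle as pickle  # Python 2, as used in this post
except ImportError:
    import pickle  # Python 3 fallback, only so the sketch runs anywhere


def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Return the whole file as one byte string, read chunk_size bytes at a time.

    chunk_size is an arbitrary (hypothetical) default, not a tuned value.
    """
    chunks = []
    with open(path, 'rb') as fid:
        while True:
            chunk = fid.read(chunk_size)
            if not chunk:  # empty string means end of file
                break
            chunks.append(chunk)
    return b''.join(chunks)


def demo():
    # Small stand-in for the 3 GB pickle, so the sketch is cheap to run.
    payload = [('dummy_%d' % i, list(range(100))) for i in range(50)]
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    try:
        with open(path, 'wb') as fid:
            pickle.dump(payload, fid, pickle.HIGHEST_PROTOCOL)

        start = time.time()
        raw = read_in_chunks(path, chunk_size=4096)
        print('chunked read time:', time.time() - start)

        start = time.time()
        out = pickle.loads(raw)
        print('loads time:', time.time() - start)
        return out == payload
    finally:
        os.remove(path)


if __name__ == '__main__':
    demo()
```

Replacing the fid.read() call in ReadPlusLoads with read_in_chunks('dummy.pkl') would give a direct comparison on the real file; varying chunk_size would show whether the size of each read request matters.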
Thank you in advance for all suggestions :-) .
Andrea.
# Begin code
import os, sys
import time
import cPickle
import numpy


class Dummy(object):
    def __init__(self, name):
        self.name = name
        # 200*600*10 float64 values, about 9.6 MB per object
        self.data = numpy.random.rand(200, 600, 10)


def MakePickle():
    # Writes dummy.pkl (WARNING: around 3 GB) in the current directory
    num_objects = 300
    list_of_objects = []
    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d'%index)
        list_of_objects.append(dummy)

    fid = open('dummy.pkl', 'wb')
    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start
    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()


def ReadPickle():
    # Approach 1: let cPickle read directly from the open file
    fid = open('dummy.pkl', 'rb')
    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()


def ReadPlusLoads():
    # Approach 2: read the whole file into a string, then unpickle it
    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start
    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start


if __name__ == '__main__':
    # Create the test file on the first run only
    if not os.path.exists('dummy.pkl'):
        MakePickle()
    ReadPickle()
    ReadPlusLoads()
# End code