cPickle.load vs. file.read+cPickle.loads on large binary files

andrea.gavana at gmail.com
Tue Nov 17 08:26:49 EST 2015


Hello List,

I am working with very large binary files (created via cPickle), and I stumbled across some performance differences, unexpected for me, between the two approaches I use to load those files:

1. Simply use cPickle.load(fid)

2. Read the file as binary using file.read() and then use cPickle.loads on the resulting output

In the snippet below, the MakePickle function is a dummy function that generates a relatively big binary file with cPickle (WARNING: around 3 GB) in the current directory. I am using NumPy arrays to make the file big but my original data structure is much more complicated, and things like HDF5 or databases are currently not an option - I'd like to stay with pickles.

The ReadPickle function simply uses cPickle.load(fid) on the opened binary file, and on my PC it takes about 2.3 seconds (approach 1).

The ReadPlusLoads function reads the file using file.read() and then uses cPickle.loads on the resulting string (approach 2). On my PC, the file.read() call takes 15 seconds (!!!) and cPickle.loads only 1.5 seconds.

What baffles me is the time it takes to read the file using file.read(): is there any way to slurp it all in one go (somehow) into a string ready for cPickle.loads without that much overhead?
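
One variation that could be timed against a single file.read() is reading the file in fixed-size pieces and joining them into one string at the end. The sketch below is only an illustration: the helper name read_in_chunks and the 64 MB chunk size are assumptions, and no timing is claimed for it.

```python
def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Return the whole content of `path` as one byte string.

    The file is read piecewise (chunk_size bytes at a time) and the
    pieces are joined once at the end. Both the helper name and the
    default chunk size are hypothetical choices for this sketch.
    """
    chunks = []
    with open(path, 'rb') as fid:
        while True:
            chunk = fid.read(chunk_size)
            if not chunk:
                # An empty read means end of file.
                break
            chunks.append(chunk)
    return b''.join(chunks)
```

In ReadPlusLoads this would stand in for the fid.read() call, i.e. strs = read_in_chunks('dummy.pkl'), followed by cPickle.loads(strs) as before. Whether it helps on a given platform is something only a measurement can tell.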

Note that all of this has been done on Windows 7 64bit with Python 2.7 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).

Thank you in advance for any suggestions :-)

Andrea.


# Begin code

import os
import time
import cPickle
import numpy


class Dummy(object):

    def __init__(self, name):

        self.name = name
        self.data = numpy.random.rand(200, 600, 10)


def MakePickle():

    num_objects = 300
    list_of_objects = []

    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d'%index)
        list_of_objects.append(dummy)

    fid = open('dummy.pkl', 'wb')

    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start
    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()


def ReadPickle():

    fid = open('dummy.pkl', 'rb')

    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()


def ReadPlusLoads():

    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start
    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start


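if __name__ == '__main__':
    # Create the (roughly 3 GB) test file on the first run only.
    if not os.path.exists('dummy.pkl'):
        MakePickle()
    ReadPickle()
    ReadPlusLoads()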

# End code
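
For anyone who wants to try the comparison without writing a 3 GB file, a scaled-down version might look like the sketch below. It is an illustration and not part of the measurements above: the names compare_load_paths and run_small_demo are made up, plain Python lists stand in for the NumPy arrays, and the try/except import is only there so the sketch also runs where cPickle is unavailable (Python 3). Timings from such a small file say nothing about the 3 GB case.

```python
import os
import tempfile
import time

try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3 fallback (assumption for portability)


def compare_load_paths(payload, path):
    """Pickle `payload` to `path`, then load it back with both approaches.

    Returns (result of load, result of read+loads, timings), where
    timings is a tuple (load_time, read_time, loads_time) in seconds.
    """
    with open(path, 'wb') as fid:
        pickle.dump(payload, fid, pickle.HIGHEST_PROTOCOL)

    # Approach 1: let the pickle module read from the open file.
    start = time.time()
    with open(path, 'rb') as fid:
        via_load = pickle.load(fid)
    load_time = time.time() - start

    # Approach 2: read the whole file first, then unpickle the string.
    start = time.time()
    with open(path, 'rb') as fid:
        raw = fid.read()
    read_time = time.time() - start

    start = time.time()
    via_loads = pickle.loads(raw)
    loads_time = time.time() - start

    return via_load, via_loads, (load_time, read_time, loads_time)


def run_small_demo():
    """Run the comparison on a small, throwaway payload in a temp file."""
    payload = [{'name': 'dummy_%d' % i,
                'data': [float(j) for j in range(1000)]}
               for i in range(100)]
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    try:
        return compare_load_paths(payload, path)
    finally:
        os.remove(path)
```

Calling run_small_demo() returns both loaded objects plus the three timings, so the two approaches can be checked for equal results before scaling the payload up.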


