how to optimize object creation/reading from file?

perfreem at gmail.com perfreem at gmail.com
Wed Jan 28 09:28:53 EST 2009


hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        # the parentheses already continue the expression across lines
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)
    __repr__ = __str__

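from collections import defaultdict
import time

n = 15000000
table = defaultdict(int)
t1 = time.time()
# build n - 1 instances and use each one as a dictionary key
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
# elapsed wall-clock time in minutes
print("time: %s" % ((t2 - t1) / 60.0))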

this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from a file here, but in my real code
i do. also, i do 'a' + str(k), but in my real code this is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach, which is even slower.

in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary dicts, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to
me that making them into simple objects while at it would take that
much more...
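
for anyone who wants to reproduce this without waiting 11 minutes, here is a scaled-down sketch of the same loop. it is only an illustration: the names MyClass, fill_with_instances, fill_with_tuples and timed, the smaller n, and the comparison against plain tuple keys are assumptions added for the example, not part of my real code. it times filling the table with instances as keys versus filling it with the equivalent tuples, which might help show how much of the cost comes from the class itself.

```python
import time
from collections import defaultdict


class MyClass(object):
    """Same shape as myclass above: four slots, hashed and compared by value."""
    __slots__ = ("a", "b", "c", "d")

    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d

    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))

    def __eq__(self, other):
        return ((self.a, self.b, self.c, self.d) ==
                (other.a, other.b, other.c, other.d))


def fill_with_instances(n):
    """Insert n - 1 MyClass instances as keys, as in the original loop."""
    table = defaultdict(int)
    for k in range(1, n):
        table[MyClass('a' + str(k), 'b', 'c', 'd')] = 1
    return table


def fill_with_tuples(n):
    """Insert the same n - 1 keys, but as plain tuples (hypothetical alternative)."""
    table = defaultdict(int)
    for k in range(1, n):
        table[('a' + str(k), 'b', 'c', 'd')] = 1
    return table


def timed(fill, n):
    """Run fill(n) and return (table, elapsed seconds)."""
    start = time.time()
    table = fill(n)
    return table, time.time() - start


if __name__ == "__main__":
    n = 100000  # assumed size, much smaller than the 15 million in the post
    for fill in (fill_with_instances, fill_with_tuples):
        table, seconds = timed(fill, n)
        print("%s: %d keys in %.3f s" % (fill.__name__, len(table), seconds))
```

the absolute numbers will depend on the machine and the python version, so the sketch only gives a relative comparison between the two kinds of keys.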


