how to optimize object creation/reading from file?

Bruno Desthuilliers bruno.42.desthuilliers at websiteburo.invalid
Wed Jan 28 10:06:13 EST 2009


perfreem at gmail.com wrote:
> hi,
> 
> I am doing a series of very simple string operations on lines I am
> reading from a large file (~15 million lines). I store the result of
> these operations in a simple instance of a class, and then put it
> inside a hash table. I found that this is unusually slow... for
> example:
> 
> class myclass(object):
>     __slots__ = ("a", "b", "c", "d")
>     def __init__(self, a, b, c, d):
>         self.a = a
>         self.b = b
>         self.c = c
>         self.d = d
>     def __str__(self):
>         return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
>     def __hash__(self):
>         return hash((self.a, self.b, self.c, self.d))
>     def __eq__(self, other):
>         return (self.a == other.a and
>                 self.b == other.b and
>                 self.c == other.c and
>                 self.d == other.d)
>     __repr__ = __str__


If your class really looks like that, a tuple would be enough.
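
A quick sketch of what I mean (the values here are made up, but any
hashable fields behave the same way):

    # a tuple already hashes and compares element-wise, so it gives
    # you __hash__, __eq__ and a readable representation for free
    key = ('a1', 'b', 'c', 'd')
    print hash(key) == hash(('a1', 'b', 'c', 'd'))  # True
    print key == ('a1', 'b', 'c', 'd')              # True
    print "%s_%s_%s_%s" % key                       # a1_b_c_d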

> from collections import defaultdict
> import time
>
> n = 15000000
> table = defaultdict(int)
> t1 = time.time()
> for k in range(1, n):

Hint: use xrange instead - range() builds the whole 15-million-element 
list in memory before the loop even starts, while xrange yields the 
numbers lazily.
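
For illustration:

    # xrange produces values one at a time instead of
    # materializing a full list up front like range does
    total = 0
    for k in xrange(1, 10):
        total += k
    print total  # 45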

>     myobj = myclass('a' + str(k), 'b', 'c', 'd')
>     table[myobj] = 1

Hint: if all you want is to ensure uniqueness, use a set instead.
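
Something like this (with a smaller n, just for illustration):

    n = 1000  # the original code uses 15000000
    seen = set()
    for k in xrange(1, n):
        seen.add(('a' + str(k), 'b', 'c', 'd'))
    # a membership test replaces table[myobj] == 1
    print ('a1', 'b', 'c', 'd') in seen  # True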

> t2 = time.time()
> print "time: ", float((t2-t1)/60.0)

Hint: use the timeit module instead.
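
For instance (the statement under test here is just an example):

    import timeit
    # timeit picks an appropriate clock and repeats the
    # measurement for you
    t = timeit.Timer("d[('a', 'b', 'c', 'd')] = 1", setup="d = {}")
    print min(t.repeat(repeat=3, number=1000000)), "s per million runs"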

> This takes a very long time to run: 11 minutes! For the sake of the
> example I am not reading anything from a file here, but in my real
> code I do. Also, I do 'a' + str(k), but in my real code this is some
> simple string operation on the line I read from the file. However, I
> found that the above code shows the real bottleneck, since reading my
> file into memory (using readlines()) takes only about 4 seconds. I
> then have to iterate over these lines, but I still think that is more
> efficient than the 'for line in file' approach, which is even slower.

Iterating over the file, while indeed a bit slower on a per-line basis, 
avoids useless memory consumption which can lead to disk swapping - so 
for "huge" files, it might still be better with respect to overall 
performance.

> In the above code, is there a way to optimize the creation of the
> class instances? I am using defaultdicts instead of ordinary dicts,
> so I don't know how else to optimize that part of the code. Is there
> a way to perhaps optimize the way the class is written? If it takes
> only 3 seconds to read 15 million lines into memory, it doesn't make
> sense to me that making them into simple objects along the way would
> take that much more...

Did you benchmark the creation of a 15,000,000-int list? ;-)

But anyway, creating 15,000,000 instances (which is not a small number) 
of your class takes many seconds - 23.466073989868164 seconds on my 
(already heavily loaded) machine. Building the same number of tuples 
only takes about 2.5 seconds - that is, almost 10 times less. FWIW, 
tuples have all the useful characteristics of your class above with 
respect to hashing and comparison.
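
A rough way to reproduce the comparison (I use time.time() here only 
to mirror the original post - timeit would be cleaner - and timings 
will of course differ from machine to machine):

    import time

    # stripped-down version of the class from the post,
    # enough for timing instance creation
    class myclass(object):
        __slots__ = ("a", "b", "c", "d")
        def __init__(self, a, b, c, d):
            self.a, self.b, self.c, self.d = a, b, c, d

    n = 15000000

    t1 = time.time()
    for i in xrange(n):
        obj = myclass(i, 'b', 'c', 'd')
    print "instances:", time.time() - t1, "seconds"

    t1 = time.time()
    for i in xrange(n):
        tup = (i, 'b', 'c', 'd')
    print "tuples:   ", time.time() - t1, "seconds"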

My 2 cents...


