optimizing large dictionaries

Jervis Whitley jervisau at gmail.com
Thu Jan 15 17:18:51 EST 2009


On Fri, Jan 16, 2009 at 8:39 AM, Per Freem <perfreem at yahoo.com> wrote:

> hello
>
> i have an optimization question about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions of elements...
>
>
> for line in file:
>     try:
>         elt = MyClass(line)  # extract elt from line...
>         my_dict[elt] += 1
>     except KeyError:
>         my_dict[elt] = 1
>
>
> class MyClass:
>
>     def __str__(self):
>         return "%s-%s-%s" % (self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
>     def __eq__(self, other):
>         # needed alongside __hash__, or dict lookups fall back to
>         # identity comparison and every instance becomes a new key
>         return str(self) == str(other)
>
>
> is there anything that can be done to speed up this simple code? right
> now it is taking well over 15 minutes to process on a 3 GHz machine
> with lots of RAM (though this is all taking CPU power, not RAM, at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Hello,
You can get a large speedup by removing the need to instantiate a new
MyClass instance on each iteration of your loop.
Instead, define one MyClass with an 'interpret' method that is called
in place of MyClass(); interpret would return the string
'%s-%s-%s' % (self.field1, etc.)

i.e

myclass = MyClass()
interpret = myclass.interpret

for line in file:
    elt = interpret(line)  # extract elt from line...
    try:
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

The speedup is on the order of 10x on my machine.
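For illustration, here is a minimal, self-contained sketch of the approach
above. The line format (three comma-separated fields) and the body of
interpret are assumptions, since the original post does not show how a line
is parsed:

```python
import collections


class MyClass:
    def interpret(self, line):
        # Assumed line format: "field1,field2,field3".
        # Return the key string directly rather than building a
        # new object per line.
        field1, field2, field3 = line.strip().split(",")
        return "%s-%s-%s" % (field1, field2, field3)


myclass = MyClass()
interpret = myclass.interpret

lines = ["a,b,c", "a,b,c", "x,y,z"]  # stand-in for the file

my_dict = {}
for line in lines:
    elt = interpret(line)
    try:
        my_dict[elt] += 1
    except KeyError:
        my_dict[elt] = 1

# Same counting without the try/except, using collections.defaultdict:
counts = collections.defaultdict(int)
for line in lines:
    counts[interpret(line)] += 1
```

With the sample lines above, both my_dict and counts end up as
{'a-b-c': 2, 'x-y-z': 1}. defaultdict also avoids the exception overhead
on first-seen keys, which can matter at tens of millions of elements.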


Cheers,