GC is very expensive: am I doing something wrong?

Fri Mar 19 00:44:19 EDT 2010

On Mar 18, 7:13 pm, Weeble <clockworksa... at gmail.com> wrote:
> I am loading a dictionary from a text file and constructing a trie
> data structure in memory. However, it takes longer than I'm happy with
> - about 12 seconds on my computer. I profiled it, came up with some
> clever ideas to cut down on the work (such as by exploiting the fact
> that the dictionary is sorted) and was only able to shave a small
> fraction of the time off. However, then I tried calling gc.disable()
> before loading the trie and it halved the running time! I was
> surprised. Is that normal? I thought that the cost of garbage
> collection would be in some way proportional to the amount of garbage
> created, but I didn't think I was creating any: as far as I can tell
> the only objects deallocated during the load are strings, which could
> not be participating in cycles.

Well, you are creating and destroying a lot of objects in the process,
so that will provoke the garbage collector.  But you are also doing
reversing and searching, and that's slow.  Does your application
really need to be able to keep things in order like this, or do you
just want to know if a word is in the dictionary?  If you just want to
load up a dictionary and be able to see if words are in it, I would
use a dict instead of a list.
Even if you want to be able to print out the data in order, you might
consider using a dict instead of a list.  The print operation could
use one sort at the end, instead of searching all the nodes on
insertion.

You could also use a character that doesn't appear in your data as a
sentinel, and then you don't need a separate active indicator -- every
leaf node will be empty and be referred to by the sentinel character.

You are also doing a lot of recursive operations that could be done
without recursing.

Finally, do you really need to keep an additional object around for
each node in the tree?

I have modified your trie code to use a dict and a sentinel, while
keeping basically the same API.  This may or may not be the best way
to do this, depending on your other code which uses this data
structure.  It could also probably be made faster by removing the
setdefault, and not re-building objects when you need them, and even
this implementation will load faster if you disable the gc, but in any
case, this might give you some ideas about how to make your code go
faster.

Regards,
Pat

from collections import defaultdict

class TrieTop(object):
    sentinel = ' '  # Something not in the data

    def __init__(self, data=None):
        def defaultrecurse():
            return defaultdict(defaultrecurse)
        if data is None:
            data = defaultrecurse()
        self.data = data
    def insert(self, word):
        data = self.data
        for ch in word:
            data = data[ch]
        data[self.sentinel]
    def seek(self, word):
        data = self.data
        for ch in word:
            data = data.get(ch)
            if data is None:
                return EMPTY_TRIE
        return TrieTop(data)
    def load(self, file):
        for line in file:
            self.insert(line.strip().lower())
    def empty(self):
        return (not self.data)
    def endings(self, prefix=""):
        def recurse(data, prefix):
            if not data:
                yield prefix[:-1]
                return
            for ch, data in data.iteritems():
                for result in recurse(data, prefix + ch):
                    yield result
        return sorted(recurse(self.data, prefix))

EMPTY_TRIE = TrieTop()