GC is very expensive: am I doing something wrong?

Xavier Ho contact at xavierho.com
Thu Mar 18 20:24:22 EDT 2010


Weeble,

Try to use the full arguments of insert(i, x), instead of using list slices.
Every time you create a slice, Python copies the list into a new memory
location with the sliced copy. That's probably a big performance impact
there if done recursively.

My 2cp,
Xav


On Fri, Mar 19, 2010 at 10:13 AM, Weeble <clockworksaint at gmail.com> wrote:

> I am loading a dictionary from a text file and constructing a trie
> data structure in memory. However, it takes longer than I'm happy with
> - about 12 seconds on my computer. I profiled it, came up with some
> clever ideas to cut down on the work (such as by exploiting the fact
> that the dictionary is sorted) and was only able to shave a small
> fraction of the time off. However, then I tried calling gc.disable()
> before loading the trie and it halved the running time! I was
> surprised. Is that normal? I thought that the cost of garbage
> collection would be in some way proportional to the amount of garbage
> created, but I didn't think I was creating any: as far as I can tell
> the only objects deallocated during the load are strings, which could
> not be participating in cycles.
>
> I have spent a lot of time in C#, where the standard advice is not to
> mess about with the garbage collector because you'll probably just
> make things worse. Does that advice apply in Python? Is it a bad idea
> to call gc.disable() before loading the trie and then enable it again
> afterwards? Does the fact that the garbage collector is taking so much
> time indicate I'm doing something particularly bad here? Is there some
> way to give the garbage collector a hint that the whole trie is going
> to be long-lived and get it promoted straight to generation 2 rather
> than scanning it over and over?
>
> $ python
> Python 2.6.4 (r264:75706, Dec  7 2009, 18:43:55)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> $ time python -c "import
> trie;t=trie.Trie();t.load(open('sowpods.txt'))"
>
> real    0m12.523s
> user    0m12.380s
> sys     0m0.140s
> $ time python -c "import
> trie;t=trie.Trie();t.load(open('sowpods.txt'))"
>
> real    0m12.592s
> user    0m12.480s
> sys     0m0.110s
> $ time python -c "import gc;gc.disable();import
> trie;t=trie.Trie();t.load(open('sowpods.txt'))"
>
> real    0m6.176s
> user    0m5.980s
> sys     0m0.190s
> $ time python -c "import gc;gc.disable();import
> trie;t=trie.Trie();t.load(open('sowpods.txt'))"
>
> real    0m6.331s
> user    0m5.530s
> sys     0m0.170s
>
>
> === trie.py ===
>
> class Trie(object):
>    __slots__=("root", "active")
>    def __init__(self):
>        self.root=[]
>        self.active=False
>    def insert(self, word):
>        if len(word) == 0:
>            self.active=True
>        else:
>            head = word[0]
>            for ch, child in reversed(self.root):
>                if ch == head:
>                    child.insert(word[1:])
>                    return
>            child = Trie()
>            self.root.append((head, child))
>            child.insert(word[1:])
>    def seek(self, word):
>        if len(word) == 0:
>            return self
>        head = word[0]
>        for ch, child in self.root:
>            if ch == head:
>                return child.seek(word[1:])
>        return EMPTY_TRIE
>    def load(self, file):
>        for line in file:
>            self.insert(line.strip().lower())
>    def empty(self):
>        return (not self.root) and not self.active
>    def endings(self, prefix=""):
>        if self.active:
>            yield prefix
>        for ch, child in self.root:
>            for ending in child.endings(prefix+ch):
>                yield ending
>
> EMPTY_TRIE = Trie()
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20100319/2013db8c/attachment-0001.html>


More information about the Python-list mailing list