collections.Counter surprisingly slow

Roy Smith roy at panix.com
Sun Jul 28 15:59:04 EDT 2013


I've been doing an informal "intro to Python" lunchtime series for some 
co-workers (who are all experienced programmers, in other languages).  
This week I was going to cover list comprehensions, exceptions, and 
profiling. So, I did a little demo showing different ways to build a 
dictionary counting how many times a string appears in some input:

   test() implements a "look before you leap" python loop

   exception() uses a try/catch construct in a similar python loop

   default() uses a defaultdict

   count() uses a Counter

I profiled it, to show how the profiler works.  The full code is below. 
 The input is an 8.8 Mbyte file containing about 570,000 lines (11,000 
unique strings).  Python 2.7.3 on Ubuntu Precise.

As I expected, test() is slower than exception(), which is slower than 
default().  I'm rather shocked to discover that count() is the slowest 
of all!  I expected it to be the fastest.  Or, certainly, no slower 
than default().

The full profiler dump is at the end of this message, but the gist of 
it is:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.322    0.322 ./stations.py:42(count)
    1    0.159    0.159    0.159    0.159 ./stations.py:17(test)
    1    0.114    0.114    0.114    0.114 ./stations.py:27(exception)
    1    0.097    0.097    0.097    0.097 ./stations.py:36(default)

Why is count() [i.e. collections.Counter] so slow?

-------------------------------------------------------------
from collections import defaultdict, Counter

def main():
    lines = open('stations.txt').readlines()

    d1 = test(lines)
    d2 = exception(lines)
    d3 = default(lines)
    d4 = count(lines)

    print d1 == d2
    print d1 == d3
    print d1 == d4

def test(lines):
    d = {}
    for station in lines:
        if station in d:
            d[station] += 1
        else:
            d[station] = 1
    return d


def exception(lines):
    d = {}
    for station in lines:
        try:
            d[station] += 1
        except KeyError:
            d[station] = 1
    return d

def default(lines):
    d = defaultdict(int)
    for station in lines:
        d[station] += 1
    return d

def count(lines):
    d = Counter(lines)
    return d


if __name__ == '__main__':
    import cProfile
    import pstats
    cProfile.run('main()', 'stations.stats')
    p = pstats.Stats('stations.stats')
    p.sort_stats('cumulative').print_stats()
-------------------------------------------------------------

         570335 function calls (570332 primitive calls) in 0.776 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1    0.023    0.023    0.776    0.776 <string>:1(<module>)
   1    0.006    0.006    0.753    0.753 ./stations.py:5(main)
   1    0.000    0.000    0.322    0.322 ./stations.py:42(count)
   1    0.000    0.000    0.322    0.322 /usr/lib/python2.7/collections.py:407(__init__)
   1    0.242    0.242    0.322    0.322 /usr/lib/python2.7/collections.py:470(update)
   1    0.159    0.159    0.159    0.159 ./stations.py:17(test)
   1    0.114    0.114    0.114    0.114 ./stations.py:27(exception)
   1    0.097    0.097    0.097    0.097 ./stations.py:36(default)
   570285    0.080    0.000    0.080    0.000 {method 'get' of 'dict' objects}
   1    0.055    0.055    0.055    0.055 {method 'readlines' of 'file' objects}
   1    0.000    0.000    0.000    0.000 {isinstance}
   1    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/abc.py:128(__instancecheck__)
      2/1    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/abc.py:148(__subclasscheck__)
      3/1    0.000    0.000    0.000    0.000 {issubclass}
        4    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:58(__iter__)
        1    0.000    0.000    0.000    0.000 {open}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:26(__exit__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:20(__enter__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:36(__init__)
        3    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:68(__contains__)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:52(_commit_removals)
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:81(add)
        3    0.000    0.000    0.000    0.000 {getattr}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_weakrefset.py:16(__init__)
        2    0.000    0.000    0.000    0.000 {method 'remove' of 'set' objects}
        2    0.000    0.000    0.000    0.000 {method '__subclasses__' of 'type' objects}
        2    0.000    0.000    0.000    0.000 /home/roy/deploy/current/python/lib/python2.7/_abcoll.py:97(__subclasshook__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        4    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}



More information about the Python-list mailing list