[issue6370] Bad performance of colllections.Counter at initialisation from an iterable

Mon Jun 29 18:11:16 CEST 2009

Raymond Hettinger <rhettinger at users.sourceforge.net> added the comment:

Your proposed function assumes the input is a sequence and it cannot
handle not a general purpose iterable (it gives the wrong answer when
the latter is submitted).

Also, it does two lookups per item which can be expensive if the
elements have costly hash functions (such as Decimal objects, tuple
objects, and strings that haven't been interned).  

And, it does not return a dict subclass, so it cannot provide all of the
methods currently offered by Counter (such as most_common() and
elements()), nor does it override regular dict methods that do not make
sense for counters (such as the update() method).

Test data:

# case that give *wrong* answer because the input isn't reiterable
>>> unique(c.lower() for c in 'Abracadabra')

# case that is slower because of expensive multiple lookups
>>> unique([abs(Decimal(x)/4) for x in range(-2000, 2000)])

# case that fails because update() was not overridden
>>> c = unique(range(10))
>>> c.update(range(5))

# cases that do not provide needed subclass methods
>>> unique('abracadabra').most_common()
>>> unique('abracadabra').elements()

Though the code for unique() is hopeless, I will take a look at the
"self[elem] = self.get(elem, 0) + 1" approach.  That shows some promise.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6370>
_______________________________________