[Python-Dev] counterintuitive behavior (bug?) in Counter with +=

Mon Oct 3 12:12:47 CEST 2011

Hello,

[First off, I'm not a member of this list, so please Cc: me in a reply!]

I've found some counterintuitive behavior in collections.Counter while
hacking on the scikit-learn project [1]. I wanted to use a bunch of
Counters to do some simple term counting in a set of documents,
roughly as follows:

    from collections import Counter

    count_total = Counter()
    count_per_doc = []
    for doc in documents:
        count_current = Counter(analyze(doc))
        count_total += count_current
        count_per_doc.append(count_current)

Because we target Python 2.5+ and Counter was only added in 2.7, I
implemented a lightweight replacement with just the functionality we
need, including __iadd__. Then my co-developer ran the above code on
Python 2.7, and performance was horrible. After some digging, I found
out that Counter [2] does not define __iadd__, so += falls back to
__add__, which copies the entire left-hand side on every +=!
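To make the cost concrete, here is a small sketch (the variable names
are made up for illustration) contrasting binary +, which builds a new
Counter, with the update method, which modifies its receiver in place:

```python
from collections import Counter

total = Counter("aab")
alias = total                      # second name for the same object

# __add__ builds and returns a brand-new Counter; `total` is untouched.
summed = total + Counter("bcc")
assert summed is not total

# update() adds the counts into the existing object instead.
total.update(Counter("bcc"))
assert total is alias
assert total == summed
```

Since the new Counter returned by + has to contain every key of the
left-hand side, each += in the loop does work proportional to the size
of the running total rather than to the size of the current document.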

I also figured out that I should use the update method instead, which
I will, but I still find that uglier than +=. I would submit a patch
to implement __iadd__, but I first want to know if that's considered
the right behavior, since it changes the semantics of +=:

    >>> from collections import Counter
    >>> a = Counter([1,2,3])
    >>> b = a
    >>> a += Counter([3,4,5])
    >>> a is b
    False

would become

    # snip
    >>> a is b
    True

TIA,
Lars

[1] https://github.com/scikit-learn/scikit-learn/commit/de6e93094499e4d81b8e3b15fc66b6b9252945af
[2] http://hg.python.org/cpython/file/tip/Lib/collections/__init__.py#l399

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam