[Pandas-dev] DataFrame.value_counts

Daniel Saxton dsaxton at pm.me
Sun Sep 15 15:17:06 EDT 2019


Currently in pandas, to count the values in a single column of a DataFrame we use df["a"].value_counts(), but to count combinations of values across more than one column we (as far as I know) have to switch syntax and use df.groupby(["a", "b"]).size().  This is a little awkward code-wise and likely carries some unnecessary overhead, since we don't actually need to build a groupby object that can handle an arbitrary calculation on the subframes.  There's some evidence of this overhead in the Series case:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(1, 10, 10**6))

%timeit s.value_counts()
# 6.74 ms ± 78.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit s.groupby(s).size()
# 11.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
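
For reference, here is a small sketch of the two spellings side by side. The toy DataFrame and the variable names are made up purely for illustration.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": ["x", "x", "x", "y", "y"]})

# Single column: value_counts on the Series
single = df["a"].value_counts()

# Combinations of two columns: switch to groupby(...).size(),
# which returns a Series indexed by (a, b) pairs
combos = df.groupby(["a", "b"]).size()
```

Both results are Series of counts, but the second one has to go through the groupby machinery to get there.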

I think it would be useful and more efficient if there were a DataFrame.value_counts method that takes a required columns argument naming the columns whose value combinations we want to count.  This seems like a common enough operation that adding it might be worthwhile, but I wanted to see what other opinions there are on this.  I know pandas already has a huge number of methods and it's good to resist adding more, but I see this more as "filling out" rather than "adding to" the API.
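
To make the intended behavior concrete, here is a rough sketch written as a standalone function. The name frame_value_counts and its signature are hypothetical, not an existing pandas API, and the body simply wraps groupby, so it only illustrates the interface and would not remove the overhead mentioned above. Sorting by descending count is an assumption, chosen to mirror Series.value_counts.

```python
import pandas as pd


def frame_value_counts(df, columns):
    """Hypothetical helper: count combinations of values in `columns`.

    Illustration only; a real implementation would presumably avoid
    building a general-purpose groupby object.
    """
    counts = df.groupby(list(columns)).size()
    # Assumption: most frequent combinations first, like Series.value_counts
    return counts.sort_values(ascending=False)


# Toy data, invented for this example
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": ["x", "x", "x", "y", "y"]})
result = frame_value_counts(df, ["a", "b"])
```

The result is a Series with a MultiIndex over the requested columns, one entry per combination that occurs in the data.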