[Pandas-dev] DataFrame.value_counts
Daniel Saxton
dsaxton at pm.me
Sun Sep 15 15:17:06 EDT 2019
Currently in pandas, to count the values in a single column of a DataFrame we use df["a"].value_counts(), but to count combinations of values across more than one column we (as far as I know) have to switch syntax and use df.groupby(["a", "b"]).size(). This is a little awkward code-wise and likely carries some unnecessary overhead, since we don't actually need to build a groupby object that can handle an arbitrary calculation on the subframes. There's some evidence of this overhead in the Series case:
import numpy as np
import pandas as pd
s = pd.Series(np.random.randint(1, 10, 10**6))
# %timeit is an IPython magic, so run these lines in IPython or Jupyter
%timeit s.value_counts()
# 6.74 ms ± 78.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s.groupby(s).size()
# 11.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
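For the timing comparison to be fair, the two calls should return the same counts. A small check along these lines, outside IPython, suggests they do; only the ordering differs, since value_counts sorts by frequency and groupby sorts by key. The variable names vc and gb and the fixed seed are just for illustration.

```python
import numpy as np
import pandas as pd

# Seeded only so the illustration is reproducible.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(1, 10, 10**4))

vc = s.value_counts()    # sorted by count, descending
gb = s.groupby(s).size() # sorted by key, ascending

# Align both results on the key before comparing the counts.
same_counts = vc.sort_index().to_dict() == gb.to_dict()
print(same_counts)
```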
I think it would be useful and more efficient if there were a DataFrame.value_counts method that takes a required columns argument indicating the columns whose value combinations we want to count. This seems like a common enough operation that it might be worthwhile to add, but I wanted to see what other opinions there were on this. I know pandas already has a huge number of methods and it's good to resist adding more, but I would see this more as "filling out" rather than "adding to" the API.
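As a rough sketch of the proposed behavior, the method could be emulated today with groupby. The function name value_counts_frame and its columns parameter are hypothetical, not an existing pandas API, and this version still goes through groupby, so it shows only the interface and not the hoped-for speedup.

```python
import pandas as pd


def value_counts_frame(df, columns):
    """Hypothetical helper: count each combination of values in `columns`.

    Returns a Series indexed by the combinations, most frequent first,
    mirroring how Series.value_counts orders its result.
    """
    counts = df.groupby(list(columns)).size()
    return counts.sort_values(ascending=False)


df = pd.DataFrame(
    {
        "a": [1, 1, 2, 2, 2],
        "b": ["x", "x", "y", "y", "z"],
    }
)

result = value_counts_frame(df, ["a", "b"])
print(result)
```

A real implementation would presumably skip the general groupby machinery, which is where the efficiency gain suggested by the Series timings would come from.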