[Pandas-dev] Help speeding up altered groupby.value_counts

fuller.evan at gmail.com
Tue Jul 28 16:27:29 EDT 2020


All,

 

I'm a new contributor to pandas and have been working to fix a couple of bugs
in the value_counts methods (pull request
https://github.com/pandas-dev/pandas/pull/33652).  I'm looking for a bit of
help in maintaining speedy performance for
https://github.com/DataInformer/pandas-1/blob/value_counts_normalize/pandas/core/groupby/generic.py

 

The SeriesGroupBy.value_counts method required a significant rewrite in order
to achieve correct behavior with dropna and normalize.  After fixing that, I
was asked to run performance tests, which unfortunately do show a
significant performance hit for that method.  I have been looking at how to
close that gap as much as possible, but I've found only a few minor tweaks.
When I profile with cProfile, I don't notice any clear offenders: numpy array
functions take a lot of time in total
(numpy.core._multiarray_umath.implement_array_function), but no particular
function stands out as slow.  Similarly, timeit experiments suggest that
array concatenation is relatively slow, but not much different from other
options like appending in the next function (e.g. whether I write
np.diff(np.nonzero(np.r_[changes, True])) or
np.diff(np.nonzero(changes), append=len(changes)), there's not much of a
timing difference).  I have tried to do as little as possible with the
multiindex, rebuilding it at the end.  I would welcome any help or
suggestions for how to make SeriesGroupBy.value_counts faster.
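To make the two run-length variants concrete, here is a minimal sketch of
what I timed.  The codes array and changes mask below are made up for
illustration; they stand in for the sorted group labels in generic.py, not
the actual variables there:

```python
import numpy as np

# Made-up sorted group labels standing in for the real codes in generic.py.
codes = np.array([0, 0, 0, 1, 1, 2])

# True at the last position before a group boundary.
changes = codes[1:] != codes[:-1]

# Variant 1: append a sentinel True, then diff the boundary positions.
run_lengths_a = np.diff(np.nonzero(np.r_[changes, True])[0], prepend=-1)

# Variant 2: let np.diff supply the final boundary via append=.
run_lengths_b = np.diff(np.nonzero(changes)[0], prepend=-1, append=len(changes))

print(run_lengths_a)  # [3 2 1]
print(run_lengths_b)  # [3 2 1]
```

Both give the per-group run lengths, and in my timeit runs neither version
was meaningfully faster than the other.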

 

Thanks,

Evan Fuller

 

