[Pandas-dev] pandas microperformance, do we care?

Tue Mar 29 03:50:08 EDT 2016

I've noticed a slow degradation in pandas microperformance as time has
gone by. I looked into this when I found that df.icol(i) has been
deprecated in favor of df.iloc[:, i].

df = pd.DataFrame(np.random.randn(10, 5))

So here we go:

pandas v0.12

%timeit df.icol(2)
100000 loops, best of 3: 13.5 µs per loop

pandas v0.18

%timeit df.icol(2)
10000 loops, best of 3: 25.4 µs per loop

In [6]: timeit df.iloc[:, 2]
10000 loops, best of 3: 60.8 µs per loop

Once upon a time, I spent a lot of time shaving microseconds off some of
these data accessor methods.

For example, pandas v0.12 again:

In [17]: s = df[2]

In [18]: timeit s.get_value(5)
1000000 loops, best of 3: 609 ns per loop

In [21]: timeit s[5]
1000000 loops, best of 3: 860 ns per loop

And pandas v0.18:

In [15]: timeit s.get_value(5)
100000 loops, best of 3: 7.17 µs per loop

In [16]: timeit s[5]
100000 loops, best of 3: 9.31 µs per loop
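
For anyone who wants to rerun this kind of comparison outside IPython, here
is a minimal sketch using the standard timeit module. It is not the script
behind the numbers above. The helper name per_call_us is hypothetical, and
s.iat[5] and s.to_numpy()[5] are extra accessors added for comparison that
do not appear in the timings above. icol and get_value are left out because
later pandas releases removed them.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the frame used in the timings above.
df = pd.DataFrame(np.random.randn(10, 5))
s = df[2]


def per_call_us(func, number=2000, repeat=3):
    """Hypothetical helper: best-of-`repeat` time per call, in microseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6


# Column and scalar accessors to compare (assumed choice of cases).
cases = {
    "df.iloc[:, 2]": lambda: df.iloc[:, 2],
    "df[2]": lambda: df[2],
    "s[5]": lambda: s[5],
    "s.iat[5]": lambda: s.iat[5],
    "s.to_numpy()[5]": lambda: s.to_numpy()[5],
}

results = {label: per_call_us(func) for label, func in cases.items()}
for label, us in results.items():
    print(f"{label:<18} {us:8.2f} us per call")
```

Absolute numbers will depend on the machine and on the pandas and NumPy
versions installed, so only ratios measured in the same environment are
meaningful.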

I understand that performance got worse because various layers of
indirection were added to make new features available (and to fix
bugs).
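
To illustrate how per-call overhead accumulates with indirection, here is a
toy sketch in plain Python. It is not pandas code: the classes DirectBox and
LayeredBox and the helper per_call_ns are hypothetical, and the extra method
hops only stand in for the kind of key checking and dispatch a real accessor
might do.

```python
import timeit


class DirectBox:
    """Hypothetical container: __getitem__ goes straight to the list."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, i):
        return self._values[i]


class LayeredBox:
    """Hypothetical container with extra Python-level hops per lookup."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, i):
        return self._get_with(self._check_key(i))

    def _check_key(self, key):
        # Stand-in for key validation.
        if not isinstance(key, int):
            raise TypeError("integer key expected")
        return key

    def _get_with(self, key):
        # Stand-in for a dispatch layer.
        return self._lookup(key)

    def _lookup(self, key):
        return self._values[key]


def per_call_ns(func, number=100000, repeat=3):
    """Hypothetical helper: best-of-`repeat` time per call, in nanoseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9


direct = DirectBox(range(10))
layered = LayeredBox(range(10))

direct_ns = per_call_ns(lambda: direct[5])
layered_ns = per_call_ns(lambda: layered[5])
print(f"direct  lookup: {direct_ns:7.1f} ns per call")
print(f"layered lookup: {layered_ns:7.1f} ns per call")
```

Both lookups return the same value; the layered one pays for three extra
Python function calls and a type check on every access, which matters most
when the accessor sits inside a tight loop.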

I'm hoping as part of looking at revamping pandas's internals (and closing
the gap to the "metal") that we are able to tighten up some of these "inner
loop" methods, preferably back to pandas 0.12-level performance. It's true
that writing a lot of Python for-loops isn't optimal for lots of reasons,
but we should avoid overly penalizing users when this does happen.

Thanks,
Wes