Computations on pandas dataframes

Peter Otten __peter__ at web.de
Sat May 26 06:30:07 EDT 2018


junkaccount36 at outlook.com wrote:

> Hi,
> 
> Python newbie here. I need help with the following two tasks I need to
> accomplish using Python:
> 
> ------------------------
> 
> Creating a matrix of rolling variances
> 
> I have a pandas data frame of six columns, I would like to iteratively
> compute the variance along each column. Since I am a newbie, I don't
> really understand the niceties of the language and common usage patterns.
> What is the common Python idiom for achieving the following?

Knowledge of the language doesn't help much here; pandas and numpy are a 
world of their own. One rule I apply as an amateur: when you have to resort 
to Python loops it gets slow ;)

> vars = []
> for i in range(1, 100000):
>     v = (data.iloc[range(0, i+1)].var()).values
>     if len(vars) == 0:
>         vars = v
>     else:
>         vars = np.vstack((vars, v))
> 
> Also, when I run this code, it takes a long time to execute. Can anyone
> suggest how to improve the running time?

I think I would forgo pandas and use numpy directly:

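import numpy as np

N, M = 1000, 6  # example sizes; use the shape of your own data
a = np.random.random((N, M))
vars = np.empty((N-1, M))
for i in range(1, N):
    # sample variance (ddof=1, the pandas default) of rows 0..i
    vars[i-1] = a[:i+1].var(axis=0, ddof=1)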

While it doesn't avoid the loop it may not need to copy as much data as your 
version.
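
If you would rather stay in pandas, the loop can probably be avoided 
altogether with an expanding window. The following is only a minimal sketch 
under my own assumptions: the sizes N and M and the random test data are 
made up for illustration, and I assume you want the sample variance 
(ddof=1) of rows 0..i for every i >= 1, as in your loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N, M = 1000, 6  # illustrative sizes, not taken from the original post
data = pd.DataFrame(rng.random((N, M)))

# Expanding window: row i holds the variance of rows 0..i (ddof=1 by default).
# min_periods=2 because the sample variance of a single row is undefined;
# the first (NaN) row is dropped so the result has N-1 rows like the loop.
expanding_vars = data.expanding(min_periods=2).var().iloc[1:].to_numpy()

# Cross-check against the explicit loop on the first few windows.
loop_vars = np.array([data.iloc[:i + 1].var().to_numpy() for i in range(1, 50)])
assert np.allclose(expanding_vars[:49], loop_vars)
```

The last row of expanding_vars should then agree with data.var() over the 
whole frame.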

> Pandas dataframe: sum of exponentially weighted correlation matrices per
> row
> 
> Consider the following dataframe:
> 
> df = pd.DataFrame(np.random.random((200,3)))
> df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
> df = df.set_index(['date'])
> 
> 
> date 0 1 2 3 4 5
> 2000-01-01 0.101782 0.111237 0.177719 0.229994 0.298786 0.747169
> 2000-01-02 0.348568 0.916997 0.527036 0.998144 0.544261 0.824907
> 2000-01-03 0.095015 0.480519 0.493345 0.632072 0.965326 0.244732
> 2000-01-04 0.502706 0.014287 0.045354 0.461621 0.359125 0.489150
> 2000-01-05 0.559364 0.337121 0.763715 0.460163 0.515309 0.732979
> 2000-01-06 0.488153 0.149655 0.015616 0.658693 0.864032 0.425497
> 2000-01-07 0.266161 0.392923 0.606358 0.286874 0.160191 0.573436
> 2000-01-08 0.786785 0.770826 0.202838 0.259263 0.732071 0.546918
> 2000-01-09 0.739847 0.886894 0.094900 0.257210 0.264688 0.005631
> 2000-01-10 0.615846 0.347249 0.516575 0.886096 0.347741 0.259998
> 
> Now, I want to treat each row as a vector and perform a multiplication
> like this:
> 
> [[0.101782]] [[0.101782 0.111237 0.177719 0.229994 0.298786 0.747169]]
> [[0.111237]]
> [[0.177719]]
> [[0.229994]]
> [[0.298786]]
> [[0.747169]]
> 
> For the i-th row, let's call this X_i. Now I have a parameter alpha and I
> want to multiply X_i with alpha^i and sum across all the i's. In the real
> world, I can have thousands of rows so I need to do this with reasonably
> good performance.

Again I'd use numpy directly. If I understand your second problem correctly:

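import numpy as np

a = np.random.random((200, 3))
alpha = .5
# one weight per row, shaped (200, 1) so it broadcasts across the columns
b = alpha ** np.arange(200)[:, np.newaxis]
c = b * a ** 2
print(c.sum(axis=0))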

If I got it wrong -- could you provide a complete (small!) example with the 
intermediate results?
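
In case you do want the full matrix X_i (the outer product of row i with 
itself) rather than just the squared entries, here is a sketch of one 
possible reading. It assumes the weights start at alpha**0 for the first 
row; the names n_rows, n_cols, weights and weighted_sum are mine, not from 
your post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 200, 3  # illustrative sizes
a = rng.random((n_rows, n_cols))
alpha = 0.5
# alpha**0, alpha**1, ...; starting the exponent at 0 is an assumption
weights = alpha ** np.arange(n_rows)

# sum_i alpha**i * outer(a[i], a[i]), written as a single matrix product
weighted_sum = a.T @ (weights[:, np.newaxis] * a)

# The same quantity with an explicit loop, as a sanity check
check = sum(w * np.outer(row, row) for w, row in zip(weights, a))
assert np.allclose(weighted_sum, check)
```

The diagonal of weighted_sum is the weighted sum of the squared columns, 
so the matrix product gives the off-diagonal terms at little extra cost.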



