Dataframe iterating question : 3 ways of calling a row and column

Tue Aug 22 07:39:24 EDT 2017

On 8/21/17, zach.smith at orthofi.com <zach.smith at orthofi.com> wrote:
> I wouldn't say I'm a Python noob, but I wouldn't say I'm a Python expert
> either. I work in data science and use Pandas Dataframes a lot. My question
> is regarding the difference in calling out a specific row, column
> combination in a dataframe.
>
> I see 3 ways of doing this:
> (1) df.loc[row_ind, column_ind]
> (2) df.column_ind.loc[row_ind]
> (3) df[column_ind].loc[row_ind]
>
> where column_ind is the column name & row_ind is the named row index/row
> name in the dataframe.
>
> Can anyone enlighten me as to the differences between the above 3 methods of
> getting to the same cell in the dataframe?
> Are there speed differences?
> Is it simply a preference thing?
> Is there a PEP8 preferred way of doing this?
> Are there specific disadvantages to any of the methods?
>
> Thanks in advance.
> Zach

First of all, I am not an expert in pandas or Python either. These are
just a few thoughts...

I don't think PEP 8 says anything about this.

df.column_id calls __getattr__, which (after some checks) returns
df[column_id]. So it is slightly slower than df[column_id].
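
A minimal sketch of that equivalence, on a small made-up frame (the
column named 'count' is my own addition for illustration). Attribute
access gives the same column as item access, but normal attribute lookup
comes first, so a column whose name collides with a DataFrame method is
only reachable with brackets:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
                  columns=['Aaaaa', 'Bbbbb', 'count'])

# Attribute access and item access give the same column.
same = df.Aaaaa.equals(df['Aaaaa'])
print(same)

# 'count' is also a DataFrame method, so df.count is the method,
# not the column; only brackets reach the column.
attr_is_method = callable(df.count)
col = df['count']
print(attr_is_method, col.tolist())
```

So beyond speed, the bracket form is the one that works for every column
name.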

But if you are working in a REPL, you may be happy to use tab
completion to get the column name.

Or if you are writing a paper in (for example) a Jupyter notebook, the
readability of your formulas could count for more than speed!

BTW, there are more ways to do it (and I may have missed many others) ->

(4) df.column_id[row_ind]

(5) df.get_value(row_ind, column_ind)

(6) df.ix[row_ind, column_ind]

Interestingly, the doc at
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ix.html
doesn't say that it is deprecated (see
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-ix
).
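
For anyone reading this on a newer pandas: as far as I know, both .ix and
get_value were removed in pandas 1.0, and the scalar accessors .at (by
label) and .iat (by position) cover the same job. A minimal sketch,
assuming pandas 1.0 or later:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
                  columns=['Aaaaa', 'Bbbbb', 'Ccccc'])

by_label = df.at[1, 'Aaaaa']   # label-based scalar access
by_pos = df.iat[1, 0]          # position-based scalar access
via_loc = df.loc[1, 'Aaaaa']   # general label-based indexer

# All three point at the same cell (row label 1, first column).
print(by_label, by_pos, via_loc)
```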

A quick and crude test (Python 3.6.2, IPython 6.1.0, pandas 0.20.3)
gave me these results (sorted from slowest to fastest):

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
                  columns=['Aaaaa', 'Bbbbb', 'Ccccc'])

%timeit df.ix[1,0]  # this is deprecated!
188 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.Aaaaa.loc[1]
46.2 µs ± 908 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df['Aaaaa'].loc[1]
42.6 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

aa = df.Aaaaa
%timeit aa[1]
16.6 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df.iloc[1,0]
14.5 µs ± 92.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.loc[1,'Aaaaa']
13.8 µs ± 251 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.ix[1,'Aaaaa']  # this is deprecated!
8.68 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.get_value(1, 'Aaaaa')
3.51 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
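
If you want to repeat this outside IPython, here is a rough sketch using
the standard timeit module. It only covers accessors that still exist in
current pandas, and it uses .at in place of get_value; that swap is my
assumption, not what I measured above. The numbers will differ by
machine and pandas version, so treat the ordering as the interesting
part:

```python
import timeit

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
                  columns=['Aaaaa', 'Bbbbb', 'Ccccc'])

# Each candidate reads the same cell: row label 1, column 'Aaaaa'.
candidates = {
    "df.Aaaaa.loc[1]": lambda: df.Aaaaa.loc[1],
    "df['Aaaaa'].loc[1]": lambda: df['Aaaaa'].loc[1],
    "df.iloc[1, 0]": lambda: df.iloc[1, 0],
    "df.loc[1, 'Aaaaa']": lambda: df.loc[1, 'Aaaaa'],
    "df.at[1, 'Aaaaa']": lambda: df.at[1, 'Aaaaa'],
}

number = 1000
results = {}
for label, func in candidates.items():
    # Best of 3 repeats, converted to microseconds per call.
    best = min(timeit.repeat(func, number=number, repeat=3))
    results[label] = best / number * 1e6

# Print from slowest to fastest, like the list above.
for label, usec in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:22s} {usec:8.2f} µs per call")
```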

But I want to add that thinking about the speed of getting one value
could be thinking in the wrong direction, because using built-in
functions could be better.

I just ran another crude benchmark (and it gave quite the opposite result :P) ->

%timeit sum(df.sum())
150 µs ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

def tst():
    summa = 0
    for i in range(len(df)):
        for j in df.columns:
            summa += df.get_value(i, j)
    return summa
%timeit tst()
37.6 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

But with an only slightly bigger data frame, the built-in function is
about 10x faster! ->

df = pd.DataFrame(data=[[1+3*i, 2+3*i, 3+3*i]  for i in range(100)],
columns=['Aaaaa', 'Bbbbb', 'Ccccc'])

def tst():
    summa = 0
    for i in range(len(df)):
        for j in df.columns:
            summa += df.get_value(i, j)
    return summa
%timeit tst()
1.67 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum(df.sum())
151 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
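
A small sketch to check that the two approaches really compute the same
thing on the 100-row frame. The helper names loop_sum and builtin_sum are
made up for illustration, and the loop uses .at instead of get_value so
that it also runs on current pandas (again my assumption, not what was
timed above):

```python
import pandas as pd

df = pd.DataFrame(data=[[1 + 3*i, 2 + 3*i, 3 + 3*i] for i in range(100)],
                  columns=['Aaaaa', 'Bbbbb', 'Ccccc'])


def loop_sum(frame):
    """Sum every cell one at a time with the scalar accessor .at."""
    total = 0
    for i in frame.index:
        for j in frame.columns:
            total += frame.at[i, j]
    return total


def builtin_sum(frame):
    """Sum each column with pandas, then sum the column totals."""
    return frame.sum().sum()


print(loop_sum(df) == builtin_sum(df))
```

The loop pays the per-cell access cost 300 times here, while the built-in
version does the work per column, which is why the gap grows with the
size of the frame.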