speed up pandas calculation

Wed Jul 30 21:11:55 EDT 2014

On Wed, Jul 30, 2014 at 5:57 PM, Vincent Davis <vincent at vincentdavis.net>
wrote:
>
> On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis <vincent at vincentdavis.net>
> wrote:
>
>> The real slow part seems to be
>> for n in drugs:
>>     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)
>>
>
> I was wrong, this is fast, it was selecting the columns that was slow.
> using
>

And that shows why it is important to profile before attempting to
optimize :).

>  keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2',
> 'MED3', 'MED4', 'MED5']
> df = df[keep_col]
>
> took the time down from 19sec to 2 sec.
>
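
To make the profiling point above concrete, here is a minimal sketch using
the standard library's cProfile and pstats. The functions select_columns,
flag_drug and pipeline, and the sample rows, are hypothetical stand-ins that
I made up for illustration; they are not your actual pandas code.

```python
import cProfile
import io
import pstats


def select_columns(rows, keep):
    # Hypothetical stand-in for the column-selection step.
    return [{k: row[k] for k in keep} for row in rows]


def flag_drug(rows, drug_code):
    # Hypothetical stand-in for the isin(...).any(1) step.
    meds = ('MED1', 'MED2', 'MED3', 'MED4', 'MED5')
    return [any(row[m] == drug_code for m in meds) for row in rows]


def pipeline(rows):
    keep = ('PATCODE', 'MED1', 'MED2', 'MED3', 'MED4', 'MED5')
    return flag_drug(select_columns(rows, keep), 7)


# Made-up sample data: 20000 rows, only MED1 varies.
rows = [{'PATCODE': i, 'PATWT': 1.0, 'MED1': i % 10, 'MED2': 0,
         'MED3': 0, 'MED4': 0, 'MED5': 0} for i in range(20000)]

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
flags = pipeline(rows)
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(10)
print(report.getvalue())
```

The per-function timings in the report show which step dominates, which is
the information you want before rewriting anything.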

On Wed, Jul 30, 2014 at 5:57 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:

> ['a', 'b', 'c', 'd', 'e', ..., 'zzz']
>
> that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you
> delete item 0, 'a':
>
> => ['b', 'c', 'd', 'e', ..., 'zzz']
>
> Python has to move the remaining 18278 items across one space. Then you
> delete 'b':
>

Really minor issue: I believe this should read 18277 items, since one of
the original 18278 has already been deleted :).
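
A quick way to check the arithmetic is sketched below. Building the list with
itertools.product is my own assumption about how such a list could be
generated; it is not something from the original example.

```python
from itertools import product
from string import ascii_lowercase

# All lowercase strings of length 1, 2 and 3: 'a', ..., 'z', 'aa', ..., 'zzz'.
items = [''.join(p)
         for n in (1, 2, 3)
         for p in product(ascii_lowercase, repeat=n)]

total = len(items)       # 26 + 26**2 + 26**3
del items[0]             # delete 'a'; everything after it shifts down by one
remaining = len(items)   # one fewer than the total

print(total, remaining, items[0], items[-1])
```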

> => ['c', 'd', 'e', ..., 'zzz']
>
> I'm not familiar with pandas and am not sure about the exact syntax
> needed, but something like:
>
> new_df = []  # Assuming df is a list.
> for col in df:
>     if col.value in keep_col:
>         new_df.append(col)
>

Another way to write this, using a list comprehension (untested):
new_df = [col for col in df if col.value in keep_col]
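
The snippets above assume df is a list of objects with a .value attribute.
For an actual pandas DataFrame, iterating over it yields the column labels,
so a sketch of the same idea might look like the following. The sample data
and the 'EXTRA' column are made up for illustration.

```python
import pandas as pd

# Made-up sample frame; 'EXTRA' stands for any column we want to drop.
df = pd.DataFrame({
    'PATCODE': [1, 2],
    'PATWT': [10.0, 12.5],
    'MED1': [5, 7],
    'EXTRA': ['x', 'y'],
})

keep_col = {'PATCODE', 'PATWT', 'MED1'}

# Iterating over a DataFrame yields its column labels, in order.
wanted = [col for col in df if col in keep_col]

# Indexing with a list of labels returns a new frame with only those columns.
new_df = df[wanted]
print(new_df)
```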

Also note that, although keep_col as shown is fairly short, you may also
see performance gains if keep_col is a set (O(1) average lookup) rather
than a list (O(n) lookup). You would do this by using:

keep_col = set(('PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
'MED3', 'MED4', 'MED5'))

rather than your existing:

keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
'MED3', 'MED4', 'MED5']

This can apply anywhere you use the "in" operator against a collection you
test repeatedly. Note, however, that building the set has its own cost, so
make sure the set is created once, outside of any large loop, rather than
on every iteration.
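
If you want to measure this yourself, here is a rough sketch using timeit.
The columns list and the filter_with helper are hypothetical, and the actual
timings will depend on your machine and data, so treat the numbers as
indicative only.

```python
import timeit

keep_list = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
             'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
keep_set = set(keep_list)  # built once, outside the loop

# Hypothetical column names: 1000 unwanted ones followed by the wanted ones.
columns = ['COL%d' % i for i in range(1000)] + keep_list


def filter_with(container):
    # Keep only the names found in the given container.
    return [c for c in columns if c in container]


def filter_rebuilding_set():
    # Anti-pattern: the set is rebuilt for every membership test.
    return [c for c in columns if c in set(keep_list)]


list_time = timeit.timeit(lambda: filter_with(keep_list), number=200)
set_time = timeit.timeit(lambda: filter_with(keep_set), number=200)
rebuild_time = timeit.timeit(filter_rebuilding_set, number=200)

print('list lookup:        %.4f s' % list_time)
print('set lookup:         %.4f s' % set_time)
print('set rebuilt inside: %.4f s' % rebuild_time)
```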

Chris