[Pandas-dev] Performance drop with v0.20 - changes between two dataframes

Robin Fishbein robinfishbein at yahoo.com
Thu May 11 10:41:08 EDT 2017


Thanks, Joris! I'll take a look at the PR. Turns out optimizing with something to the effect of
((pd.isnull(df1) & pd.isnull(df2)) |  (df1 == df2)).all(axis=1)
avoids this issue and runs the test files in 0.9 seconds with either v0.19 or v0.20.

-Robin

Sent from my iPhone

> On May 11, 2017, at 7:35 AM, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
> 
> Hi Robin,
> 
> I didn't yet look into your code example (and it might be that it can be optimized in other ways to also prevent the slowdown), but as it seems you have multi-indexes and you mention getitem, this is probably the same issue as reported here: https://github.com/pandas-dev/pandas/issues/16319, and with already a PR open to try to fix it: https://github.com/pandas-dev/pandas/pull/16324
> 
> You are always welcome to try out the PR and give feedback on whether that fixed the performance regression for you.
> 
> Regards,
> Joris
> 
> 2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev <pandas-dev at python.org>:
>> I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list.
>> 
>> The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression—
>> 
>> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))
>> 
>> —that is handled differently in 0.20. I'm not sure how, though, or what I could do to make this effective in v0.20.
>> 
>> * I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals().
>> ** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime).
>> 
>> Thanks!
>> -Robin
>> 
>> def delta(left, right, index_cols=None, suffixes=('_1', '_2'),
>>           reset_indexes=True, value_counts=True):
>>     df1 = left.copy()
>>     df2 = right.copy()
>>     if isinstance(index_cols, pd.Index):
>>         index_cols = list(index_cols)
>>     if index_cols:
>>         df1 = df1.set_index(index_cols)
>>         df2 = df2.set_index(index_cols)
>>     full = (pd.merge(df1, df2, left_index=True, right_index=True,
>>                      how='outer', suffixes=suffixes, indicator=True)
>>             .astype({'_merge': object}))
>>     index_both = full[full._merge == 'both'].index
>>     df1t = df1.reindex(index_both).T
>>     df2t = df2.reindex(index_both).T
>>     index_changes = pd.Index(i for i in index_both
>>                              if not df1t[i].equals(df2t[i]))
>>     if index_changes.size > 0:
>>         full.loc[index_changes, '_merge'] = 'c'
>>     mappings = {
>>         'both': 'm',        # match
>>         'right_only': 'a',  # add
>>         'c': 'c',           # change
>>         'left_only': 'd',   # delete
>>     }
>>     full._merge = full._merge.map(mappings)
>>     add = df2.reindex(full.loc[full._merge == 'a'].index)
>>     change = full.loc[full._merge == 'c'].drop('_merge', axis=1)
>>     delete = df1.reindex(full.loc[full._merge == 'd'].index)
>>     if reset_indexes:
>>         full = full.reset_index()
>>         add = add.reset_index()
>>         change = change.reset_index()
>>         delete = delete.reset_index()
>>     if value_counts:
>>         print(full._merge.value_counts())
>>     return full, add, change, delete
>> 
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20170511/6c39f3f8/attachment.html>


More information about the Pandas-dev mailing list