[Pandas-dev] Performance drop with v0.20 - changes between two dataframes

Robin Fishbein robinfishbein at yahoo.com
Thu May 11 03:07:59 EDT 2017


I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list.
The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression—
pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))
—that is handled differently in 0.20. I'm not sure how, though, or what I could do to make this effective in v0.20.
* I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals().** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime).
Thanks!-Robin
def delta(left, right, index_cols=None, suffixes=('_1', '_2'),          reset_indexes=True, value_counts=True):    df1 = left.copy()    df2 = right.copy()    if isinstance(index_cols, pd.Index):        index_cols = list(index_cols)    if index_cols:        df1 = df1.set_index(index_cols)        df2 = df2.set_index(index_cols)    full = (pd.merge(df1, df2, left_index=True, right_index=True,                     how='outer', suffixes=suffixes, indicator=True)            .astype({'_merge': object}))    index_both = full[full._merge == 'both'].index    df1t = df1.reindex(index_both).T    df2t = df2.reindex(index_both).T    index_changes = pd.Index(i for i in index_both                             if not df1t[i].equals(df2t[i]))    if index_changes.size > 0:        full.loc[index_changes, '_merge'] = 'c'    mappings = {        'both': 'm',        # match        'right_only': 'a',  # add        'c': 'c',           # change        'left_only': 'd',   # delete    }    full._merge = full._merge.map(mappings)    add = df2.reindex(full.loc[full._merge == 'a'].index)    change = full.loc[full._merge == 'c'].drop('_merge', axis=1)    delete = df1.reindex(full.loc[full._merge == 'd'].index)    if reset_indexes:        full = full.reset_index()        add = add.reset_index()        change = change.reset_index()        delete = delete.reset_index()    if value_counts:        print(full._merge.value_counts())    return full, add, change, delete
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20170511/bea7d157/attachment.html>


More information about the Pandas-dev mailing list