From wesmckinn at gmail.com Wed May 3 19:09:43 2017 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 3 May 2017 19:09:43 -0400 Subject: [Pandas-dev] Developing libpandas as a separate codebase + Componentization + Deferred "pandas expressions" Message-ID: hi folks, Bit of a multi-tiered discussion, but it's all somewhat related so putting it all in one e-mail. *TOPIC ONE:* I have been thinking about how to proceed with pandas 2.0 development in a sane way with the following goals: * Delivering some incrementally valuable functionality to production pandas users (e.g. faster CSV reader, some faster algorithms). There might be faster multithreaded code we can make available via a memory layout conversion layer (NaNs to bitmaps, etc.) * Being able to install subcomponents of pandas 2.0 (like libpandas) alongside production pandas to get feedback from users, particularly around low-level data semantics (copy-on-write, etc.) * Migrating compiled code and utility algorithms out of current pandas codebase Changing the internals of Series and DataFrame is going to be a difficult process (and frankly, it would be easier to build a brand new project, but I am not going to advocate for that). But I think one way we can make things easier is by developing "libpandas" and its Python bindings as a separate codebase. What goes in libpandas? In my view: * The semantic contents of pandas._lib * New "guts" of Series and DataFrame, what I've been colloquially calling pandas.Array (Series with no Index) and pandas.Table (DataFrame with no index) * New implementations of Index, based on libpandas.Array * Computational libraries that presume a particular memory layout: pandas.core.algorithms, pandas.core.ops, pandas.core.nanops, etc. * Low-level IO code (moving data from other formats into new pandas data structures) The idea is that libpandas would (someday) be a hard dependency of pandas, and contain most or all of the compiled code in pandas. To simplify things for most contributors, we could publish nightly dev wheels or conda packages so that you can update libpandas in your dev environment and proceed with developing pure Python code. Let me know what you think. I've spent the majority of my net development time over the last year hardening Apache Arrow as a C++ library we can use in libpandas for physical columnar in-memory management, so I'm ready (now with the Arrow 0.3 release about ready to drop) to start making some more progress on this. *TOPIC TWO:* We discussed this on the last dev meeting call, but I wanted to see what others think and if there's some action items. To help with more frequent pandas releases, particularly of subcomponents which are pure Python, I wonder if we could move toward a release model of "pandas" as a metapackage for a series of subcomponents which are packaged independently. As an example pandas depends on pandas_display (Display for humans) pandas_io pandas_plotting pandas_core and so forth. I think it would be better to go with a single codebase for this; I don't have a strong opinion about having separate release cycles, it's more to help establish cleaner boundaries about use of private and public APIs. Effectively the codebase is already organized like this, so I'm not sure concretely what we would want to do around this. *TOPIC THREE:* I think we should start developing a "deferred pandas API" that is designed and directly developed by the pandas developer community. 
>From our respective experiences creating expression DSLs and other computation frameworks on top of pandas, I believe this is something where we can build something reasonable and useful. As one concrete problem this would help with: addressing some of the awkwardness around complex groupby-aggregate expressions (custom aggregations would simply be named expressions). The idea of the deferred expression API would be similar to dplyr in R: * "True" schemas (we'll have to work around pandas 0.x warts with implicit casts, etc.) * Immutable data structures / no mutation outside "amend" operations that change values by returning new objects * Less index-related stuff in this API (perhaps this is controversial, we shall see) We can create an in-memory backend for "pandas expressions" on pandas 0.x/1.0 and separately create an alternative backend using libpandas (once that is more fully baked / functional) -- this will also help provide a forcing function for implementing analytics that are required for implementing the backend. Distributed execution for us is almost certainly out of scope, and even if so we would probably want to offload onto prior art in Dask or elsewhere. So if the dask.dataframe API and the pandas expression API look different in ways that are unpleasant, we could either compile from pandas -> dask under the hood, or make API changes to make the semantics more conforming. When libpandas / pandas 2.0 is more mature we can consider building stronger out-of-core execution (plenty of prior art we can learn from here, e.g. SFrame). As far as tools to implement the deferred expression API -- I will leave this to discussion. I spent a considerable amount of time making a pandas-like expression API for SQL in Ibis (see https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's some ideas there (like separating the "internal" AST from the "external" user expressions) that we can learn from, or fork or use some of that expression code in some way. I don't have a strong opinion as long as the expressions are as strongly-typed as possible (i.e. tables have schemas, operations have checked input and output types) and catch user errors as soon as feasible. This ended up being more text than I planned. If we want to discuss these things independently, feel free to send a reply with an altered subject line. Looking forward to see what everyone thinks. Thanks! Wes -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed May 3 19:42:34 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 3 May 2017 16:42:34 -0700 Subject: [Pandas-dev] Developing libpandas as a separate codebase + Componentization + Deferred "pandas expressions" In-Reply-To: References: Message-ID: On Wed, May 3, 2017 at 4:09 PM, Wes McKinney wrote: > *TOPIC ONE:* I have been thinking about how to proceed with pandas 2.0 > development in a sane way with the following goals: > > ... > > Changing the internals of Series and DataFrame is going to be a difficult > process (and frankly, it would be easier to build a brand new project, but > I am not going to advocate for that). But I think one way we can make > things easier is by developing "libpandas" and its Python bindings as a > separate codebase. > I'm strongly supportive of a separate "libpandas", but do consider going further and making "pandas2" a separate thing. 
If users have to switch from "import pandas" to "import pandas2", it would give us the freedom to do some important API clean-up/simplification (e.g., for indexing and other pandas methods that don't have well defined type signatures). Also, we will have the option to leave old stuff behind rather immediately porting everything to pandas2 with complete backport support, which is rather ambitious. > *TOPIC THREE:* I think we should start developing a "deferred pandas API" > that is designed and directly developed by the pandas developer community. > ... > > * "True" schemas (we'll have to work around pandas 0.x warts with implicit > casts, etc.) > > * Immutable data structures / no mutation outside "amend" operations that > change values by returning new objects > > * Less index-related stuff in this API (perhaps this is controversial, we > shall see) > This all sounds fantastic, but could you clarify a little bit what you mean by schemas? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.w.augspurger at gmail.com Fri May 5 16:20:09 2017 From: tom.w.augspurger at gmail.com (Tom Augspurger) Date: Fri, 5 May 2017 15:20:09 -0500 Subject: [Pandas-dev] ANN: pandas v0.20.1 released Message-ID: Hi all, I'm happy to announce that pandas 0.20.0 and 0.20.1 have been released. Pandas 0.20.1 contains a single additional change from 0.20.0 for backwards compatibility with projects using pandas' utils methods. See https://github.com/pandas-dev/pandas/pull/16250. The full release notes for 0.20.0 are below. This is a major release from 0.19.2 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. See the Whatsnew file for more information: http://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html We recommend that all users upgrade to this version. This release includes 897 commits over 5 months of development by 204 contributors. A big thank you to all contributors! Tom --- *## What is it:* pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with ?relational? or ?labeled? data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. 
*## Highlights of the 0.20.0 release include:* - new .agg() API for Series/DataFrame similar to the groupby-rolling-resample API's, see here - Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here - The .ix indexer has been deprecated, see here - Panel has been deprecated, see here - Addition of an IntervalIndex and Interval scalar type, see here - Improved user API when accessing levels in .groupby(), see here - Improved support for UInt64 dtypes, see here - A new orient for JSON serialization, orient='table', that uses the Table Schema spec, see here - Experimental support for exporting DataFrame.style formats to Excel, see here - Window Binary Corr/Cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here - Support for S3 handling now uses s3fs, see here - Google BigQuery support now uses the pandas-gbq library, see here - Switched the test framework to use pytest *## How to get it:* Source tarballs and Windows / Mac / Linux wheels are available on PyPI (thanks to Christoph Gohlke for the windows wheels, and to Matthew Brett for setting up the Mac / Linux wheels) pip install --upgrade pip setuptools pip install --upgrade --upgrade-strategy=only-if-needed pandas Conda packages currently building, and will be available via the conda-forge channel (conda install pandas -c conda-forge). It will be available on the default channel soon. conda install -c conda-forge pandas *## Issues:* Please report any issues on our issue tracker: https://github.com/pydata/ pandas/issues/ *## Thanks to all the contributors:* - Adam J. Stewart - Adrian - Ajay Saxena - Akash Tandon - Albert Villanova del Moral - Aleksey Bilogur - Alexis Mignon - Amol Kahat - Andreas Winkler - Andrew Kittredge - Anthonios Partheniou - Arco Bast - Ashish Singal - Baurzhan Muftakhidinov - Ben Kandel - Ben Thayer - Ben Welsh - Bill Chambers - Brandon M. Burroughs - Brian - Brian McFee - Carlos Souza - Chris - Chris Ham - Chris Warth - Christoph Gohlke - Christoph Paulik - Christopher C. Aycock - Clemens Brunner - D.S. McNeil - DaanVanHauwermeiren - Daniel Himmelstein - Dave Willmer - David Cook - David Gwynne - David Hoffman - David Krych - Diego Fernandez - Dimitris Spathis - Dmitry L - Dody Suria Wijaya - Dominik Stanczak - Dr-Irv - Dr. Irv - Elliott Sales de Andrade - Ennemoser Christoph - Francesc Alted - Fumito Hamamura - Giacomo Ferroni - Graham R. Jeffries - Greg Williams - Guilherme Beltramini - Guilherme Samora - Hao Wu - Harshit Patni - Ilya V. Schurov - Iv?n Vall?s P?rez - Jackie Leng - Jaehoon Hwang - James Draper - James Goppert - James McBride - James Santucci - Jan Schulz - Jeff Carey - Jeff Reback - JennaVergeynst - Jim - Jim Crist - Joe Jevnik - Joel Nothman - John - John Tucker - John W. O'Brien - John Zwinck - Jon M. Mease - Jon Mease - Jonathan Whitmore - Jonathan de Bruin - Joost Kranendonk - Joris Van den Bossche - Joshua Bradt - Julian Santander - Julien Marrec - Jun Kim - Justin Solinsky - Kacawi - Kamal Kamalaldin - Kerby Shedden - Kernc - Keshav Ramaswamy - Kevin Sheppard - Kyle Kelley - Larry Ren - Leon Yin - Line Pedersen - Lorenzo Cestaro - Luca Scarabello - Lukasz - Mahmoud Lababidi - Mark Mandel - Matt Roeschke - Matthew Brett - Matthew Roeschke - Matti Picus - Maximilian Roos - Michael Charlton - Michael Felt - Michael Lamparski - Michiel Stock - Mikolaj Chwalisz - Min RK - Miroslav ?ediv? 
- Mykola Golubyev - Nate Yoder - Nathalie Rud - Nicholas Ver Halen - Nick Chmura - Nolan Nichols - Pankaj Pandey - Pawel Kordek - Pete Huang - Peter - Peter Csizsek - Petio Petrov - Phil Ruffwind - Pietro Battiston - Piotr Chromiec - Prasanjit Prakash - Rob Forgione - Robert Bradshaw - Robin - Rodolfo Fernandez - Roger Thomas - Rouz Azari - Sahil Dua - Sam Foo - Sami Salonen - Sarah Bird - Sarma Tangirala - Scott Sanderson - Sebastian Bank - Sebastian Gsänger - Shawn Heide - Shyam Saladi - Sinhrks - Stephen Rauch - Sébastien de Menten - Tara Adiseshan - Thiago Serafim - Thoralf Gutierrez - Thrasibule - Tobias Gustafsson - Tom Augspurger - Tong SHEN - Tong Shen - TrigonaMinima - Uwe - Wes Turner - Wiktor Tomczak - WillAyd - Yaroslav Halchenko - Yimeng Zhang - abaldenko - adrian-stepien - alexandercbooth - atbd - bastewart - bmagnusson - carlosdanielcsantos - chaimdemulder - chris-b1 - dickreuter - discort - dr-leo - dubourg - dwkenefick - funnycrab - gfyoung - goldenbull - hesham.shabana - jojomdt - linebp - manu - manuels - mattip - maxalbert - mcocdawc - nuffe - paul-mannino - pbreach - sakkemo - scls19fr - sinhrks - stijnvanhoey - the-nose-knows - themrmax - tomrod - tzinckgraf - wandersoncferreira - watercrossing - wcwagner - xgdgsc - yui-knk
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From robinfishbein at yahoo.com Thu May 11 03:07:59 2017
From: robinfishbein at yahoo.com (Robin Fishbein)
Date: Thu, 11 May 2017 07:07:59 +0000 (UTC)
Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes
References: <88358759.8764046.1494486479452.ref@mail.yahoo.com>
Message-ID: <88358759.8764046.1494486479452@mail.yahoo.com>

I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list.

The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression

    pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))

that is handled differently in 0.20. I'm not sure how, though, or what I could do to make this effective in v0.20.

* I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals().
** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime).

Thanks!
-Robin

def delta(left, right, index_cols=None, suffixes=('_1', '_2'),
          reset_indexes=True, value_counts=True):
    df1 = left.copy()
    df2 = right.copy()
    if isinstance(index_cols, pd.Index):
        index_cols = list(index_cols)
    if index_cols:
        df1 = df1.set_index(index_cols)
        df2 = df2.set_index(index_cols)
    full = (pd.merge(df1, df2, left_index=True, right_index=True,
                     how='outer', suffixes=suffixes, indicator=True)
            .astype({'_merge': object}))
    index_both = full[full._merge == 'both'].index
    df1t = df1.reindex(index_both).T
    df2t = df2.reindex(index_both).T
    index_changes = pd.Index(i for i in index_both
                             if not df1t[i].equals(df2t[i]))
    if index_changes.size > 0:
        full.loc[index_changes, '_merge'] = 'c'
    mappings = {
        'both': 'm',        # match
        'right_only': 'a',  # add
        'c': 'c',           # change
        'left_only': 'd',   # delete
    }
    full._merge = full._merge.map(mappings)
    add = df2.reindex(full.loc[full._merge == 'a'].index)
    change = full.loc[full._merge == 'c'].drop('_merge', axis=1)
    delete = df1.reindex(full.loc[full._merge == 'd'].index)
    if reset_indexes:
        full = full.reset_index()
        add = add.reset_index()
        change = change.reset_index()
        delete = delete.reset_index()
    if value_counts:
        print(full._merge.value_counts())
    return full, add, change, delete
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Thu May 11 08:35:39 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 11 May 2017 14:35:39 +0200
Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes
In-Reply-To: <88358759.8764046.1494486479452@mail.yahoo.com>
References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com>
Message-ID:

Hi Robin,

I didn't yet look into your code example (and it might be that it can be optimized in other ways to also prevent the slowdown), but as it seems you have multi-indexes and you mention getitem, this is probably the same issue as reported here: https://github.com/pandas-dev/pandas/issues/16319, and with already a PR open to try to fix it: https://github.com/pandas-dev/pandas/pull/16324

You are always welcome to try out the PR and give feedback on whether that fixed the performance regression for you.

Regards,
Joris

2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev <pandas-dev at python.org>:

> I apologize in advance, I'm not sure where to ask about this, so I thought
> perhaps the dev list.
>
> The function below returns a tuple of dataframes to identify the rows
> added, changed, and deleted between two dataframes.* Adds and deletes
> should be an easy index comparison, so it should spend most of its time
> distinguishing changed rows from unchanged rows. With Python 3.6.0 and
> pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc
> and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow.
> I believe there's something about this expression
>
> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))
>
> that is handled differently in 0.20. I'm not sure how, though, or what I
> could do to make this effective in v0.20.
>
> * I used equals() to address missing values and handle any potential
> dtype; if we guarantee no missing values, the solution is much easier. I
> transposed because I couldn't work out a better way to convert the rows to
> Series, allowing the use of equals().
> ** About 54k rows, 4 index columns (all object), and 8 other columns (4
> int, 3 obj, 1 datetime).
>
> Thanks!
> -Robin > > def delta(left, right, index_cols=None, suffixes=('_1', '_2'), > reset_indexes=True, value_counts=True): > df1 = left.copy() > df2 = right.copy() > if isinstance(index_cols, pd.Index): > index_cols = list(index_cols) > if index_cols: > df1 = df1.set_index(index_cols) > df2 = df2.set_index(index_cols) > full = (pd.merge(df1, df2, left_index=True, right_index=True, > how='outer', suffixes=suffixes, indicator=True) > .astype({'_merge': object})) > index_both = full[full._merge == 'both'].index > df1t = df1.reindex(index_both).T > df2t = df2.reindex(index_both).T > * index_changes = pd.Index(i for i in index_both* > * if not df1t[i].equals(df2t[i]))* > if index_changes.size > 0: > full.loc[index_changes, '_merge'] = 'c' > mappings = { > 'both': 'm', # match > 'right_only': 'a', # add > 'c': 'c', # change > 'left_only': 'd', # delete > } > full._merge = full._merge.map(mappings) > add = df2.reindex(full.loc[full._merge == 'a'].index) > change = full.loc[full._merge == 'c'].drop('_merge', axis=1) > delete = df1.reindex(full.loc[full._merge == 'd'].index) > if reset_indexes: > full = full.reset_index() > add = add.reset_index() > change = change.reset_index() > delete = delete.reset_index() > if value_counts: > print(full._merge.value_counts()) > return full, add, change, delete > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robinfishbein at yahoo.com Thu May 11 10:41:08 2017 From: robinfishbein at yahoo.com (Robin Fishbein) Date: Thu, 11 May 2017 09:41:08 -0500 Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes In-Reply-To: References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com> Message-ID: Thanks, Joris! I'll take a look at the PR. Turns out optimizing with something to the effect of ((pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)).all(axis=1) avoids this issue and runs the test files in 0.9 seconds with either v0.19 or v0.20. -Robin Sent from my iPhone > On May 11, 2017, at 7:35 AM, Joris Van den Bossche wrote: > > Hi Robin, > > I didn't yet look into your code example (and it might be that it can be optimized in other ways to also prevent the slowdown), but as it seems you have multi-indexes and you mention getitem, this is probably the same issue as reported here: https://github.com/pandas-dev/pandas/issues/16319, and with already a PR open to try to fix it: https://github.com/pandas-dev/pandas/pull/16324 > > You are always welcome to try out the PR and give feedback on whether that fixed the performance regression for you. > > Regards, > Joris > > 2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev : >> I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list. >> >> The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression? >> >> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i])) >> >> ?that is handled differently in 0.20. 
I'm not sure how, though, or what I could do to make this effective in v0.20. >> >> * I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals(). >> ** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime). >> >> Thanks! >> -Robin >> >> def delta(left, right, index_cols=None, suffixes=('_1', '_2'), >> reset_indexes=True, value_counts=True): >> df1 = left.copy() >> df2 = right.copy() >> if isinstance(index_cols, pd.Index): >> index_cols = list(index_cols) >> if index_cols: >> df1 = df1.set_index(index_cols) >> df2 = df2.set_index(index_cols) >> full = (pd.merge(df1, df2, left_index=True, right_index=True, >> how='outer', suffixes=suffixes, indicator=True) >> .astype({'_merge': object})) >> index_both = full[full._merge == 'both'].index >> df1t = df1.reindex(index_both).T >> df2t = df2.reindex(index_both).T >> index_changes = pd.Index(i for i in index_both >> if not df1t[i].equals(df2t[i])) >> if index_changes.size > 0: >> full.loc[index_changes, '_merge'] = 'c' >> mappings = { >> 'both': 'm', # match >> 'right_only': 'a', # add >> 'c': 'c', # change >> 'left_only': 'd', # delete >> } >> full._merge = full._merge.map(mappings) >> add = df2.reindex(full.loc[full._merge == 'a'].index) >> change = full.loc[full._merge == 'c'].drop('_merge', axis=1) >> delete = df1.reindex(full.loc[full._merge == 'd'].index) >> if reset_indexes: >> full = full.reset_index() >> add = add.reset_index() >> change = change.reset_index() >> delete = delete.reset_index() >> if value_counts: >> print(full._merge.value_counts()) >> return full, add, change, delete >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu May 11 10:50:32 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 11 May 2017 16:50:32 +0200 Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes In-Reply-To: References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com> Message-ID: Good to hear. That is indeed often that case, that a bunch of indexing operations can be easily avoided with a smarter implementation. I think (but did not try out) that you can even further simplyfy to df1.equals(df2).all(axis=1) as equals should take care of those NaNs in the same places. Joris 2017-05-11 16:41 GMT+02:00 Robin Fishbein : > Thanks, Joris! I'll take a look at the PR. Turns out optimizing with > something to the effect of > ((pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)).all(axis=1) > avoids this issue and runs the test files in 0.9 seconds with either v0.19 > or v0.20. 
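(Illustrative sketch, not part of the original messages: the quoted one-liner marks *unchanged* rows, so the changed labels are just its negation. Assuming df1 and df2 are already aligned to the same index and columns, the per-row equals() loop in delta() could be replaced by something along these lines.)

import numpy as np
import pandas as pd

def changed_labels(df1, df2):
    # True where the two frames agree cell by cell, counting NaN == NaN as equal
    same = (pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)
    # a row counts as "changed" as soon as any cell disagrees
    return df1.index[~same.all(axis=1)]

# hypothetical data just to show the shape of the result
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, 'x', 'y']}, index=['r1', 'r2', 'r3'])
df2 = pd.DataFrame({'a': [1, 2, 4], 'b': [np.nan, 'x', 'y']}, index=['r1', 'r2', 'r3'])
print(changed_labels(df1, df2))   # only 'r3' differs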
> > -Robin > > Sent from my iPhone > > On May 11, 2017, at 7:35 AM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > > Hi Robin, > > I didn't yet look into your code example (and it might be that it can be > optimized in other ways to also prevent the slowdown), but as it seems you > have multi-indexes and you mention getitem, this is probably the same issue > as reported here: https://github.com/pandas-dev/pandas/issues/16319, and > with already a PR open to try to fix it: https://github.com/pandas-dev/ > pandas/pull/16324 > > You are always welcome to try out the PR and give feedback on whether that > fixed the performance regression for you. > > Regards, > Joris > > 2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev < > pandas-dev at python.org>: > >> I apologize in advance, I'm not sure where to ask about this, so I >> thought perhaps the dev list. >> >> The function below returns a tuple of dataframes to identify the rows >> added, changed, and deleted between two dataframes.* Adds and deletes >> should be an easy index comparison, so it should spend most of its time >> distinguishing changed rows from unchanged rows. With Python 3.6.0 and >> pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc >> and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. >> I believe there's something about this expression? >> >> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i])) >> >> ?that is handled differently in 0.20. I'm not sure how, though, or what I >> could do to make this effective in v0.20. >> >> * I used equals() to address missing values and handle any potential >> dtype; if we guarantee no missing values, the solution is much easier. I >> transposed because I couldn't work out a better way to convert the rows to >> Series, allowing the use of equals(). >> ** About 54k rows, 4 index columns (all object), and 8 other columns (4 >> int, 3 obj, 1 datetime). >> >> Thanks! 
>> -Robin >> >> def delta(left, right, index_cols=None, suffixes=('_1', '_2'), >> reset_indexes=True, value_counts=True): >> df1 = left.copy() >> df2 = right.copy() >> if isinstance(index_cols, pd.Index): >> index_cols = list(index_cols) >> if index_cols: >> df1 = df1.set_index(index_cols) >> df2 = df2.set_index(index_cols) >> full = (pd.merge(df1, df2, left_index=True, right_index=True, >> how='outer', suffixes=suffixes, indicator=True) >> .astype({'_merge': object})) >> index_both = full[full._merge == 'both'].index >> df1t = df1.reindex(index_both).T >> df2t = df2.reindex(index_both).T >> * index_changes = pd.Index(i for i in index_both* >> * if not df1t[i].equals(df2t[i]))* >> if index_changes.size > 0: >> full.loc[index_changes, '_merge'] = 'c' >> mappings = { >> 'both': 'm', # match >> 'right_only': 'a', # add >> 'c': 'c', # change >> 'left_only': 'd', # delete >> } >> full._merge = full._merge.map(mappings) >> add = df2.reindex(full.loc[full._merge == 'a'].index) >> change = full.loc[full._merge == 'c'].drop('_merge', axis=1) >> delete = df1.reindex(full.loc[full._merge == 'd'].index) >> if reset_indexes: >> full = full.reset_index() >> add = add.reset_index() >> change = change.reset_index() >> delete = delete.reset_index() >> if value_counts: >> print(full._merge.value_counts()) >> return full, add, change, delete >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 12 18:51:35 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 13 May 2017 00:51:35 +0200 Subject: [Pandas-dev] Componentization Message-ID: 2017-05-04 1:09 GMT+02:00 Wes McKinney : > > *TOPIC TWO:* We discussed this on the last dev meeting call, but I wanted > to see what others think and if there's some action items. To help with > more frequent pandas releases, particularly of subcomponents which are pure > Python, I wonder if we could move toward a release model of "pandas" as a > metapackage for a series of subcomponents which are packaged independently. > As an example > > pandas depends on > pandas_display (Display for humans) > pandas_io > pandas_plotting > pandas_core > > and so forth. I think it would be better to go with a single codebase for > this; I don't have a strong opinion about having separate release cycles, > it's more to help establish cleaner boundaries about use of private and > public APIs. Effectively the codebase is already organized like this, so > I'm not sure concretely what we would want to do around this. > > Would your idea be to have those as separate python packages, but in a single (github) repo? If we want to go in this direction, I would first like to see some more detailed practicalities, to be able to better evaluate if this would not needlessly make the contributing cycle more complex. Eg what would separate release cycles in practice mean? Would the subpackages then need to support multiple versions of the base package (at least master + latest stable, otherwise there is not really a benefit in a separate release cycle)? That would also mean more testing builds? How would it work with Travis with multiple packages in a single repo? (can it run separate tests depending on the PR?) 
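(Illustrative sketch, not part of the original messages: under the metapackage model being discussed, the top-level "pandas" distribution would be little more than a list of dependencies. The subpackage names are the hypothetical ones from Wes's mail; none of them exist today.)

# setup.py for a hypothetical "pandas" metapackage
from setuptools import setup

setup(
    name='pandas',
    version='2.0.0.dev0',
    description='Metapackage that pulls in the pandas subcomponents',
    install_requires=[
        'pandas_core',      # Series/DataFrame guts and compiled code (libpandas bindings)
        'pandas_io',        # readers and writers
        'pandas_plotting',  # plotting layer
        'pandas_display',   # "Display for humans"
    ],
    packages=[],            # the metapackage itself ships no code
)

(Separate release cycles would then mostly be a question of how the pins in install_requires are managed: exact pins for a locked release train, lower bounds for looser coupling.)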
Trying to establish cleaner boundaries between the private and public API's is good goal, but in principle we could also try to do this in the current directory layout (eg remove usage of private API's in pandas.plotting, pandas.io, ..). I am certainly not against a potential reorganisation on this front, but I would like to see some more advantages in doing so (as I am not familiar with such workflows from other project). Regards, Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 12 19:07:45 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 13 May 2017 01:07:45 +0200 Subject: [Pandas-dev] Dev meeting doodle Message-ID: The last dev meeting has been more than a month ago, so maybe time to at least try to find a date for the next one. Therefore, a doodle! https://doodle.com/poll/dp6s7r3q2u6wcyup Fixed the doodle at 2pm EST / 8pm GMT / 11am PST, like the last meeting (if we want to try another time, eg like two meetings ago, please let it know) Enough content to discuss from Wes' mail and the 0.21 / 1.0 topics listed in the notes from last meeting (we should probably pick some in advance to discuss). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 30 09:33:31 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 30 May 2017 15:33:31 +0200 Subject: [Pandas-dev] Dev meeting doodle In-Reply-To: References: Message-ID: A bit late to notify this list, but the doodle decided on a meeting today. You are certainly still welcome to attend! Meeting at 2pm EST / 6pm GMT today Video Link: https://appear.in/pandas-dev Doc Link: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dO kVJLY-licoBmBU/edit# 2017-05-13 1:07 GMT+02:00 Joris Van den Bossche < jorisvandenbossche at gmail.com>: > The last dev meeting has been more than a month ago, so maybe time to at > least try to find a date for the next one. Therefore, a doodle! > > https://doodle.com/poll/dp6s7r3q2u6wcyup > > Fixed the doodle at 2pm EST / 8pm GMT / 11am PST, like the last meeting > (if we want to try another time, eg like two meetings ago, please let it > know) > > Enough content to discuss from Wes' mail and the 0.21 / 1.0 topics listed > in the notes > > from last meeting (we should probably pick some in advance to discuss). > > Joris > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Tue May 30 17:19:08 2017 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 30 May 2017 21:19:08 +0000 Subject: [Pandas-dev] Pandas Deferred Expressions Message-ID: Hi all, I'd like to fork part of the thread from Wes's original email about the future of pandas and discuss all things deferred expressions. To start, here's Wes's original thoughts, and a response from Chris Bartak that was in a different thread. After I send this email I'm going to follow up with my own thoughts in a different email so I can address any specific concerns as well as offer up a list of advantages and disadvantages to this approach and lessons learned about building DSLs in Python. *Wes's post:* *TOPIC THREE:* I think we should start developing a "deferred pandas API" that is designed and directly developed by the pandas developer community. 
>From our respective experiences creating expression DSLs and other computation frameworks on top of pandas, I believe this is something where we can build something reasonable and useful. As one concrete problem this would help with: addressing some of the awkwardness around complex groupby-aggregate expressions (custom aggregations would simply be named expressions). The idea of the deferred expression API would be similar to dplyr in R: * "True" schemas (we'll have to work around pandas 0.x warts with implicit casts, etc.) * Immutable data structures / no mutation outside "amend" operations that change values by returning new objects * Less index-related stuff in this API (perhaps this is controversial, we shall see) We can create an in-memory backend for "pandas expressions" on pandas 0.x/1.0 and separately create an alternative backend using libpandas (once that is more fully baked / functional) -- this will also help provide a forcing function for implementing analytics that are required for implementing the backend. Distributed execution for us is almost certainly out of scope, and even if so we would probably want to offload onto prior art in Dask or elsewhere. So if the dask.dataframe API and the pandas expression API look different in ways that are unpleasant, we could either compile from pandas -> dask under the hood, or make API changes to make the semantics more conforming. When libpandas / pandas 2.0 is more mature we can consider building stronger out-of-core execution (plenty of prior art we can learn from here, e.g. SFrame). As far as tools to implement the deferred expression API -- I will leave this to discussion. I spent a considerable amount of time making a pandas-like expression API for SQL in Ibis (see https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's some ideas there (like separating the "internal" AST from the "external" user expressions) that we can learn from, or fork or use some of that expression code in some way. I don't have a strong opinion as long as the expressions are as strongly-typed as possible (i.e. tables have schemas, operations have checked input and output types) and catch user errors as soon as feasible. *Chris B's response:* Deferred API Mixed thoughts about this. On the one hand, it's obviously a good thing, enables smarter execution, typing/schemas could result in much easier/safer to write code, etc. On the other hand, the pandas API is already massive and reasonably difficult to master, and it's a big ask to learn a new one. Dask is a good example of how NOT having a new API can be very valuable. All this to say I think adoption might be pretty low? Could be my own biases - coming from a "smallish data" user of pandas, I've never found the "write once, execute on different backends" argument especially compelling because I've never had the need. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Tue May 30 18:28:14 2017 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 30 May 2017 22:28:14 +0000 Subject: [Pandas-dev] Pandas Deferred Expressions In-Reply-To: References: Message-ID: On Tue, May 30, 2017 at 5:19 PM Phillip Cloud wrote: Hi all, > > I'd like to fork part of the thread from Wes's original email about the > future of pandas and discuss all things deferred expressions. To start, > here's Wes's original thoughts, and a response from Chris Bartak that was > in a different thread. 
After I send this email I'm going to follow up with > my own thoughts in a different email so I can address any specific concerns > as well as offer up a list of advantages and disadvantages to this approach > and lessons learned about building DSLs in Python. > > *Wes's post:* > > *TOPIC THREE:* I think we should start developing a "deferred pandas API" > that is designed and directly developed by the pandas developer community. > From our respective experiences creating expression DSLs and other > computation frameworks on top of pandas, I believe this is something where > we can build something reasonable and useful. As one concrete problem this > would help with: addressing some of the awkwardness around complex > groupby-aggregate expressions (custom aggregations would simply be named > expressions). > > The idea of the deferred expression API would be similar to dplyr in R: > > * "True" schemas (we'll have to work around pandas 0.x warts with implicit > casts, etc.) > > * Immutable data structures / no mutation outside "amend" operations that > change values by returning new objects > > * Less index-related stuff in this API (perhaps this is controversial, we > shall see) > > We can create an in-memory backend for "pandas expressions" on pandas > 0.x/1.0 and separately create an alternative backend using libpandas (once > that is more fully baked / functional) -- this will also help provide a > forcing function for implementing analytics that are required for > implementing the backend. > > Distributed execution for us is almost certainly out of scope, and even if > so we would probably want to offload onto prior art in Dask or elsewhere. > So if the dask.dataframe API and the pandas expression API look different > in ways that are unpleasant, we could either compile from pandas -> dask > under the hood, or make API changes to make the semantics more conforming. > > When libpandas / pandas 2.0 is more mature we can consider building > stronger out-of-core execution (plenty of prior art we can learn from here, > e.g. SFrame). > > As far as tools to implement the deferred expression API -- I will leave > this to discussion. I spent a considerable amount of time making a > pandas-like expression API for SQL in Ibis (see > https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at > Cloudera, so there's some ideas there (like separating the "internal" AST > from the "external" user expressions) that we can learn from, or fork or > use some of that expression code in some way. I don't have a strong > opinion as long as the expressions are as strongly-typed as possible > (i.e. tables have schemas, operations have checked input and output types) > and catch user errors as soon as feasible. > > *Chris B's response:* > > Deferred API > > Mixed thoughts about this. On the one hand, it's obviously a good thing, > enables smarter execution, typing/schemas could result in much easier/safer > to write code, etc. > > On the other hand, the pandas API is already massive and reasonably > difficult to master, and it's a big ask to learn a new one. Dask is a good > example of how NOT having a new API can be very valuable. All this to say > I think adoption might be pretty low? Could be my own biases - coming from > a "smallish data" user of pandas, I've never found the "write once, execute > on different backends" argument especially compelling because I've never > had the need. > I agree with the underlying sentiment in Chris?s post. 
If we are going to build something new, there needs to be very compelling reasons to switch so that there's some offset to the switching costs.

Benefits I see from using expressions that individual users may find convincing:

   1. Code correctness guarantees and API clarity using schemas and types.
      1. Operations fail very early and tab completion shows you exactly what operations are valid on a particular object.
   2. Optimizations through expression rewriting (column pruning, predicate pushdown).
      1. We don't need to read every column to select just one. Last time I checked nearly all of our IO APIs require reading in all columns to do an operation on just a few.
   3. Somewhat ironically, a much smaller API to learn.
      1. No indexes, extremely complex slicing or functions that have many different ways to do the same thing (like our old friend replace).

Reasons that I think individual users will not find convincing:

   1. The ability to run on multiple backends. Many people do not have this problem. I suspect the majority of pandas users do *not* have this problem. We shouldn't try to convince our users that this is why they should switch, nor should we prioritize this aspect of pandas2.

Potential pitfalls to adoption with using expressions to build pandas2:

   1. Too dissimilar from current pandas.
   2. Development getting bogged down in lowest common denominator problems (i.e., requiring that every backend implement every operation) resulting in an extremely limited API.
   3. More abstract execution model, and therefore more difficult to understand and debug errors.

I personally think we should do the following:

   1. Draft a list of "must-have" operations on DataFrames
   2. Use ibis as a base for building experimental pandas deferred expressions.
   3. Forget about supporting "all the backends" and focus on SQL and pandas. Make sure that most of our users don't have to care about this aspect of pandas. The fact that operations are delayed should be almost invisible unless desired. For example, even though we are delaying operations internally, the result should appear to be eagerly evaluated. The model would be: "write once, execute on pandas only by default, nearly invisible to the user"
   4. Go deep on pandas expressions and add non SQL compatible ones if necessary to preserve as much of the spec'd-out API that we can.
   5. Try not to break backwards compatibility with SQL backends, but don't require it if it's needed for pandas2. Alternatively, we build the pandas backend on top of ibis instead of inside so that we have even more freedom.

I've got a patch up that implements some of the pandas API in ibis here, if anyone would like to follow along.

-Phillip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mrocklin at gmail.com Tue May 30 18:51:35 2017
From: mrocklin at gmail.com (Matthew Rocklin)
Date: Tue, 30 May 2017 18:51:35 -0400
Subject: [Pandas-dev] Pandas Deferred Expressions
In-Reply-To: References: Message-ID:

*(My apologies for chiming in here without intending to do any of the actual work.)*

I wonder if there is a half-solution where a small subset of operations are lazy much in the same way that the current groupby operations are lazy in Pandas 0.x. If this laziness were extended to a small set of mostly linear operations (element-wise, filters, aggregations, column projections, groupbys) then that might hit a few of the bigger optimizations that people care about without going down the full lazy-relational-algebra-in-python path.
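(Illustrative sketch, not part of the original message: one way to picture the half-lazy wrapper just described is a thin object that queues a whitelisted set of operations -- column projection, row filters, groupby aggregation -- and only touches the real DataFrame when the result is needed. Every name below is invented for the example; it is not a proposal for an actual pandas API.)

import pandas as pd


class LazyFrame:
    """Queues a small set of deferred operations and runs them against the
    wrapped DataFrame only when the result is actually needed."""

    def __init__(self, df, ops=None):
        self._df = df
        self._ops = list(ops or [])          # queued (kind, payload) steps

    def _queue(self, kind, payload):
        return LazyFrame(self._df, self._ops + [(kind, payload)])

    # --- the small deferred subset --------------------------------------
    def select(self, columns):               # column projection
        return self._queue('select', columns)

    def filter(self, predicate):             # predicate: DataFrame -> bool Series
        return self._queue('filter', predicate)

    def groupby_agg(self, keys, aggs):       # e.g. ('city', {'price': 'mean'})
        return self._queue('groupby_agg', (keys, aggs))

    # --- materialization -------------------------------------------------
    def collect(self):
        out = self._df
        for kind, payload in self._ops:
            if kind == 'select':
                out = out[payload]
            elif kind == 'filter':
                out = out[payload(out)]
            elif kind == 'groupby_agg':
                keys, aggs = payload
                out = out.groupby(keys).agg(aggs)
        return out

    def __getattr__(self, name):
        # anything outside the deferred subset collapses to a concrete DataFrame
        return getattr(self.collect(), name)


df = pd.DataFrame({'city': ['NY', 'NY', 'SF'], 'price': [1.0, 3.0, 5.0]})
expr = (LazyFrame(df)
        .filter(lambda d: d['price'] > 1)
        .groupby_agg('city', {'price': 'mean'}))
print(expr.collect())   # nothing executed until this line

(The point of the sketch is only that the deferral can stay invisible: the user writes what looks like eager pandas, while the queued steps leave room for column pruning or predicate pushdown before anything runs.)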
Once you do an operation that is not one of these, we collapse the lazy dataframe and replace it with a concrete one. Slowing extending a small set of operations may also be doable in an incremental fashion as needed, which might be an easier transition for a community of users. Of course, half-measures can also cause more maintenance costs long term and may lack optimizations that Pandas devs find valuable. I'm unqualified to judge the merits of any of these solutions, just thought I'd bring this up. Feel free to ignore. On Tue, May 30, 2017 at 6:28 PM, Phillip Cloud wrote: > On Tue, May 30, 2017 at 5:19 PM Phillip Cloud wrote: > > Hi all, >> >> I'd like to fork part of the thread from Wes's original email about the >> future of pandas and discuss all things deferred expressions. To start, >> here's Wes's original thoughts, and a response from Chris Bartak that was >> in a different thread. After I send this email I'm going to follow up with >> my own thoughts in a different email so I can address any specific concerns >> as well as offer up a list of advantages and disadvantages to this approach >> and lessons learned about building DSLs in Python. >> >> *Wes's post:* >> >> *TOPIC THREE:* I think we should start developing a "deferred pandas >> API" that is designed and directly developed by the pandas developer >> community. From our respective experiences creating expression DSLs and >> other computation frameworks on top of pandas, I believe this is something >> where we can build something reasonable and useful. As one concrete problem >> this would help with: addressing some of the awkwardness around complex >> groupby-aggregate expressions (custom aggregations would simply be named >> expressions). >> >> The idea of the deferred expression API would be similar to dplyr in R: >> > >> * "True" schemas (we'll have to work around pandas 0.x warts with >> implicit casts, etc.) >> >> * Immutable data structures / no mutation outside "amend" operations that >> change values by returning new objects >> >> * Less index-related stuff in this API (perhaps this is controversial, we >> shall see) >> >> We can create an in-memory backend for "pandas expressions" on pandas >> 0.x/1.0 and separately create an alternative backend using libpandas (once >> that is more fully baked / functional) -- this will also help provide a >> forcing function for implementing analytics that are required for >> implementing the backend. >> >> Distributed execution for us is almost certainly out of scope, and even >> if so we would probably want to offload onto prior art in Dask or >> elsewhere. So if the dask.dataframe API and the pandas expression API >> look different in ways that are unpleasant, we could either compile from >> pandas -> dask under the hood, or make API changes to make the semantics >> more conforming. >> >> When libpandas / pandas 2.0 is more mature we can consider building >> stronger out-of-core execution (plenty of prior art we can learn from here, >> e.g. SFrame). >> >> As far as tools to implement the deferred expression API -- I will leave >> this to discussion. I spent a considerable amount of time making a >> pandas-like expression API for SQL in Ibis (see https://github.com/ >> cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's >> some ideas there (like separating the "internal" AST from the "external" >> user expressions) that we can learn from, or fork or use some of that >> expression code in some way. 
I don't have a strong opinion as long as the >> expressions are as strongly-typed as possible (i.e. tables have >> schemas, operations have checked input and output types) and catch user >> errors as soon as feasible. >> >> *Chris B's response:* >> >> Deferred API >> >> Mixed thoughts about this. On the one hand, it's obviously a good thing, >> enables smarter execution, typing/schemas could result in much easier/safer >> to write code, etc. >> > >> On the other hand, the pandas API is already massive and reasonably >> difficult to master, and it's a big ask to learn a new one. Dask is a good >> example of how NOT having a new API can be very valuable. All this to say >> I think adoption might be pretty low? Could be my own biases - coming from >> a "smallish data" user of pandas, I've never found the "write once, execute >> on different backends" argument especially compelling because I've never >> had the need. >> > I agree with the underlying sentiment in Chris?s post. If we are going to > build something new, there needs to be very compelling reasons to switch so > that there?s some offset to the switching costs. > Benefits I see from using expressions that individual users may find > convincing: > > 1. Code correctness guarantees and API clarity using schemas and types. > 1. Operations fail very early and tab completion shows you exactly > what operations are valid on a particular object. > 2. Optimizations through expression rewriting (column pruning, > predicate pushdown). > 1. We don?t need to read every column to select just one. Last time > I checked nearly all of our IO APIs require reading in all columns to do an > operation on just a few. > 3. Somewhat ironically, a much smaller API to learn. > 1. No indexes, extremely complex slicing or functions that have > many different ways to do the same thing (like our old friend > replace). > > Reasons that I think individual users will not find convincing: > > 1. The ability to run on multiple backends. Many people do not have > this problem. I suspect the majority of pandas users do *not* have > this problem. We shouldn?t try to convince our users that this is why they > should switch, nor should we prioritize this aspect of pandas2. > > Potential pitfalls to adoption with using expressions to build pandas2: > > 1. Too dissimilar from current pandas. > 2. Development getting bogged down in lowest common denominator > problems (i.e., requiring that every backend implement every operation) > resulting in an extremely limited API. > 3. More abstract execution model, and therefore more difficult to > understand and debug errors. > > I personally think we should do the following: > > 1. Draft a list of ?must-have? operations on DataFrames > 2. Use ibis as a base for building experimental pandas deferred > expressions. > 3. Forget about supporting ?all the backends? and focus on SQL and > pandas. Make sure that most of our users don?t have to care about this > aspect of pandas. The fact that operations are delayed should be almost > invisible unless desired. For example, even though we are delaying > operations internally, the result should appear to be eagerly evaluated. > The model would be: ?write once, execute on pandas only by default, nearly > invisible to the user? > 4. Go deep on pandas expressions and add non SQL compatible ones if > necessary to preserve as much of the spec?d-out API that we can. > 5. Try not to break backwards compatibility with SQL backends, but > don?t require it if it?s needed for pandas2. 
Alternatively, we build the > pandas backend on top of ibis instead of inside so that we have even more > freedom. > > I?ve got a patch up that implements some of the pandas API in ibis here > , if anyone would like to > follow along. > > -Phillip > ? > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From remi.denise at pasteur.fr Mon May 22 05:11:39 2017 From: remi.denise at pasteur.fr (=?utf-8?B?UsOpbWkgIERFTklTRQ==?=) Date: Mon, 22 May 2017 09:11:39 +0000 Subject: [Pandas-dev] Question Message-ID: To whom it may concern, I wanted to say that I really like pandas, it is the best for what I?m doing. For one of my script I had to use rpy2 and that create a matrix but I needed to have this matrix to pandas so I use "import pandas.rpy.common as com" and a message came : FutureWarning: The pandas.rpy module is deprecated and will be removed in a future version. We refer to external packages like rpy2. See here for a guide on how to port your code to rpy2: http://pandas.pydata.org/pandas-docs/stable/r_interface.html import pandas.rpy.common as com So I check how to port my code to rpy2 instead of use it. But when I compare the two methods, it?s not the same. In rpy2 the method return me a numpy array, but your method return me a pandas DataFrame which have the name of the rows and columns as I want. So my question is how can I have exactly the same result as "com.convert_robj" with rpy2 because it?s not the same now and I want that pandas keep this method that does easily what I want and rpy2 doesn?t. Thank you for your answer R?mi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 31 18:05:54 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 1 Jun 2017 00:05:54 +0200 Subject: [Pandas-dev] Question In-Reply-To: References: Message-ID: Can you give an actual small reproducible code example that shows the problem? Otherwise it will be difficult to help. Joris 2017-05-22 11:11 GMT+02:00 R?mi DENISE : > To whom it may concern, > > I wanted to say that I really like pandas, it is the best for what I?m > doing. > > For one of my script I had to use rpy2 and that create a matrix but I > needed to have this matrix to pandas so I use "import pandas.rpy.common as > com" and a message came : > > FutureWarning: The pandas.rpy module is deprecated and will be removed in a future version. We refer to external packages like rpy2. > See here for a guide on how to port your code to rpy2: http://pandas.pydata.org/pandas-docs/stable/r_interface.html > import pandas.rpy.common as com > > So I check how to port my code to rpy2 instead of use it. > > But when I compare the two methods, it?s not the same. In rpy2 the method > return me a numpy array, but your method return me a pandas DataFrame which > have the name of the rows and columns as I want. > > So my question is how can I have exactly the same result as > "com.convert_robj" with rpy2 because it?s not the same now and I want that > pandas keep this method that does easily what I want and rpy2 doesn?t. 
> > Thank you for your answer > > Rémi > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
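(Illustrative sketch, not part of the thread: for the rpy2 question above, something in the spirit of the removed pandas.rpy.common.convert_robj can be written directly against rpy2 and pandas. This assumes an rpy2 2.x-style pandas2ri converter and an R matrix that actually carries dimnames; treat it as a starting point rather than a supported recipe.)

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

def r_matrix_to_dataframe(r_mat):
    # the rpy2 converters hand back a plain 2-D ndarray for an R matrix,
    # so the row and column names are re-attached by hand
    values = pandas2ri.ri2py(r_mat)
    index = list(ro.r['rownames'](r_mat))
    columns = list(ro.r['colnames'](r_mat))
    return pd.DataFrame(values, index=index, columns=columns)

# r_mat = ro.r('matrix(1:6, nrow=2, dimnames=list(c("r1", "r2"), c("a", "b", "c")))')
# r_matrix_to_dataframe(r_mat)   # DataFrame with index ['r1', 'r2'] and columns ['a', 'b', 'c']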