[Pandas-dev] Pandas Sprint Recap

Wed Jul 18 13:30:01 EDT 2018

> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

On this subject -- one of the objectives for the next few years is to
enable dplyr / tidyverse expressions to run atop Arrow-based data
frames (in addition to R's native data frames). While this will
require some type coercion in some cases (for strings, non-numeric
data) the net benefits in terms of SIMD / parallelization /
out-of-core computing should be well worth it. The tidyverse
developers have created much cleaner boundaries between the expression
/ API semantics and the implementation details than we have, and this
has all happened in the last 5 years.

On Wed, Jul 18, 2018 at 1:16 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> On Wed, Jul 18, 2018 at 12:23 AM Pietro Battiston <me at pietrobattiston.it>
> wrote:
>>
>> > The data model I'd like to work with in the future for most use-cases
>> > involving tabular data is something closer to:
>> >
>> > DataFrame:
>> > - data: OrderedDict[str, Array]
>> > - indexes: OrderedDict[str, Index]
>> >
>> > Conveniently, this looks very similar to the data model of Arrow or
>> > R. Optional indexes would provide fast reverse lookup for some subset
>> > of dataframe columns.
>>
>> How is this different (API-wise) from a list of (same length, I assume)
>> Series?
>
>
> It does sound very similar to me -- the DataFrame just provides a nice way
> to do collective operations.
>
>> See above, we have lost N-dimensional arrays (with N=3 or 4) but
>> luckily not the concept of N-features data.
>> I can't even think of retrieving data with pandaSDMX without MultiIndex
>> columns, let alone manipulate it in any way.
>> ...
>>
>> But then, even if we decided that we are not good enough programmers to
>> implement this cleanly, even the solution of storing stuff as single
>> arrays but leaving the API (e.g. .columns as Index) as it is is
>> superior to the DataFrame being a mere collection of (same length)
>> Series.
>
>
> To be entirely clear, I'm only speaking for myself -- not Wes or the entire
> pandas development team. I wasn't even at the sprint!
>
> I certainly find stacking/unstacking useful, but it isn't the only way to
> manipulate multi-dimensional tabular data. I do think R's tidyverse shows an
> alternative viable path. Without having used it extensively, it appears to
> be more consistent and easier to use than pandas.
>
> For multi-dimensional data analysis, these days I generally prefer to use
> xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it
> more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather
> than indexed 2D dataframes. The way that pandas.DataFrame uses an Index for
> both row and column labels makes it in some ways similar to the fixed 2D
> numpy.matrix, which personally I find less useful. It also makes all pandas
> operations more complex to implement than those on index-free "simple"
> dataframes.
>
> These are certainly not mutually exclusive options -- there is room for
> packages that provide all of these data models (simple dataframes like R,
> indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D
> arrays without labels like NumPy). I do hope that one day all of them can
> share the same foundation -- that would have major benefits for the
> ecosystem.
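
For illustration only, the data model quoted above (a DataFrame holding
"data: OrderedDict[str, Array]" plus optional "indexes") could be
sketched roughly as follows. This is a minimal sketch, not pandas, Arrow,
or xarray code: the SimpleFrame class and its set_index, rows_where, and
take methods are hypothetical names, plain Python lists stand in for
arrays, and a dict of row positions stands in for an Index.

```python
from collections import OrderedDict


class SimpleFrame:
    """Hypothetical 'simple' dataframe: named, equal-length columns
    plus optional per-column indexes for reverse lookup."""

    def __init__(self, data):
        # data: mapping of column name -> sequence of values
        self.data = OrderedDict(
            (name, list(values)) for name, values in data.items()
        )
        lengths = {len(col) for col in self.data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # indexes: column name -> {value: [row positions]}
        self.indexes = OrderedDict()

    def set_index(self, name):
        # Build the optional reverse-lookup structure for one column.
        lookup = {}
        for pos, value in enumerate(self.data[name]):
            lookup.setdefault(value, []).append(pos)
        self.indexes[name] = lookup

    def rows_where(self, name, value):
        # Use the index when one exists; otherwise scan the column.
        if name in self.indexes:
            return list(self.indexes[name].get(value, []))
        return [p for p, v in enumerate(self.data[name]) if v == value]

    def take(self, positions):
        # Select rows by position; the result carries no indexes.
        return SimpleFrame(
            OrderedDict(
                (name, [col[p] for p in positions])
                for name, col in self.data.items()
            )
        )


frame = SimpleFrame({"city": ["Oslo", "Rome", "Oslo"], "temp": [3, 18, 5]})
frame.set_index("city")
subset = frame.take(frame.rows_where("city", "Oslo"))
```

In this sketch the index is purely an accelerator: rows_where returns
the same positions with or without it, so row and column labels are not
part of the frame's identity the way they are in pandas.DataFrame.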