[Pandas-dev] Pandas Sprint Recap

Wed Jul 18 03:23:35 EDT 2018

On Tue, 17/07/2018 at 15:28 -0700, Stephan Hoyer wrote:
> On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston
> <me at pietrobattiston.it> wrote:
> > First, because labels/indexes are in my experience the main reason
> > why people come to pandas (another important reason is having
> > multiple dtypes in a single data structure, but numpy structured
> > arrays also do this).
> 
> Certainly the functionality of indexes is valuable (especially for
> some use-cases), but I don't think the particular way we expose them
> is optimal. In my experience, the need to call reset_index() or
> assign directly to .index or .columns is a frequent source of
> annoyance.

I agree if you're referring to the fact that this cannot be done
in-line... but that's not extremely difficult to solve.
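
Part of this can already be written in-line with method chaining. A
minimal sketch, assuming a recent pandas in which set_axis() returns a
new object; the frame and the labels are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"k": ["x", "y", "z"], "v": [1, 2, 3]})

result = (
    df.set_index("k")                        # promote a column to the index
      .set_axis(["value"], axis="columns")   # relabel columns without assigning to .columns
      .rename_axis("key")                    # rename the index in-line
      .reset_index()                         # turn the index back into a column
)
```

The original df is left untouched; every step returns a new object.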

> > Second because supporting a DataFrame with no index would be pretty
> > easy in the current codebase/API (e.g. "index=False").
> > I know it would break some code, but it would be wrong code anyway
> > (that is, code that doesn't decouple indexes from data storage).
> > 
> > Third, because now that the default index is RangeIndex(n) (which a
> > user is free not to rely on in any way), and as long as broken code
> > is fixed (see above), a DataFrame with no index wouldn't really be
> > "simpler". It would mostly amount to deciding whether to show the
> > index or not when printing to screen/doing IO.
> 
> Sure, you *could* fix all this on top of the current pandas data
> model. But it would be quite a challenging effort, and the full
> pandas data model would remain quite complex.
> 
> The current pandas data model looks something like this:
> 
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>  
> The data model I'd like to work with in the future for most use-cases 
> involving tabular data is something closer to:
> 
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
> 
> Conveniently, this looks very similar to the data model of Arrow or
> R. Optional indexes would provide fast reverse lookup for some subset
> of dataframe columns.

How is this different (API-wise) from a list of (same length, I assume)
Series?
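
To make the question concrete, the quoted model could be read roughly
as follows. This is a hypothetical sketch: the class TableSketch and
its methods add_index() and lookup() are invented here for
illustration and are not part of pandas or of any actual proposal.

```python
from collections import OrderedDict

import numpy as np


class TableSketch:
    """Hypothetical container: named 1-d columns plus optional lookup indexes."""

    def __init__(self, data):
        # data: mapping of column name -> sequence, all of the same length
        self.data = OrderedDict((name, np.asarray(col)) for name, col in data.items())
        if len({len(col) for col in self.data.values()}) > 1:
            raise ValueError("all columns must have the same length")
        # indexes: column name -> {value: [row positions]}
        self.indexes = OrderedDict()

    def add_index(self, name):
        # Build a reverse lookup (value -> row positions) for one column.
        lookup = {}
        for pos, value in enumerate(self.data[name].tolist()):
            lookup.setdefault(value, []).append(pos)
        self.indexes[name] = lookup

    def lookup(self, name, value):
        # Use the reverse lookup to select the matching rows of every column.
        positions = self.indexes[name].get(value, [])
        return OrderedDict((col, arr[positions]) for col, arr in self.data.items())


t = TableSketch({"city": ["Rome", "Oslo", "Rome"], "temp": [31.0, 18.5, 29.0]})
t.add_index("city")
rows = t.lookup("city", "Rome")
```

In this reading there are no row labels that operations align on: an
index is only an accelerator for lookups on a column.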

> This pretty obviously could not support everything pandas can do
> today. For example, you couldn't have a hierarchical index for column
> names. But in my experience, you're better off working with "tidy
> data" anyways, as popularized in R's tidyverse.

We have deprecated Panel (and I think it was the right choice) because
you can always tell (and I've often told) users "work with MultiIndex
and stack/unstack, that's efficient and much better and easier to
understand".

... do you instead see all the stack/unstack machinery as just
useless?!
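
For reference, this is the kind of manipulation in question. A minimal
sketch with made-up data; the level names "country" and "measure" are
only illustrative:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["IT", "NO"], ["gdp", "pop"]], names=["country", "measure"]
)
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=pd.Index([2016, 2017], name="year"),
    columns=columns,
)

# Move the "country" level from the columns to the rows ...
long = wide.stack("country")

# ... and move it back to the columns.
back = long.unstack("country")
```

Here long has one row per (year, country) pair and one column per
measure, which is close to the "tidy" layout; back has the two column
levels again, in (measure, country) order.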

Do you have a sense of what the user base (present and potential)
thinks about this?
(Do you think it matters in some way?)

> > But again, I fail to see a new "scope". I don't see an analysis of
> > which (share) of the current pandas problems (=issues) would be
> > solved.
> 
> One way in which we have reduced pandas' scope recently is the
> proposed deprecation of Panel.
> 
> This is an example of focusing pandas on tabular data rather than
> N-dimensional arrays.

See above: we have lost N-dimensional arrays (with N=3 or 4), but
luckily not the concept of data with N features.
I can't even imagine retrieving data with pandaSDMX without MultiIndex
columns, let alone manipulating it in any way.

We are constantly comparing pandas to R, but while so far I have always
implicitly thought we were proud of the agile manipulation abilities
that pandas has and R frames don't, only now I seem to understand we
are envious, for some weird reason, of their lack of features.

I understand we could be envious that they have a cleaner codebase...
but since it's GPL-licensed, we actually have no idea :-D
More seriously, we have talked very little (also in the sprint, from
what I understand) about the possibility of improving/cleaning up the
internals code.

I think we agree that storing data in a BlockManager based on as many
arrays as there are dtypes does not, per se, really qualify as rocket
science.
But then, even if we decided that we are not good enough programmers to
implement this cleanly, even the solution of storing the data as single
arrays while leaving the API as it is (e.g. .columns as an Index) would
be superior to the DataFrame being a mere collection of (same length)
Series.
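
The "one array per dtype" idea can be illustrated with a toy example.
This is a sketch only and not how pandas' BlockManager is actually
implemented; the names by_dtype, placement, storage and get_column are
invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": [3, 4], "d": [2.5, 3.5]})

# Group column names by dtype and remember where each column ends up.
by_dtype = {}    # dtype name -> list of column names
placement = {}   # column name -> (dtype name, position inside that block)
for name, dtype in df.dtypes.items():
    key = str(dtype)
    names = by_dtype.setdefault(key, [])
    placement[name] = (key, len(names))
    names.append(name)

# One 2-d array per dtype, holding all columns of that dtype.
storage = {key: df[names].to_numpy() for key, names in by_dtype.items()}

# The labels stay in an ordinary Index, independent of the storage layout.
columns = df.columns


def get_column(name):
    # Fetch a column's values from whichever block holds it.
    key, pos = placement[name]
    return storage[key][:, pos]
```

Whether the values live in two 2-d blocks or in four 1-d arrays, the
lookup through columns and get_column() looks the same from outside.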

Pietro