[Pandas-dev] Pandas Sprint Recap

Tue Jul 17 05:00:58 EDT 2018

Hi Stephan,

I appreciate that your email focuses on specific API changes (again,
the only reason, in my view, to justify a pandas 2, or even a change of
name)

Il giorno lun, 16/07/2018 alle 18.14 -0700, Stephan Hoyer ha scritto:
> On Mon, Jul 16, 2018 at 7:23 PM Pietro Battiston <me at pietrobattiston.it> wrote:
> [...]

> 2. The indexed pandas.Series and pandas.DataFrame isn't the right
> abstraction for many tasks. A simpler, index free DataFrame would be
> a better data model for many tasks. For tasks that really need axis
> labels, a tool like xarray might be more appropriate.

... but this makes me smile.

First, because labels/indexes are in my experience the main reason why
people come to pandas (another important reason is having multiple
dtypes in a single data structure, but numpy structured arrays also do
this).
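
A minimal sketch of the difference, with made-up field names and
values: a numpy structured array holds several dtypes at once, but rows
can only be reached by position, while a DataFrame built from the same
records can also be addressed by label.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string, one integer and one float field.
records = np.array(
    [("a", 1, 0.5), ("b", 2, 1.5)],
    dtype=[("key", "U1"), ("count", "i8"), ("weight", "f8")],
)

# Structured array: mixed dtypes, but row access is positional only.
second_weight = records["weight"][1]

# DataFrame: same mixed dtypes, plus row labels taken from the "key" field.
frame = pd.DataFrame.from_records(records, index="key")
weight_of_b = frame.loc["b", "weight"]
```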

Second, because supporting a DataFrame with no index would be pretty
easy in the current codebase/API (e.g. "index=False").
I know it would break some code, but it would be wrong code anyway
(that is, code that doesn't decouple indexes from data storage).

Third, because now that the default index is RangeIndex(n) (which a
user is free not to rely on in any way), and as long as broken code is
fixed (see above), a DataFrame with no index wouldn't really be
"simpler". It would mostly amount to deciding whether to show the index
or not when printing to screen/doing IO.
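
As a rough illustration using only existing pandas calls (the column
name and values are invented), a user who never sets an index gets a
RangeIndex and can already leave it out of printed and written output:

```python
import pandas as pd

# Hypothetical frame created without specifying any index.
frame = pd.DataFrame({"x": [10, 20, 30]})

# The default index is a RangeIndex, which the user never has to touch.
has_default_index = isinstance(frame.index, pd.RangeIndex)

# The index can be omitted both when printing and when doing IO.
text_without_index = frame.to_string(index=False)
csv_without_index = frame.to_csv(index=False)
```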

Fourth, because you cite xarray as an alternative... but unless I'm
wrong, labels are now optional in xarray (precisely the path I suggest
we could take).

More generally, in my view, asking users to choose between multiple
dtypes and indexes would set the state of data manipulation in Python
back by several years (and probably leave it behind the state of data
manipulation in R).

> 3. Despite its flaws, pandas is extremely useful, so it has grown a
> large number of features/contributions. It would be difficult to
> reimplement all of these features immediately on top of a new
> implementation.
> 
> Given infinite manpower, all these things could be changed
> incrementally and in a backwards compatible manner on top of current
> pandas. But the result would look very different from the pandas we
> know today. Of course, we are vastly under-resourced, so it will take
> quite a long time to get to a better place. I don't think it would
> serve either users or developers well to make such major changes in
> an incremental way over the course of multiple years.
>  
> For these reasons, I agree with Wes that it wouldn't make sense to
> call the hypothetical Python library he is working towards pandas, at
> least in the sense that you use it by writing "import pandas". At
> best, we should write "import pandas2". Or perhaps, as Wes suggests,
> it would more appropriately be given a new name to indicate its new
> design/scope.

I understand the need to change the import name so as not to break
people's code.

But again, I fail to see a new "scope". I don't see an analysis of
which (share of the) current pandas problems (= issues) would be solved.
Add to this the urge to rewrite the code base (while at the same time
I'm having trouble getting any comments on _how_ to restructure - or
structure, in a potential rewrite - an important part of it¹), and I
can't help but fear that we are trying very hard to reinvent the wheel.

Pietro

¹ https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code