[Pandas-dev] Pandas Sprint Recap

Mon Jul 16 19:17:05 EDT 2018

Hi Pietro,

On Mon, Jul 16, 2018 at 6:35 PM, Pietro Battiston <me at pietrobattiston.it> wrote:
> On Mon, 16/07/2018 at 13.50 -0400, Wes McKinney wrote:
>> > One thing I want to reiterate: it's not going to take another 11 years
>> > to reach pandas 2.0 :) Just because we don't solve indexing for 1.0
>> > doesn't mean we won't ever be able to fix it.
>>
>> One point on this that we discussed some in the sprint and during
>> SciPy: to undertake a major overhaul of pandas, at some point it may
>> require a shift to a "new codebase". This could cohabit the same
>> pandas-dev/pandas git repository which can serve as a monorepo for
>> several Python package artifacts. This would make refactoring to
>> separate out reusable components and code reuse much easier. The test
>> suite could also be refactored to be able to run against
>> "future-pandas" and "pandas" (or whatever we want to call them).
>>
>> I'm skeptical whether the kinds of significant / breaking changes
>> we've discussed the last 3 years can happen in an iterative / organic
>> fashion within the current pandas codebase. I'd like to avoid getting
>> stuck in place for a decade; if we haven't made much progress toward
>> some of these major changes by say beginning of 2020 or 2021 we might
>> want to take a step back and evaluate our situation.
>
> Let me reverse the question: how much progress has/will have pandas 2.0
> codebase made in the meanwhile? :-)
>
> Joking aside, the fact that current pandas progresses slowly is no
> guarantee that pandas 2.0 will progress more quickly.
>
> There is one thing I never understood about pandas 2.0, and I understand it
> even less now that 1.0 gets closer: if you/we have clear plans to rewrite
> the codebase, then why aren't we doing it now? Why are we wasting time
> on the current one?! Why are we releasing a pandas 1.0 with its
> "illusion of maturity"?!
> Rewriting a large project takes effort but can be worth it; _planning_
> a (not so close) future rewrite seems to me just a sort of perversion.

So when you say "if you/we have clear plans to rewrite the codebase,
then why aren't we doing it now", why do you presume that I am not
doing that right now? When I initiated conversations around improved
internals for pandas and projects like pandas in late 2015, I had just
signed on a large group of people to start the Apache Arrow project
and that's been about 90% of where I've invested my time since then.

My idea with Arrow has always been to build a stronger memory management
and computational underpinning for a next-generation pandas-type
library. One of the "original sins" of pandas is that we own the full
stack: data structures, IO, deserialization and serialization,
computation/algorithms, visualization, and front end UI.

We are (more or less) completely on our own.

What I am proposing is to share the burden of developing the low-level
stuff with a vastly larger group of developers, at least 10x as large
as we currently have in pandas. The wheels are already well in motion
for this to happen.

I don't see any way forward without basing the work on top of open standards
developed with a community that extends beyond the walls of Python.
Maybe I'm going about it wrong, but I've invested 3 years of my life
in this at this point, and it's looking like a 7-10 year effort.

This is all to say, if the pandas community doesn't agree with my
approach to this problem, I'm not going to twist anyone's arm. Either
we agree or we go our separate ways; we are all volunteers after all.
Unfortunately there are still factions within the Python data world
that do not collaborate with each other very actively; I'm not sure
what to do about that.

We're all getting a lot older; if it turns out that the Python /
pandas community doesn't want to leap forward in the ways that we've
discussed (where this "leap forward" requires a certain amount of
funding and activation energy) after say 20 years since the inception
of the project then, as they say, that will be the story of us for the
history books. At some point the world could in all likelihood move on
from us to something else.

>
> I do respect the desire to improve the API under several aspects - and
> I see this as the main reason for having something called pandas 2.0.

Frankly, I would rather have a new project name, but retain
affiliation with the pandas community in spirit and governance.

>
> But I think this discussion of the API could and should be decoupled
> from the idea to rewrite/reorganize internals.
>
> Indexing code and many internals badly need a rewrite, regardless of
> whether we change the API, regardless of whether we call it "2.0", and
> regardless of whether we change the entire codebase all at once, or
> refactor bit by bit. They firstly need it because there are 313 open
> bugs labeled "Indexing", and some of them are very difficult to solve
> because the code is unnecessarily complicated.
> But I think this rewrite is basically what is happening daily.
>
> More generally, if our plan is to close, sooner or later, 2400+ bugs by
> basically saying "pandas 1.0 is obsolete, long live pandas 2.0"... then
> we are not doing a great service to our users in releasing pandas 1.0
> as such.

I don't think pandas as it exists now will ever be obsolete, at least
not on a 10 year horizon. At some point I think we should close off
the core to anything new, and restrict changes to either bug fixes or
deprecations / removals. New functionality should come in the form of
add-on libraries that build off the core API.

In a way, what I would like to see is something more like what the R
community has -- data frames are "built into the language" and so many
libraries can work on the same data and be assured of interop. We are
already sort of doing this with pandas + statsmodels, sklearn, etc.
but I would argue it needs to be taken even further.

For the record, I am never going to argue that pandas should not be
maintained or that the user base should be abandoned. However, I
question whether the current core maintainers have a duty to be
tethered to the issue backlog for the rest of their lives. Perhaps
maintenance could be taken up by a for-profit company at some point?

I do know that it is difficult to impossible to innovate and build new
software while simultaneously keeping up with a bugfix/maintenance
grind.

>
>
>> I spent some time at the sprint looking through pandas.core.internals,
>> pandas.core.generic, and some of the other low level pieces, and my
>> feeling is that it would be easier to start over.
>
> I have a different feeling on what is easier, but I might very well be
> wrong, or it might be a matter of personal taste (e.g. it is true that
> when I reformat some code in current pandas I more often than not end
> up finding bugs in some other place, but at least I can immediately
> test the code I am writing because that "other place" exists).
>
> Something we all agree on is that, be it in a new or in the same codebase,
> rewriting internals takes time and effort. It requires that some of the
> already few devs divert effort from improving the current codebase to
> focusing on the new one. If we plan to do it, shouldn't we be doing it
> now?
>
> (And if we don't do it now, isn't it because we don't really feel the
> urge to do it?)

See above...

- Wes

>
> In any case, what your mail suggests to me is that we definitely need to
> spend one (or more) dev talk looking through pandas.core.internals
> and pandas.core.generic all together!
>
> Cheers,
>
> Pietro