[Pandas-dev] Pandas Sprint Recap

Wes McKinney wesmckinn at gmail.com
Mon Jul 16 13:50:29 EDT 2018


> One thing I want to reiterate: it's not going to take another 11 years to
> reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
> mean we won't ever be able to fix it.

One point on this that we discussed some at the sprint and during
SciPy: a major overhaul of pandas may at some point require a shift
to a "new codebase". The new code could live in the same
pandas-dev/pandas git repository, which would serve as a monorepo for
several Python package artifacts. That would make it much easier to
factor out reusable components and share code between packages. The
test suite could also be refactored to run against both
"future-pandas" and "pandas" (or whatever we want to call them).

I'm skeptical that the kinds of significant / breaking changes
we've discussed over the last 3 years can happen in an iterative /
organic fashion within the current pandas codebase. I'd like to avoid
getting stuck in place for a decade; if we haven't made much progress
toward some of these major changes by, say, the beginning of 2020 or
2021, we might want to take a step back and evaluate our situation.

I spent some time at the sprint looking through pandas.core.internals,
pandas.core.generic, and some of the other low-level pieces, and my
feeling is that it would be easier to start over.

All of this is made much more difficult by pandas's spartan funding
situation (Joris and Tom supported at ~50% time, rest of maintainers
are volunteers AFAIK).

In the meantime, personally my efforts will continue to be focused on
building portable, front-end-agnostic, reusable computational
libraries for in-memory computing on large tabular datasets (i.e.
https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
I believe that bootstrapping a much larger community to work on these
problems will reduce our collective maintenance burden (though it is
likely to take a number of years for this to pay off).

- Wes

On Fri, Jul 13, 2018 at 1:45 PM, Tom Augspurger
<tom.augspurger88 at gmail.com> wrote:
> Thanks Pietro,
>
> We didn't discuss indexing much, beyond agreeing that there's work to be
> done, and that fixing it was too large a task for 1.0.
>
> As for whether an individual issue is a bug or feature, we'll have to
> continue using our judgement. I think we'll inevitably break users' code
> in a 1.x release as we fix bugs.
>
> We'll need to discuss workflows for these large changes (e.g. ripping out
> the block manager) that will be API breaking, but may take some time to
> land. Keeping a separate branch in sync is a pain, but may be the least
> painful alternative.
>
> One thing I want to reiterate: it's not going to take another 11 years to
> reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
> mean we won't ever be able to fix it.
>
> Tom
>
> On Fri, Jul 13, 2018 at 12:12 PM, Pietro Battiston <me at pietrobattiston.it>
> wrote:
>>
>> Hi Tom,
>>
>> first, thanks to all those who participated in the sprint, and for the
>> recap.
>>
>> On Sun, 8 Jul 2018 at 16:26 -0500, Tom Augspurger wrote:
>> > [...]
>> > I've posted a document on our wiki with a summary of the topics
>> > discussed.
>> > https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)
>> >
>> > If people have questions or comments, feel free to post here and
>> > we'll clarify that document.
>>
>> Something that scares me - but maybe because I'm missing something
>> obvious - is what exactly qualifies as "deprecation". Is it something
>> which was once presented as a distinct feature and is then disabled, or
>> any general change to what an API call does (that is, anything
>> requiring a deprecation cycle)?
>>
>> There are many bugs - in particular, in indexing code - which might
>> break existing code when fixed. Some of them will have non-trivial
>> deprecation paths/detection strategies. The first ones that come to my
>> mind are #18631, #12827, #9519. The last one, in particular, implies
>> changing the result of potentially tons of calls to .loc on a
>> non-unique index.
>>
>> My view is that those (and many more, including several that will be
>> found) will be best fixed through a total rewrite of indexing code
>> (i.e., all code in indexing.py, and some code in internals.py), which I
>> assumed would happen before 1.0, and which I certainly won't be able to
>> do before 0.24.0 (September 2018).
>> I'm clearly not claiming that nobody else can do it (nor that the bugs
>> can necessarily only be fixed through a complete rewrite)... but since
>> I did not get any feedback on
>> https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
>> ... I assume that nobody is focusing/planning to focus on this in the
>> near future (or was it somehow discussed in the sprint?).
>>
>> I perfectly understand the desire to stop postponing 1.0 to a vague
>> future, if it's just a matter of recognizing that pandas is worth
>> using.
>> But if it's a statement/commitment about code robustness/quality, and
>> relatedly API stability... then I think it is risky to leave the
>> indexing API, and more generally the core codebase (as opposed to
>> important but more lateral features such as new dtypes), out of the
>> picture (e.g. out of #21894).
>>
>> Cheers,
>>
>> Pietro
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev