[Pandas-dev] Pandas Sprint Recap

Wes McKinney wesmckinn at gmail.com
Mon Jul 16 21:04:21 EDT 2018


hi Pietro,

On Mon, Jul 16, 2018 at 8:23 PM, Pietro Battiston <me at pietrobattiston.it> wrote:
> Hi Wes,
>
> thanks for the extensive reply. But sorry, it's probably that I missed
> the sprint, but I really can't follow you. Do you have any pointers to
> better understand the future pandas (alternative) you have in mind? I
> know about Arrow, but I see it as a future potentiality for pandas, not
> as an alternative, or even the germ of it (and clearly not in the sense
> of "it's not powerful enough", but of "it has different scope"). Even
> less do I understand why pandas (or a "pandas-like library") should
> change name, if we are mostly talking about internals/implementation
> issues (rather than about API/features). Compared to this, the decision
> to rewrite the codebase or not is admittedly minor...
>
> I see a vision in your email, and certainly many political/community
> aspects I must be missing... but I still mostly miss the technical
> details supporting this vision, and apparently
> https://pandas-dev.github.io/pandas2 won't help me. Again, talking all
> together about what makes you think that the current codebase needs a
> complete rewrite would be great. Hope we can do this in one of the
> next devs calls.
>

Well, here is a document from more than 2 years ago now:

https://pandas-dev.github.io/pandas2/goals.html

I would summarize the big-picture goals as:

* Simpler, more predictable and precise memory management
* Ability to work with memory-mapped, on-disk data (this part is essential)
* Substantially less memory use for non-numeric data
* More civilized copy-on-write semantics
* Improved interoperability with the rest of the world (being able to
reuse libraries, algorithms for analytics more gracefully)
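
To make the memory-mapping goal concrete, here is a minimal sketch using
only the Python standard library. It is not pandas or Arrow code, and the
names write_column and mapped_sum are hypothetical, chosen just for this
illustration. It assumes a column stored on disk as a flat run of
native-endian float64 values, and it reads a slice of that column through
a memory map instead of loading the whole file into process memory:

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Store values on disk as a flat run of native-endian float64."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_sum(path, start, stop):
    """Sum rows [start, stop) of the on-disk column via a read-only memory map.

    The file contents are not copied into a Python list or bytes object;
    the operating system pages in only the parts of the file that the
    slice touches.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as float64 without copying.
            # Both views are released before the map is closed.
            with memoryview(mm) as raw, raw.cast("d") as col:
                return sum(col[start:stop])


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".f64")
    os.close(fd)
    try:
        write_column(path, [float(i) for i in range(1000)])
        print(mapped_sum(path, 10, 20))
    finally:
        os.remove(path)
```

A real columnar format also needs metadata, null handling, and
variable-length types such as strings, which this sketch leaves out.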

I have been working very hard to present a sound, working,
non-hand-wavy solution to these low-level problems. I am a
mathematician by training, and so I am allergic to hand-wavy solutions
or "designs" lacking in rigor in the fine details.

I wrote this blog post addressing some of these topics and more:
http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I have spent a great deal of energy in blog posts, slide decks, etc.
laying out the technical details of how this can work and why it is a
sound approach. I am not sure what more I can do other than to hope
that those of like mind, with an inclination to work on systems
engineering, follow along.

Given the above requirements, I don't see a way forward that does not
involve at minimum scrapping pandas.core.internals.

> In any case,
>
> Il giorno lun, 16/07/2018 alle 19.17 -0400, Wes McKinney ha scritto:
>> [...]
>> For the record, I am never going to argue that pandas should not be
>> maintained or that the user base should be abandoned. However, I
>> question whether the current core maintainers have a duty to be
>> tethered to the issue backlog for the rest of their lives. Perhaps
>> maintenance could be taken up by a for-profit company at some point?
>
> The idea that I should sooner or later pay to use (a working version
> of) the code I'm helping to write is even more depressing, to me, than
> the idea that such effort will go partly wasted in a rewrite.
>
> I'm personally "tethered" to a software which changed the way I work
> every day, and to which I occasionally try to contribute back. The
> "backlog" is not just a pile of dirt: it signals that (net of some
> possible better triaging) there are things to fix in the software.
> I see any change, and even a rewrite, as good basically if and only if
> it allows us to reduce this "backlog".
>
> Your answer to the question "are we wasting time on pandas?" is
> basically "I'm not, you are". I wonder whether it was discussed in
> these terms at the sprint!

Whoa, I never said this and I do not believe anyone is wasting their time.

Maintaining/supporting pandas in its current state is a valid way to
spend your time, but my concern is that the feelings of obligation
toward keeping the status quo afloat may stop the community from
making progress on fundamental issues in performance and scalability.
Realistically we need to find a way to do both sustainably (though
it's arguable whether development now is sustainable).

-W

>
> Cheers,
>
> Pietro

