[Pandas-dev] Dataframe summit @ EuroSciPy 2019

Fri Sep 13 15:32:01 EDT 2019

hey Marc,

On Fri, Sep 13, 2019 at 10:57 AM Marc Garcia <garcia.marc at gmail.com> wrote:
>
> Hi Wes,
>
> Thanks for the feedback. I actually discussed the blog post with Sylvain, since it didn't seem to be the right channel to communicate his concerns to you and the Arrow team if you hadn't already discussed them. But he mentioned you had already discussed them in the past. It's also worth noting that the last point was from Maarten Breddels (vaex author).
>

Thanks. So I won't presume to know Maarten's intent (he can speak here
for himself), but he has made a number of public comments that seem to
me to discourage people from getting involved in the Arrow community.
From this post

https://github.com/pandas-dev/pandas/issues/8640#issuecomment-527416973

"In vaex-core, currently ... we are not depending on arrow. ... My
point is, I think if general algorithms (especially string algos) go
into arrow, it will be 'lost' for use outside of arrow, because it's
such a big dependency."

I'd like to focus on this idea of code contributed to Apache Arrow
being "lost" or the project being "such a big dependency". Besides
appearing to "throw shade" at me and other people, it doesn't make a lot
of sense to me. In vaex, Maarten has developed string algorithms that
execute against the Arrow string memory layout (and rebranded such
data "Superstrings"). I don't support this approach for a couple of
reasons

* It serves to promote vaex at the expense of other projects that use
some part of Apache Arrow, whether only the columnar specification or
one or more libraries
* The code cannot be easily reused outside of vaex

If the objective is to write code that deals with Arrow data, why not
do this development inside... the Arrow community? You have a group of
hundreds of developers around the world working together toward common
goals with shared build, test, and packaging infrastructure.
Applications that need to process strings may also need to send Arrow
protocol messages over shared memory or RPC; we have code for that.
There's good reason for creating a composable application development
platform.

The use of the "big dependency" argument is a distraction from the
central idea that the community ideally should work together to build
reusable code artifacts that everyone can use. If the part of the
project that you need is small, we should work together to make _just_
that part available to you in a way that is not onerous. I am happy to
do my part to help with this.

In summary, my position is that promoting Arrow development outside of
the Arrow community isn't good for the open source world as a whole.
Part of the reason I've made personal and financial sacrifices to
establish Ursa Labs and work full time on the project is precisely to
support developers of projects like Vaex in their use of it. I can't
help them, though, if they don't want to be helped. We want to become
a dependency (in large or small part) of downstream projects so we can
work together and help each other.

> I don't know enough about C++ or Arrow to have my own opinion on any of them. I just tried to share what was discussed during the meeting, so it was available to everybody who couldn't attend but could be interested. Also, re-reading what I wrote, "People were in general happy with the idea [Arrow]" may not emphasize enough the satisfaction with the project. But I can say that Sylvain made great comments about Arrow and you personally before commenting on the couple of things he disagrees with in the Arrow implementation. Sorry if I wasn't able to phrase things in the best way. I'm happy to make amendments if needed.
>
> Do you think it makes sense to forward your email to Sylvain? I know you already discussed this with him, but it may be worth discussing again? Just let me know.
>

I think it's best if we stick to public mailing lists so all of our
comments are on the public record, either

dev at arrow.apache.org, for general Arrow development matters, or
pandas-dev at python.org, for pandas-specific matters

Thanks,
Wes

> On Fri, Sep 13, 2019 at 4:16 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> hey Marc,
>>
>> I saw the write-up about the meeting on your blog
>>
>> https://datapythonista.github.io/blog/dataframe-summit-at-euroscipy.html
>>
>> Thanks for making this happen! Sorry that I wasn't able to attend.
>>
>> It seems that Sylvain Corlay raised some concerns about the Apache
>> Arrow project. The best place to discuss these is on the
>> dev at arrow.apache.org mailing list. I welcome a direct technical
>> discussion.
>>
>> Some specific responses to some of these
>>
>> 1. Apache Arrow C++ API and implementation not following common C++ idioms
>>
>> Sylvain has said this a number of times over the last couple of years
>> in various contexts. The Arrow C++ codebase is a _big_ project, and
>> this criticism AFAICT is specifically about a few header files (in
>> particular arrow/array.h) that he doesn't like. I have said many
>> times, publicly and privately, that the solution to this is to develop
>> an STL-compliant interface layer to the Arrow columnar format that
>> suits the desires of groups like xtensor. We have invited the xtensor
>> developers to contribute more to Apache Arrow. There is nothing
>> structural about this project that's preventing this from happening.
>>
>> We also invite an independent, wholly header-only STL-compliant
>> implementation of the Arrow columnar data structures. PRs welcome.
>>
>> 2. Using a monorepo (including all bindings in the same repo as Arrow)
>>
>> It would be more helpful to have a discussion about this on the dev@
>> mailing list to understand why this is a concern. We have many
>> interdependent components, written in different programming languages,
>> and the monorepo structure enables us to have peace of mind that pull
>> requests to one component aren't breaking any other. For example, we
>> have binary protocol integration tests testing 4 different
>> implementations against each other on every commit: C++, Go, Java, and
>> JavaScript, with C# and Rust on their way eventually.
>>
>> Unfortunately, people like to criticize monorepos as a matter of
>> principle. But if you actually look at the testing requirements that a
>> project has, often a monorepo is the only really viable solution. I'm
>> open minded about concrete alternative proposals to the project's
>> current structure that enable us to verify whether PRs break any of
>> our integration tests (keep in mind the PRs routinely touch multiple
>> project components).
>>
>> 3. Not a clear distinction between the specification and
>> implementation (as in for instance project Jupyter)
>>
>> This is a red herring. It's about the *community*. In the ASF, we have
>> a saying: "Community over Code". One of the artifacts that the Arrow
>> community has produced is a specification for a columnar in-memory
>> data representation. At this point, the Arrow columnar specification
>> is a relatively small part of the work that's been produced by the
>> community, though it's obviously very important. I explained this in
>> my recent workshop on the project at VLDB
>>
>> * https://twitter.com/wesmckinn/status/1169277437856964614
>> * https://www.slideshare.net/wesm/apache-arrow-workshop-at-vldb-2019-boss-session-169065658
>>
>> More generally, I'm interested to understand at what point projects
>> would be able to take on Apache Arrow as a dependency. The goal of the
>> project (and why I've invested ~4 years of my life and counting in it)
>> is to make everyone's lives _easier_, not harder. It seems to me to be
>> an inevitability of time, and so if there is work that we can be
>> prioritizing to speed along this outcome, please let me know.
>>
>> Thanks,
>> Wes
>>
>> On Tue, Jul 16, 2019 at 6:25 AM Marc Garcia <garcia.marc at gmail.com> wrote:
>> >
>> > For the people who have shown interest in joining remotely, I added you to the repo of the summit [1]; feel free to open issues there for the topics you're interested in discussing. I also created a Gitter channel that you can join [2].
>> >
>> > EuroSciPy doesn't currently have the budget to live stream the session, but if we find a sponsor we'll do it, and also publish the recording on YouTube. Based on the experience with the European pandas summit, this seems unlikely.
>> >
>> > Cheers!
>> >
>> > 1. https://github.com/python-sprints/dataframe-summit
>> > 2. https://gitter.im/py-sprints/dataframe-summit
>> >
>> >
>> > On Wed, Jul 10, 2019 at 8:30 AM Pietro Battiston <me at pietrobattiston.it> wrote:
>> >>
>> >> Hi Marc,
>> >>
>> >> cool!
>> >>
>> >> I won't be able to attend Euroscipy, but if in the "Maintainers
>> >> session" you plan to have a way to participate remotely, I'll
>> >> definitely do so.
>> >>
>> >> (I might be busy on the 6th instead... still don't know for sure)
>> >>
>> >> Pietro
>> >>
>> >> On Thu, 04/07/2019 at 15.45 +0100, Marc Garcia wrote:
>> >> > Hi there,
>> >> >
>> >> > Just to let you know that at EuroSciPy 2019 (in September in Spain)
>> >> > we will have a dataframe summit, to stay updated and coordinate among
>> >> > projects replicating the pandas API (other dataframe projects are
>> >> > more than welcome).
>> >> >
>> >> > Maintainers from all the main projects (pandas, dask, vaex, modin,
>> >> > cudf and koalas) will be attending. If you want to get involved
>> >> > (whether you can attend the conference or not), please DM me.
>> >> >
>> >> > More info: https://github.com/python-sprints/dataframe-summit
>> >> > Conference website: https://www.euroscipy.org/2019/
>> >> >
>> >> > Cheers!
>> >> > _______________________________________________
>> >> > Pandas-dev mailing list
>> >> > Pandas-dev at python.org
>> >> > https://mail.python.org/mailman/listinfo/pandas-dev
>> >>