[Pandas-dev] Dataframe summit @ EuroSciPy 2019

Wes McKinney wesmckinn at gmail.com
Sun Sep 15 17:07:09 EDT 2019


hi Maarten,

This discussion definitely brings up "The Lisp Curse" for me

http://www.winestockwebdesign.com/Essays/Lisp_Curse.html

"Lisp is so powerful that problems which are technical issues in other
programming languages are social issues in Lisp."

I understand that a lot of miscommunication in open source comes down
to differing perspectives on things, so I will try to explain some of
my concerns in response to your comments.

On Sun, Sep 15, 2019 at 2:40 PM Maarten Breddels
<maartenbreddels at gmail.com> wrote:
>
> Dear Wes,
>
> Let me start by saying that I really appreciate your work on Apache Arrow. I expressed that quite clearly at the summit, during my talk and I talk about arrow in the “superstring” article. In general I try to be as supportive of Arrow as I can. From this reply I read that you think differently about my actions. Let us see if we can convince you I have good intentions.
>
>
> On 13 Sep 2019, at 21:32, Wes McKinney <wesmckinn at gmail.com> wrote:
>
> hey Marc,
>
> On Fri, Sep 13, 2019 at 10:57 AM Marc Garcia <garcia.marc at gmail.com> wrote:
>
>
> Hi Wes,
>
> Thanks for the feedback. I actually discussed with Sylvain regarding the blog post, since it didn't seem to be the right channel to communicate them to you and the Arrow team if you didn't already discuss them. But he mentioned you already discussed them in the past. Also, worth commenting that the last point was from Maarten Breddels (vaex author).
>
>
> Thanks. So I won't presume to know Maarten's intent (he can speak here
> for himself), but he has made a number of public comments that seem to
> me to discourage people from getting involved in the Arrow community.
> From this post
>
> https://github.com/pandas-dev/pandas/issues/8640#issuecomment-527416973
>
>
> Quite the opposite, I would like people to get involved, myself included. But I think the approach for strings can be done a bit differently, to also let the C++ community benefit from the work (I'll elaborate a bit more on that thread, so as not to go off-topic).
>
>
> "In vaex-core, currently ... we are not depending on arrow. ... My
> point is, I think if general algorithms (especially string algos) go
> into arrow, it will be 'lost' for use outside of arrow, because it's
> such a big dependency."
>
> I'd like to focus on this idea of code contributed to Apache Arrow
> being "lost" or the project being "such a big dependency". Besides
> appearing to "throw shade" at me and other people, it doesn't make a lot
> of sense to me. In vaex, Maarten has developed string algorithms that
> execute against the Arrow string memory layout (and rebranded such
> data "Superstrings"). I don't support this approach for a couple of
> reasons
>
>
> I don’t want to throw shade at you; I’ll elaborate on the “lost” idea on GitHub.
>
> There are multiple reasons the string processing in vaex is not in Arrow:
>  * What I wanted was not possible in Arrow 0.12 (32-bit vs 64-bit offsets): https://github.com/apache/arrow/issues/3845. Now, with Arrow 0.14, this is possible.
>  * I saw Arrow as a memory layout spec plus implementations. The realisation that algorithms were also part of Arrow came during development, while discussing with Uwe and seeing the 0.13 changelist (I think that included value_counts?). I see now that this is clearly mentioned on the website, so this was probably my misunderstanding.
>  * Funding/time: this work was unfunded, but we had to get it in in a short amount of time, so we took on some technical debt instead. The code would never be acceptable for a PR to Arrow.
>
> Although I wish I could contribute (more) to Arrow, my time is limited; I do my best to open issues, but you simply cannot expect people to contribute to other open source projects no matter what.
>

I don't expect that, but I don't think it's good to suggest that other
people should _also_ not contribute (without disclosing the potential
problems -- see below). Whether or not that was your intent, that's
how the comments came across to me and others.
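Since much of this thread turns on the Arrow string memory layout, a concrete illustration may help: the columnar format stores a string column as one contiguous UTF-8 data buffer plus a buffer of offsets into it (int32 offsets in the original spec; the 32-bit vs 64-bit question raised above is about widening these to int64). A minimal pure-Python sketch of that layout, not using any Arrow library (function names are illustrative only):

```python
import struct

def encode_arrow_strings(values, offset_width=4):
    """Encode a list of strings into Arrow-style (offsets, data) buffers.

    offset_width=4 mimics the int32 offsets of the original spec;
    offset_width=8 mimics the 64-bit offsets discussed for large strings.
    """
    fmt = {4: "<i", 8: "<q"}[offset_width]
    data = bytearray()
    offsets = bytearray(struct.pack(fmt, 0))  # first offset is always 0
    for v in values:
        data.extend(v.encode("utf-8"))
        offsets.extend(struct.pack(fmt, len(data)))
    return bytes(offsets), bytes(data)

def decode_string(offsets, data, i, offset_width=4):
    """Read element i as the slice data[offsets[i]:offsets[i+1]]."""
    fmt = {4: "<i", 8: "<q"}[offset_width]
    start = struct.unpack_from(fmt, offsets, i * offset_width)[0]
    end = struct.unpack_from(fmt, offsets, (i + 1) * offset_width)[0]
    return data[start:end].decode("utf-8")

# ["foo", "ba", ""] becomes offsets [0, 3, 5, 5] over data b"fooba"
offsets, data = encode_arrow_strings(["foo", "ba", ""])
```

Any library that writes these two buffers correctly produces data other Arrow-aware libraries can read without copying, which is exactly why getting the details right matters.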

>
>
> * It serves to promote vaex at the expense of other projects that use
> some part of Apache Arrow, whether only the columnar specification or
> one or more libraries
>
>
> Not my intent. Since I mention Arrow in the superstring article, I did not mean to rebrand anything. I explicitly mentioned Arrow as a gesture of goodwill, and because I think the spec is good.
>
> * The code cannot be easily reused outside of vaex
>
>
> Totally agree with you on that, and that’s why I replied to the GitHub issue, since I care. I spent quite some effort on that, and I think nobody should have to do that again (but again, let’s continue that on GitHub).
>
>
> If the objective is to write code that deals with Arrow data, why not
> do this development inside... the Arrow community? You have a group of
> hundreds of developers around the world working together toward common
> goals with shared build, test, and packaging infrastructure.
> Applications that need to process strings may also need to send Arrow
> protocol messages over shared memory or RPC, we have code for that.
> There's good reason for creating a composable application development
> platform.
>
> The use of the "big dependency" argument is a distraction from the
> central idea that the community ideally should work together to build
> reusable code artifacts that everyone can use. If the part of the
> project that you need is small, we should work together to make _just_
> that part available to you in a way that is not onerous. I am happy to
> do my part to help with this.
>
>
>
> In summary, my position is that promoting Arrow development outside of
> the Arrow community isn't good for the open source world as a whole.
>
>
> I think this is a big part of the misunderstanding and/or disagreement. Don’t take it as bad intent if people don’t agree with you on this, or if they assume otherwise. My idea was that Apache Arrow defined the memory spec, and we can all happily build on top of that, with Apache Arrow as a dependency or not; as long as you follow the spec, we can all “speak the same data”. I am still not fully convinced that everything should go in the Arrow project, since there will always be some obscure algorithm that needs to work on Arrow data that you don’t want in your repo. So I think we might disagree on where to draw the line. You want to have a lot of algorithms directly in Arrow. That’s your decision, and I’m ok with that, but don’t attack me because I did not know your intentions/plans.
>

I think the "follow the spec" idea here is where we are diverging.

The trouble is: creating a 100% complete and provably correct
implementation of the columnar specification is difficult. There are
hundreds of details that must be exactly right, lest applications
compute incorrect results or even segfault.

So, my argument (which we can agree to disagree about) is that having
a proliferation of independent and incomplete Arrow implementations is
almost certainly hurtful to the open source community.

If you implement 10% or 20% of the columnar specification and tell
people that you are "Arrow-based" or "following the Arrow spec",
what's wrong with that? Well, in the short term it may be okay. In the
long term, let's say that we end up with a situation like this:

* A reference C++ implementation of 100% of the columnar spec, and a
library ecosystem that relies on a common core library, call this
ecosystem REFERENCE
* Some number of independent "Arrow core" implementations in C or C++
that are less than 100% complete. Let's call these libraries
THIRDPARTY_A, THIRDPARTY_B, etc.

If there is a fragmentation of functionality between these projects,
developers may need things from different libraries. For example,
suppose I need an algorithm from THIRDPARTY_A. But THIRDPARTY_A may
not have a complete, battle-tested implementation of Arrow's binary
protocol. So to use THIRDPARTY_A's code, someone will have to write
and maintain a "serializer" to zero-copy cast from one library's data
structures to the other's. If THIRDPARTY_A has
implemented something incorrectly, data handed off to REFERENCE may be
difficult to fully validate as being compliant and so segfaults or
worse could ensue.
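To make the validation point concrete, here is an illustrative sketch (not the Arrow libraries' actual API) of the kind of invariant checking a complete implementation must perform before trusting an offsets buffer handed over from another library. An incomplete implementation that indexes blindly into the data buffer is exactly where out-of-bounds reads (and, in C++, segfaults) come from:

```python
import struct

def validate_string_offsets(offsets, data, length, offset_width=4):
    """Check basic invariants of an Arrow-style string offsets buffer.

    Illustrative only: a real implementation (e.g. Arrow C++'s array
    validation) checks many more invariants than these three.
    """
    fmt = {4: "<i", 8: "<q"}[offset_width]
    if len(offsets) < (length + 1) * offset_width:
        return False  # offsets buffer too small for the stated length
    values = [struct.unpack_from(fmt, offsets, i * offset_width)[0]
              for i in range(length + 1)]
    if values[0] < 0 or values[-1] > len(data):
        return False  # offsets must stay inside the data buffer
    return all(a <= b for a, b in zip(values, values[1:]))  # monotonic

good = struct.pack("<4i", 0, 3, 5, 5)
bad = struct.pack("<4i", 0, 3, 99, 5)  # out of bounds, non-monotonic
```

Data that fails checks like these can silently corrupt results when handed to a library that assumed the producer had already validated them.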

This will create an inherent tension for such developers that will
discourage their involvement with, or use of, REFERENCE or
THIRDPARTY_A, or both.

It may be the case that the developers of THIRDPARTY_A don't care
about some of the stuff in REFERENCE, and they don't care about the
parts of the specification that they haven't implemented or haven't
tested thoroughly. That's fine, but because many developers don't
understand how extensive the specification is, if you say that your
third party project is "Arrow-compatible" or "following the Arrow
spec" they may be fooled into believing that they can use code from
multiple "Arrow-compatible" projects without pain or risk to their
applications.

So when I read your comments, they said to me "I am not depending on
or contributing to REFERENCE because of $REASONS, but I would be
interested in creating THIRDPARTY_A that also does not depend on
REFERENCE". I think this is a much riskier path than it seems at face
value. You said in your Superstring article "it means that all
projects supporting the Apache Arrow format will be able to use the
same data structure without any memory copying". The handoff (whether
in-memory or via the IPC protocol) from project A to project B becomes
a source of risk if project A and project B have been insufficiently
integration tested.

I would like to see everyone depending on a common core library with a
100% complete and battle-tested implementation of the specification
and binary protocol to eliminate these risks and fragmentation of
labor around compatibility testing. If there are practical barriers to
this, I'd like to understand what they are so we can work together to
eliminate them.

Another area where I'm quite interested is to create a "nanoarrow"
ANSI C or C99 library that provides an ultraminimalist set of C data
structures to use as the basis for third party applications to develop
their custom algorithms against if they want the smallest possible
dependency to vendor into their project. Then we can build and
maintain C bindings to the C++ library to leverage more advanced
features (like memory mapping) if needed. This is inspired by
https://github.com/nanopb/nanopb
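As a rough illustration of what such an ultraminimalist core might expose (all names here are hypothetical; no such library existed at the time of this email), the central data structure could be little more than a length, a null count, and a list of opaque buffers, mirroring the columnar spec's physical layout directly. Sketched in Python rather than C for brevity:

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MinimalArray:
    """Hypothetical sketch of a 'nanoarrow'-style array handle.

    Mirrors the columnar spec's physical layout: validity bitmap,
    offsets, and data are just opaque byte buffers; interpreting
    them is left entirely to the caller's algorithms.
    """
    type_name: str                  # e.g. "utf8", "int32"
    length: int
    null_count: int
    buffers: List[Optional[bytes]]  # [validity, offsets, data]; any may be None
    children: List["MinimalArray"] = field(default_factory=list)  # nested types

# A 3-element utf8 array ["foo", "ba", ""] with no nulls:
arr = MinimalArray(
    type_name="utf8",
    length=3,
    null_count=0,
    buffers=[None, struct.pack("<4i", 0, 3, 5, 5), b"fooba"],
)
```

The appeal of keeping the handle this small is that a third-party project can vendor it without pulling in IPC, memory mapping, or any of the larger C++ library's machinery.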

>
> Partly why I've made personal and financial sacrifices to establish
> Ursa Labs and work full time on the project is precisely to support
> developers of projects like Vaex in their use of the project. I can't
> help them, though, if they don't want to be helped. We want to become
> a dependency (in large or small part) of downstream projects so we can
> work together and help each other.
>
>
> I *do* want to be helped :) I see so many more possibilities when vaex-core depends on Arrow, but this was not possible before 0.14. You mentioned some issues with wheels on Windows on Twitter, right? If that’s fixed, I see no reason not to adopt Arrow and depend on it.
>

As far as I know 0.15.0 will have Windows wheels. In the absence of
more maintainers for wheels the most likely scenario is that the wheel
packages will be more minimalist (with many components disabled -- due
to the compatibility issues around having statically-linked LLVM
symbols and other things in wheels) while the conda packages will be
much more comprehensive.

Thanks,
Wes

>
>
> I don't know enough about C++ or Arrow to have my own opinion on any of them. I just tried to share what was discussed during the meeting, so it was shared with everybody who couldn't attend but could be interested. Also, re-reading what I wrote, "People were in general happy with the idea [Arrow]" may not emphasize enough the satisfaction with the project. But I can say that Sylvain made great comments about Arrow and you personally before commenting on the couple of things he disagrees with in the Arrow implementation. Sorry if I wasn't able to phrase things in the best way. I'm happy to make amendments if needed.
>
> Do you think it makes sense to forward your email to Sylvain? I know you already discussed this with him, but it may be worth discussing again? Just let me know.
>
>
> I think it's best if we stick to public mailing lists so all of our
> comments are on the public record, either
>
> dev at arrow.apache.org, for general Arrow development matters, or
> pandas-dev at python.org, for pandas-specific matters
>
> Thanks,
> Wes
>
> On Fri, Sep 13, 2019 at 4:16 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>
>
> hey Marc,
>
> I saw the write-up about the meeting on your blog
>
> https://datapythonista.github.io/blog/dataframe-summit-at-euroscipy.html
>
> Thanks for making this happen! Sorry that I wasn't able to attend.
>
> It seems that Sylvain Corlay raised some concerns about the Apache
> Arrow project. The best place to discuss these is on the
> dev at arrow.apache.org mailing list. I welcome a direct technical
> discussion.
>
> Some specific responses to some of these
>
> 1. Apache Arrow C++ API and implementation not following common C++ idioms
>
> Sylvain has said this a number of times over the last couple of years
> in various contexts. The Arrow C++ codebase is a _big_ project, and
> this criticism AFAICT is specifically about a few header files (in
> particular arrow/array.h) that he doesn't like. I have said many
> times, publicly and privately, that the solution to this is to develop
> an STL-compliant interface layer to the Arrow columnar format that
> suits the desires of groups like xtensor. We have invited the xtensor
> developers to contribute more to Apache Arrow. There is nothing
> structural about this project that's preventing this from happening.
>
> We also invite an independent, wholly header-only STL-compliant
> implementation of the Arrow columnar data structures. PRs welcome.
>
> 2. Using a monorepo (including all bindings in the same repo as Arrow)
>
> It would be more helpful to have a discussion about this on the dev@
> mailing list to understand why this is a concern. We have many
> interdependent components, written in different programming languages,
> and the monorepo structure enables us to have peace of mind that pull
> requests to one component aren't breaking any other. For example, we
> have binary protocol integration tests testing 4 different
> implementations against each other on every commit: C++, Go, Java, and
> JavaScript, with C# and Rust on their way eventually.
>
> Unfortunately, people like to criticize monorepos as a matter of
> principle. But if you actually look at the testing requirements that a
> project has, often a monorepo is the only really viable solution. I'm
> open minded about concrete alternative proposals to the project's
> current structure that enable us to verify whether PRs break any of
> our integration tests (keep in mind the PRs routinely touch multiple
> project components).
>
>
> Totally with you on monorepos; vaex uses the same approach, and I believe it saves me a lot of time.
>
>
> 3. Not a clear distinction between the specification and
> implementation (as in for instance project Jupyter)
>
> This is a red herring. It's about the *community*. In the ASF, we have a
> saying: "Community over Code". One of the artifacts that the Arrow
> community has produced is a specification for a columnar in-memory
> data representation. At this point, the Arrow columnar specification
> is a relatively small part of the work that's been produced by the
> community, though it's obviously very important. I explained this in
> my recent workshop on the project at VLDB
>
> * https://twitter.com/wesmckinn/status/1169277437856964614
> * https://www.slideshare.net/wesm/apache-arrow-workshop-at-vldb-2019-boss-session-169065658
>
>
> I added that point in a PR to the blog post because I think it more accurately reflected the discussion, although I did agree with it.
>
> Let me try to rephrase your idea, to see if I get it:
> Instead of having a spec, with everybody building on top of it independently and many people attacking the same issues (CI, building, distributing, etc.), we put it all in one repo and collaborate, so we share that burden. Is that somewhat accurate?
>
>
>
>
> More generally, I'm interested to understand at what point projects
> would be able to take on Apache Arrow as dependency. The goal of the
> project (and why I've invested ~4 years of my life and counting in it)
> is to make everyone's lives _easier_, not harder. It seems to me to be
> an inevitability of time, and so if there is work that we can be
> prioritizing to speed along this outcome, please let me know.
>
>
> As mentioned above, I’m almost there (taking Arrow as a dependency).
> I would also really love to see the string algorithms go into Arrow (though maybe we disagree on the details). I currently have no funding or serious time to spend on that, but I’m happy to share my thoughts and experiences, and to help where I can.
>
> I’m pretty happy with this reply actually, since you’ve clarified a lot for me. The tone could be a bit different, but given the way you saw my actions it makes more sense. I hope I’ve taken away some frustrations, and that we can build bridges and a better world :) (well, at least regarding dataframes).
> If you still feel I’ve done something that goes against you/Arrow or steps on your toes, let me know, publicly or privately. We cannot guess people’s motives or incentives, and if my actions unintentionally frustrate you, that’s a waste of energy. I’d rather vaex be a stimulus for Apache Arrow than a source of frustration.
>
>
> Cheers,
>
> Maarten Breddels
>
>
>
> Thanks,
> Wes
>
> On Tue, Jul 16, 2019 at 6:25 AM Marc Garcia <garcia.marc at gmail.com> wrote:
>
>
> For the people who have shown interest in joining remotely: I added you to the repo of the summit [1]; feel free to open issues there on the topics you're interested in discussing. I also created a Gitter channel [2] that you can join.
>
> EuroSciPy doesn't currently have the budget to live-stream the session, but if we find a sponsor we'll do it, and also publish the recording on YouTube. Based on the experience with the European pandas summit, this seems unlikely.
>
> Cheers!
>
> 1. https://github.com/python-sprints/dataframe-summit
> 2. https://gitter.im/py-sprints/dataframe-summit
>
>
> On Wed, Jul 10, 2019 at 8:30 AM Pietro Battiston <me at pietrobattiston.it> wrote:
>
>
> Hi Marc,
>
> cool!
>
> I won't be able to attend Euroscipy, but if in the "Maintainers
> session" you plan to have a way to participate remotely, I'll
> definitely do.
>
> (I might be busy on the 6th instead... still don't know for sure)
>
> Pietro
>
> On Thu, 04/07/2019 at 15.45 +0100, Marc Garcia wrote:
>
> Hi there,
>
> Just to let you know that at EuroSciPy 2019 (in September in Spain)
> we will have a dataframe summit, to stay updated and coordinate among
> projects replicating the pandas API (other dataframe projects are
> more than welcome).
>
> Maintainers from all the main projects (pandas, dask, vaex, modin,
> cudf and koalas) will be attending. If you want to get involved
> (whether you can attend the conference or not), please DM me.
>
> More info: https://github.com/python-sprints/dataframe-summit
> Conference website: https://www.euroscipy.org/2019/
>
> Cheers!
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
>
