[Pandas-dev] [pydata] Pandas 2.0 Design Request: A more dplyr-like API

Wed Jun 14 11:30:23 EDT 2017

2017-06-14 17:24 GMT+02:00 Paul Hobson <pmhobson at gmail.com>:

> Just my 2 cents on indexes:
>
> Every time I think I'm done with them and don't need them any more, I get
> into some weird situation where a complex, nested, categorical index makes
> my life soooo much easier.
>
> I recognize that if the library and general community don't need them,
> they can represent a significant maintenance burden. But they saved my ass
> a couple of times this week.
>
>
Stephan mentioned some ideas to make the cases where you don't need them
easier (e.g., allowing a DataFrame to have no index), but there are no plans
to ditch indexes altogether. The linked issue speaks about "optional
indexes"; Stephan's wording in the mail below was maybe a bit misleading.

Joris


> -paul
>
> On Tue, Jun 13, 2017 at 5:03 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
>> Hi Chris,
>>
>> I think most of us agree with you. We've been slowly moving in this
>> direction (e.g., with .assign()) and hope to do more. For example, see our speculative
>> discussion <https://github.com/pandas-dev/pandas2/issues/17> concerning
>> getting rid of indexes for pandas2 and a proposal for allowing indexes
>> to be referenced by name
>> <https://github.com/pandas-dev/pandas/issues/8162>.
>>
>> There are a few major obstacles here:
>> 1. Coming up with concrete plans for how new APIs should work. This is
>> harder than just copying dplyr, because we don't have access to
>> non-standard evaluation in Python.
>> 2. Figuring out how to deprecate/replace existing behavior in a minimally
>> painful way, to minimize clutter of the pandas API. (Arguably, we already
>> have too many methods.)
>> 3. Actually implementing these changes in a consistent fashion in the
>> complex pandas codebase.
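
A minimal sketch of the first obstacle above, using a made-up DataFrame
(the column names `price` and `units` are hypothetical). dplyr's mutate()
can take bare column names because R supports non-standard evaluation; in
Python the closest options are a callable passed to .assign(), or an
expression passed as a string to DataFrame.eval().

```python
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "units": [10, 20, 30]})

# dplyr equivalent: mutate(df, revenue = price * units), with bare names.
# In pandas, the callable receives the DataFrame, so columns are looked up
# explicitly on its argument.
with_assign = df.assign(revenue=lambda d: d["price"] * d["units"])

# DataFrame.eval accepts the expression as a string, which is as close as
# Python gets to bare column names. It returns a new DataFrame by default.
with_eval = df.eval("revenue = price * units")
```

Both calls leave `df` unchanged and return a new DataFrame, so they can be
chained in the same way as dplyr verbs.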
>>
>> All of these are important work, but only the last item requires actually
>> writing code. Help would be appreciated on all of them.
>>
>> It's worth noting that some of this may actually be easier to do outside
>> of pandas proper. For example, Wes and Phil have been working on a pandas
>> backend to Ibis <https://github.com/ibis-project/ibis>.
>>
>> Best,
>> Stephan
>>
>> On Tue, Jun 13, 2017 at 3:48 PM, Chris Said <chris.said at gmail.com> wrote:
>>
>>> Hi Pandas developers,
>>>
>>> I want to start by thanking all of the pandas developers for the effort
>>> they've put into the project. So much of what you do is thankless, and I
>>> want you to know it is really appreciated. Pandas is a huge part of my
>>> day-to-day coding.
>>>
>>> Because I use it so much, I want to submit a request. I want somebody to
>>> #MakePandasMoreLikeDplyr. To me and to almost everyone else I've talked to
>>> who knows pandas and dplyr, this is more important than performance
>>> improvements and arguably more important than most of the goals in the pandas
>>> 2.0 design docs <https://pandas-dev.github.io/pandas2/>.
>>>
>>> I'm not an R guy. 95% of my work is done in pandas. But everyone I know
>>> who uses pandas is constantly having to google how to do things. In
>>> contrast, dplyr feels like coding at the speed of thought. In particular,
>>> the combination of groupby->{mutate, summarize} is incredibly natural
>>> <https://twitter.com/Chris_Said/status/715249097326768128>. It is so
>>> easy to create multiple named output columns from multiple input columns.
>>> That's because the definition of new columns, with reference to multiple
>>> input columns, is all done inside the call to mutate / summarize. With
>>> pandas, it's much more complicated and hard to remember
>>> <https://gist.github.com/TomAugspurger/37097cce6a3368a8dad7cf2c8a9e5e92>.
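
For readers unfamiliar with the pattern being described, here is one
possible pandas sketch of a dplyr-style group_by -> summarize, assuming a
made-up DataFrame (`store`, `price`, `units` and the helper `summarize` are
hypothetical names, and this is not necessarily the approach in the linked
gist). Each output column is named inside the helper and may draw on
several input columns.

```python
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "price": [1.0, 2.0, 3.0, 5.0],
    "units": [10, 20, 30, 40],
})

def summarize(group):
    # Each key becomes a named output column; a value may combine
    # several input columns of the group.
    return pd.Series({
        "revenue": (group["price"] * group["units"]).sum(),
        "mean_price": group["price"].mean(),
    })

# Select the input columns explicitly, then apply the helper per group.
summary = df.groupby("store")[["price", "units"]].apply(summarize)
```

The result has one row per store and the flat columns `revenue` and
`mean_price`, but the helper function and the pd.Series wrapper are the
extra ceremony the message is complaining about.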
>>>
>>> The new transform method in 0.20 gets us part of the way there. But
>>> instead of allowing users to name the output columns, it returns multi-indexed
>>> columns <https://twitter.com/Chris_Said/status/861245185002323968>,
>>> which for me and most other people I've talked to are unwanted
>>> <https://news.ycombinator.com/item?id=14548695>.
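
A short sketch of how multi-indexed columns arise, assuming the same kind
of made-up data (all column names are hypothetical). For brevity it uses
groupby(...).agg() with several functions per column rather than
transform; the resulting column layout is the same kind of MultiIndex. The
last step shows one common workaround, joining the levels into flat names.

```python
import pandas as pd

# Hypothetical example data, not taken from the thread.
df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "price": [1.0, 2.0, 3.0, 5.0],
    "units": [10, 20, 30, 40],
})

# Passing lists of functions produces two-level column labels:
# ("price", "mean"), ("price", "max"), ("units", "sum").
wide = df.groupby("store").agg({"price": ["mean", "max"], "units": ["sum"]})

# Workaround: collapse the two levels into single flat names.
flat = wide.copy()
flat.columns = ["_".join(pair) for pair in wide.columns]
```

After flattening, the columns are `price_mean`, `price_max` and
`units_sum`; the names are derived from the inputs rather than chosen by
the user, which is the limitation described above.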
>>>
>>> Thank you again for all your hard work. Just as a TL;DR: More like
>>> dplyr, less injection of multi-indexes. (Could they be eliminated entirely?)
>>>
>>> Best,
>>> Chris
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "PyData" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to pydata+unsubscribe at googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>