From jorisvandenbossche at gmail.com Fri Aug 5 11:29:19 2022 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 5 Aug 2022 17:29:19 +0200 Subject: [Pandas-dev] PDEPs: pandas enhancement proposals In-Reply-To: References: <94BC0B63-20D8-4341-A440-675DC9F82D4E@gmail.com> Message-ID: On Wed, 20 Jul 2022 at 16:26, Marc Garcia wrote: > Ok, correct me if I'm wrong, but from what you say the options to consider > are: > > 1) Keep everything as is (in the main pandas repo), and maybe improve > notifications (send emails to a list where people can subscribe, rss feed, > telegram messages...) > > 2) Use a new repo for PRs, and use the main pandas website to display them > 2.a) On the website build, fetch the PDEP docs from the other repo > 2.b) On the PDEP repo CI, push the PDEP docs to the main pandas repo > > 3) Use a new repo for the PDEP PRs, and have a separate website for PDEPs > > Do these options sound reasonable as the ones to discuss? Or am I > missing something? > > My preference is 1, as I think it's the simplest, and adding notifications > that allow following PDEPs separately from all pandas activity doesn't seem > complex. > > I'm also fine with 2.a if more people have a strong opinion about keeping > PDEP discussions/PRs in a separate repo. I personally don't see advantages > in 2.b and 3. > Thanks, that's indeed a good summary of the options. I also think 2.a is the easiest of the alternatives, so I would indeed only consider 1 and 2.a. My preference is to go with 2 (separate repo), for the reasons mentioned before. I think Tom also mentioned this as his preference, and Jeff is OK with it, while Marc/Matthew prefer the main repo. But it would be good to hear from others as well whether they have a (strong) preference. If we decide to do that, I am happy to do a PR to update the publishing workflow to handle a separate repo. Joris > > On Mon, Jul 18, 2022, 18:34 Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Thanks Marc for the detailed answer. >> In general, I personally think that the added complexity is not that big, >> and we can still have a nice publishing workflow to the website with a >> separate repo (some more detailed responses inline below). >> >> For someone who wants to follow the PDEPs (and I hope with this new PDEP >> process we can engage more people in the pandas community), but doesn't >> have the time to follow all of pandas (eg a maintainer of a dependent package, >> ..), my hunch is that a separate repo is a more accessible way to do this. >> You can indeed list all related PRs based on a label filter, but you >> still need to know this (we can of course document that on the roadmap >> page) and it's not an automatic notification. And for email notifications >> you can indeed set up an email filter (although I don't think you have a >> good option if using github notifications?). >> >> For someone like myself, if we end up using the main repo, I can for sure >> set up those filters, that is not a problem. But in general I think that is >> not a very accessible way to have people follow those discussions. Having >> it as a separate repo provides a clear home and gives you all the tools >> that github has to manage and customize the notifications however you want >> (eg watch one repo and not the other). >> >> Sidenote: I do (or did) this for other projects, such as numpy or python.
>> I don't follow either of their issue trackers, but I do (somewhat) follow >> NEP or PEP discussions, and both give me a way to do that without having to >> follow their main issue trackers. >> >> The last point that you raise about "forgetting about a separate repo" is >> certainly a valid concern. It's true that the other separate repos that we >> have (had) were no success, so we don't have a good track record on this >> front. But I do think it is a matter of habit (and >> documentation/communication! we never really publicized any of the other >> repos, neither actively used them at any point), and if we ensure we have a >> steady activity in such a separate repo for a while, I think that will grow >> naturally. >> >> On Sat, 25 Jun 2022 at 21:19, Matthew Roeschke >> wrote: >> >>> I find Marc's arguments regarding general simplicity of PDEP flow >>> (publishing to website & integration to the main repo) a strong argument to >>> keep these in the main repo. >>> >>> Since there is a dependency between PDEP development and the pandas-dev >>> repo development, having them separated may lead to similar workflow >>> challenges with the MacPython/pandas-wheels repo for example (where cibuildwheel >>> being integrated into the main repo >>> is considered a >>> benefit due to tighter integration). >>> >> >> I think an important difference here is that building wheels is defined >> in the pandas repo (packaging setup) and often needs fixes in pandas, and >> so here it indeed makes that workflow much easier to have that in the same >> repo. For PDEPs that is much less of an issue. >> >>> >>> I agree PDEP visibility from notifications is important, but >>> notification priority and channels can differ person-to-person. For >>> example, I just manage my GitHub notifications in GitHub, not email. >>> >>> I don't think there is fundamentally a difference between both. Also if >> I was using github notifications, seeing a specific subset of issues in >> those is challenging (while when using email I could at least set up some >> automatic filters). >> (but I don't know github notifications well, so I might be wrong) >> >> >>> On Sat, Jun 25, 2022 at 10:50 AM Tom Augspurger < >>> tom.w.augspurger at gmail.com> wrote: >>> >>>> For me, notifications are the big thing. Having the emails come from a >>>> separate repo would make following things much easier for those who can't >>>> keep up with the main repo. >>>> >>>> Tom >>>> >>>> On Jun 25, 2022, at 12:04 PM, Marc Garcia >>>> wrote: >>>> >>>> Thanks for the feedback. I understand your point about using a >>>> different repo, but I see several advantages in the current approach, so >>>> maybe it's worth discussing a bit further what the exact pain points are, to see >>>> if a separate repo is really the best solution. >>>> >>>> Let me know if I miss something, but I see three different ways in >>>> which we'll be interacting with PDEPs: >>>> >>>> a) Via their rendered version. Not sure if you checked it, but the >>>> current rendered page from the PDEP PR (attached) is equivalent to the home >>>> of the scikit-learn SLEP proposals [1]. The main difference is that with >>>> the current approach we have it integrated with the website, which I >>>> personally think is an advantage. >>>> >>>> I am assuming that also with a separate repo we will have an identical >> web page (which will be very useful!). >> >>> >>>> b) Via the list of PDEP PRs to review.
In this case, to see only PDEP >>>> PRs, if we use the main pandas repo, this is just a label filter [2]. To me >>>> personally it's quicker than having to go to another repo, but no big difference >>>> about one or the other. >>>> >>>> c) Notifications. I guess this is the main thing. I think one concern >>>> is that notifications from PDEPs get lost in the rest of the repo >>>> notifications. I assume you're using your email client filters, and if the >>>> notifications come from another repo, you can change the rules easily. I >>>> guess the solution here would be to use something like PDEP in the title >>>> and use that as a rule. Or we can try to find something more reliable, if >>>> that's the main concern. >>>> >>>> Personally, I don't see the advantages of having the proposals in a >>>> separate repo as very significant. And by keeping things the way they're >>>> implemented in the PR, I do see some advantages: >>>> - No need to maintain a separate repo, CI workflow, jobs to publish the >>>> build, sphinx (or equivalent) project... Nothing too complex, but why having >>>> to implement and maintain all that if our website is already prepared to >>>> handle it. And in particular, with Sphinx it is not as easy as with our >>>> website to fetch the open PRs and render them. >>>> - Integrated UX of the PDEPs into our website. I think this gives it >>>> more visibility, and a better user experience than having to jump from one >>>> website to another. >>>> >>>> I think it should certainly be possible to keep the website UX as you >> implemented with a separate repo as well. >> There are for sure multiple options, but one (maybe simplest) option >> would be to keep the publishing in the main repo as you have now (since the >> website publishing lives there): for example the separate repo could >> additionally be cloned in the website workflow, and then that content is >> available as well (requiring to change the path in the script a bit). >> >> The PDEP repo itself could further have only very limited CI? >> >> >>> - One of my concerns is that being in a separate repo we forget about >>>> them. We're used to checking PRs in the pandas repo, and we'll keep coming >>>> back to PRs about PDEPs until they're merged if they are in the main repo, >>>> but it feels like being in a separate repo makes it easier to forget them when there >>>> is no recent activity and notifications. >>>> >>>> It would be good to know if I missed any of your concerns. If I didn't, >>>> I'd say we can start with what's already implemented, which is almost ready >>>> to get merged, and if in the future you still think we can do better by >>>> using a separate repo, you can implement it, we have a discussion about it, >>>> and we move PDEPs to a separate repo if that makes sense. What do you think? >>>> >>>> Cheers, >>>> Marc >>>> >>>> 1. https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/ >>>> 2. >>>> https://github.com/pandas-dev/pandas/pulls?q=is%3Aopen+is%3Apr+label%3APDEP >>>> >>>> >>>> On Sat, Jun 25, 2022 at 7:05 AM Jeff Reback >>>> wrote: >>>> >>>>> +1 on using a separate repo (under pandas-dev) for this >>>>> >>>>> >>>>> On Jun 24, 2022, at 5:05 PM, Joris Van den Bossche < >>>>> jorisvandenbossche at gmail.com> wrote: >>>>> >>>>> Thanks for starting this proposal, Marc! >>>>> >>>>> I have already been doing this in some ad-hoc way with eg the >>>>> Copy/View proposal (writing an actual proposal document), so I am very much >>>>> in favor of formalizing this a bit more.
>>>>> >>>>> Personally, I would prefer that we use a more dedicated home for this >>>>> instead of using the existing pandas repo (e.g. a separate repo in the >>>>> pandas-dev org). The main pandas repo nowadays has such a high volume of >>>>> issue and PR comments that it becomes difficult to follow this or notice >>>>> specific issues. While there are certainly ways to deal with this (e.g. >>>>> consistently using a specific label and title, ensuring we always notify >>>>> the mailing list as well, ...), IMO it would make it more accessible to >>>>> follow and have an overview of those discussions in e.g. a separate repo. >>>>> >>>>> (there are examples of both in other projects, for example >>>>> scikit-learn has a separate repo, while numpy uses the main repo I think) >>>>> >>>>> Joris >>>>> >>>>> On Tue, 21 Jun 2022 at 09:46, Marc Garcia wrote: >>>>> >>>>>> We're in the process of implementing PDEPs, equivalent to Python's >>>>>> PEPs and NumPy's NEPs, but for pandas. This should help build the roadmap, >>>>>> make discussions more efficient, obtain more structured feedback from the >>>>>> community, and add visibility to agreed future plans for pandas. >>>>>> >>>>>> The initial implementation (workflow) is a bit simpler than PEP or >>>>>> NEP, but we'll iterate in the future as convenient. >>>>>> >>>>>> You can see the PR for PDEP-1 with the purpose, scope and guidelines >>>>>> here: https://github.com/pandas-dev/pandas/pull/47444 >>>>>> >>>>>> Feedback is very welcome. >>>>>> _______________________________________________ >>>>>> Pandas-dev mailing list >>>>>> Pandas-dev at python.org >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>>> >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>> >>>>> [image: Screenshot at 2022-06-25 22-20-50.png] >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> >>> >>> -- >>> Matthew Roeschke >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: Screenshot at 2022-06-25 22-20-50.png Type: image/png Size: 192689 bytes Desc: not available URL: From simonjayhawkins at gmail.com Fri Aug 5 11:41:45 2022 From: simonjayhawkins at gmail.com (Simon Hawkins) Date: Fri, 5 Aug 2022 16:41:45 +0100 Subject: [Pandas-dev] PDEPs: pandas enhancement proposals In-Reply-To: References: <94BC0B63-20D8-4341-A440-675DC9F82D4E@gmail.com> Message-ID: On Fri, 5 Aug 2022 at 16:29, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > > On Wed, 20 Jul 2022 at 16:26, Marc Garcia wrote: > >> Ok, correct me if I'm wrong, but from what you say the options to consider >> are: >> >> 1) Keep everything as is (in the main pandas repo), and maybe improve >> notifications (send emails to a list where people can subscribe, rss feed, >> telegram messages...) >> >> 2) Use a new repo for PRs, and use the main pandas website to display them >> 2.a) On the website build, fetch the PDEP docs from the other repo >> 2.b) On the PDEP repo CI, push the PDEP docs to the main pandas repo >> >> 3) Use a new repo for the PDEP PRs, and have a separate website for PDEPs >> >> Do these options sound reasonable as the ones to discuss? Or am I >> missing something? >> >> My preference is 1, as I think it's the simplest, and adding >> notifications that allow following PDEPs separately from all pandas activity >> doesn't seem complex. >> >> I'm also fine with 2.a if more people have a strong opinion about keeping >> PDEP discussions/PRs in a separate repo. I personally don't see advantages >> in 2.b and 3. >> > > Thanks, that's indeed a good summary of the options. I also think 2.a is > the easiest of the alternatives, so I would indeed only consider 1 and 2.a. > > My preference is to go with 2 (separate repo), for the reasons mentioned > before. I think Tom also mentioned this as his preference, and Jeff is OK > with it, while Marc/Matthew prefer the main repo. But it would be good > to hear from others as well whether they have a (strong) preference. > Maybe another quick poll with just those 2 options; it seemed to get a swift resolution on the PDEP name. If there is not a clear majority, then we would need further discussion. If there is a majority, then maybe only a few pain points to resolve. > > If we decide to do that, I am happy to do a PR to update the publishing > workflow to handle a separate repo. > > Joris > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed...
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screenshot at 2022-06-25 22-20-50.png Type: image/png Size: 192689 bytes Desc: not available URL: From jorisvandenbossche at gmail.com Tue Aug 9 17:15:22 2022 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 9 Aug 2022 23:15:22 +0200 Subject: [Pandas-dev] August 2022 monthly community meeting (Wednesday August 10, UTC 18:00) Message-ID: Hi all, A reminder that the next monthly dev call is tomorrow (Wednesday, August 10) at 18:00 UTC (1pm Central). Our calendar is at https://pandas.pydata.org/docs/development/meeting.html#calendar to check your local time. All are welcome to attend! Video Call: https://us06web.zoom.us/j/84484803210?pwd=TjUxNmcyNHcvcG9SNGJvbE53Y21GZz09 Minutes: https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From t.alberto at hotmail.com Wed Aug 10 21:43:45 2022 From: t.alberto at hotmail.com (Tiago Alberto) Date: Thu, 11 Aug 2022 01:43:45 +0000 Subject: [Pandas-dev] this is a thank you message Message-ID: I really dont know who u folks are, but pandas changed my life forever. Just a student of course, a beginner, but completely conscious about how f&*ing terrifying and extenous job u guys have been doing so far! Tnks for all of that folks... for real.. A Brazilian enthusiast... A., Tiago. -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Thu Aug 11 03:16:58 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 11 Aug 2022 14:16:58 +0700 Subject: [Pandas-dev] this is a thank you message In-Reply-To: References: Message-ID: Thanks for the kind words, Tiago. If you want to know a bit about who we are, you can have a look at the Team page on our website: https://pandas.pydata.org/about/team.html On Thu, Aug 11, 2022 at 8:47 AM Tiago Alberto wrote: > I really dont know who u folks are, but pandas changed my life forever. > > Just a student of course, a beginner, but completely conscious about how > f&*ing terrifying and extenous job u guys have been doing so far! > > Tnks for all of that folks... for real.. > > A Brazilian enthusiast... > A., Tiago. > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Thu Aug 18 00:02:15 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 18 Aug 2022 11:02:15 +0700 Subject: [Pandas-dev] Retention of pandas nightly builds Message-ID: Hi all, It seems that at the moment we're keeping the full history of pandas nightly wheels at https://anaconda.org/scipy-wheels-nightly and they are using around 100 GB of storage. Other projects are in general keeping the last 5 days of builds, or in some cases, like scikit-learn, they just keep the latest one. I assume there is no reason to keep the full history other than not yet having an automated way to delete old wheels. And I personally can't think of a use case that needs more than the last build, so if we add a CI job to delete old wheels, my preference would be to just keep overwriting the same files, which I think makes things simpler for us and for users. But maybe other people do have use cases for keeping the last few days of wheels.
Does anyone have the need for something like a 5 days retention of nightly pandas wheels? We'll start removing old wheels in couple of days, or once the discussion here has a conclusion. Cheers! -------------- next part -------------- An HTML attachment was scrubbed... URL: From tcaswell at gmail.com Fri Aug 19 15:23:07 2022 From: tcaswell at gmail.com (Thomas Caswell) Date: Fri, 19 Aug 2022 15:23:07 -0400 Subject: [Pandas-dev] Retention of pandas nightly builds In-Reply-To: References: Message-ID: If you use unique names for each upload it makes it possible for someone who wants to keep a longer running log to do so. It also lets people be able to do a little bit of bisecting if something breaks. There is progress in https://discuss.scientific-python.org/t/interest-in-github-action-for-scipy-wheels-nightly-uploads-and-removals/397 on developing a GHA to take care of the clean up. Tom On Thu, Aug 18, 2022 at 12:02 AM Marc Garcia wrote: > Hi all, > > Seems like at the moment we're keeping all history of pandas nightly > wheels at https://anaconda.org/scipy-wheels-nightly and they are using > around 100 Gb storage. > > Other projects are in general keeping the last 5 days of builds, or in > some cases like scikit-learn they just keep the last. > > I assume there is no reason to keep all history more than not having an > automated way to delete old wheels yet. And I personally can't think of a > use case to need more than the last build, so if we add a CI job to delete > old wheels, my preference would be to just keep overwritting the same > files, which I think makes things simpler for us and for users. But maybe > other people do have use cases for keeping the last few days of wheels. > > Does anyone have the need for something like a 5 days retention of nightly > pandas wheels? We'll start removing old wheels in couple of days, or once > the discussion here has a conclusion. > > Cheers! > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -- Thomas Caswell tcaswell at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Thu Aug 25 00:56:11 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 25 Aug 2022 11:56:11 +0700 Subject: [Pandas-dev] ANN: pandas 1.5.0 RC0 Message-ID: We are happy to announce the *release candidate* of pandas 1.5.0. It can be installed from our conda-forge and PyPI packages via mamba, conda and pip, for example: mamba install -c conda-forge/label/pandas_rc pandas==1.5.0rc0 python -m pip install --upgrade --pre pandas==1.5.0rc0 Users having pandas code in production and maintainers of libraries with pandas as a dependency are *strongly* recommended to run their test suites with the release candidate, and report any breaking change to our issue tracker before the official 1.5.0 release. You can find the documentation of pandas 1.5.0 here , and the list of changes in 1.5.0, in the release notes page . We expect to release the final version of pandas 1.5.0 on September 7th, but the final date will depend on the issues reported to the release candidate. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From garcia.marc at gmail.com Thu Aug 25 06:18:44 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 25 Aug 2022 17:18:44 +0700 Subject: [Pandas-dev] Plans of future PDEPs Message-ID: I'm planning to work on a few PDEPs, and I think it's probably worth sharing the general ideas in advance, in case anyone has early feedback, wants to coordinate, or it's simply useful for others to know before the draft PDEPs are ready. I'm not intending to open the discussion on the details here; it probably makes more sense to discuss those in the draft PDEPs when they are ready. *Governance*: We discussed defining exactly how a PDEP is approved while implementing them, and it's probably useful to clarify the project decision making in general. And also, I want to propose the creation of some workgroups within pandas to be more efficient at managing things. Define their roles, how members are selected... Besides the existing CoC and NumFOCUS (finances) workgroups, I think it could be useful to have workgroups for the infrastructure (manage access to our servers, the github org, the distribution lists...), for the communications (tweeting, blogging, coordinate with NumFOCUS and other projects...). Maybe a release workgroup, and a steering committee if people find them useful... *Release*: I think it would be useful to have a PDEP detailing the release process. Similar to our policies doc, but being more detailed and explicit on when exactly releases happen, and what the criteria are. Not necessarily changing the current release policy, but I think we can better manage expectations of both users and developers, and be more efficient in the discussions if there is less ambiguity on when releases are going to happen. *IO modules as plugins*: The idea would be to create a framework to extend pandas with IO functions, and move some of the current IO adaptors (e.g. stata, spss, feather...) to third-party projects. I think this would encourage people to build more pandas IO packages (for new formats, or improved versions of the existing ones), it would make these modules better maintained (easier to become a maintainer, maintainers who are experts in the formats...), and it would reduce the complexity of pandas itself and the CI, significantly reduce the number of optional dependencies, and offer users a more natural way to install things (install pandas-sas, as opposed to having read_sas fail and having to install its dependencies). *Duplicate columns*: pandas currently supports this, but it's not clear if there are many (or any) use cases where this is helpful, and this clearly makes pandas more complex, in its internals, and in its usability (e.g. almost all users would expect `df['my_col']` to return a Series, but that's not necessarily true, since there can be many columns "my_col", and it could return a DataFrame). I think it's useful to create a PDEP with advantages and actual use cases of having duplicate columns, and its cost in terms of problems and extra complexity. And make a decision on whether we want to continue supporting it. Cheers, -------------- next part -------------- An HTML attachment was scrubbed... URL:
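As a quick illustration of the duplicate-column behaviour described in the message above: selecting a duplicated label returns a DataFrame rather than a Series. A minimal sketch (the column names here are only illustrative):

import pandas as pd

# Two columns share the label "my_col"; "other" is a unique label.
df = pd.DataFrame([[1, 2, 3]], columns=["my_col", "my_col", "other"])

print(type(df["my_col"]))  # <class 'pandas.core.frame.DataFrame'> -- both "my_col" columns are returned
print(type(df["other"]))   # <class 'pandas.core.series.Series'>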
From jeffreback at gmail.com Thu Aug 25 10:27:02 2022 From: jeffreback at gmail.com (Jeff Reback) Date: Thu, 25 Aug 2022 10:27:02 -0400 Subject: [Pandas-dev] Plans of future PDEPs In-Reply-To: References: Message-ID: <42CAB628-F6A9-4526-8689-B5215AF42B70@gmail.com> Duplicate columns are a fact of life and we already have a mechanism for handling these; I am not sure you will be able to reduce much here at all - clarifying the mechanisms is probably ok. We have discussed an IO plugin mechanism previously and I am supportive of an API to allow external development. We could and should have the current IO adapters use this - but I don't believe we can achieve much of the benefits that you list - in fact complexity is likely to go up on CI. In any event, certainly the PDEPs are the place to discuss this. > On Aug 25, 2022, at 6:19 AM, Marc Garcia wrote: > > I'm planning to work on a few PDEPs, and I think it's probably worth sharing the general ideas in advance, in case anyone has early feedback, wants to coordinate, or it's simply useful for others to know before the draft PDEPs are ready. I'm not intending to open the discussion on the details here; it probably makes more sense to discuss those in the draft PDEPs when they are ready. > > Governance: We discussed defining exactly how a PDEP is approved while implementing them, and it's probably useful to clarify the project decision making in general. And also, I want to propose the creation of some workgroups within pandas to be more efficient at managing things. Define their roles, how members are selected... Besides the existing CoC and NumFOCUS (finances) workgroups, I think it could be useful to have workgroups for the infrastructure (manage access to our servers, the github org, the distribution lists...), for the communications (tweeting, blogging, coordinate with NumFOCUS and other projects...). Maybe a release workgroup, and a steering committee if people find them useful... > > Release: I think it would be useful to have a PDEP detailing the release process. Similar to our policies doc, but being more detailed and explicit on when exactly releases happen, and what the criteria are. Not necessarily changing the current release policy, but I think we can better manage expectations of both users and developers, and be more efficient in the discussions if there is less ambiguity on when releases are going to happen. > > IO modules as plugins: The idea would be to create a framework to extend pandas with IO functions, and move some of the current IO adaptors (e.g. stata, spss, feather...) to third-party projects. I think this would encourage people to build more pandas IO packages (for new formats, or improved versions of the existing ones), it would make these modules better maintained (easier to become a maintainer, maintainers who are experts in the formats...), and would reduce the complexity of pandas itself and the CI, significantly reduce the number of optional dependencies, and offer users a more natural way to install things (install pandas-sas, as opposed to having read_sas fail and having to install its dependencies). > > Duplicate columns: pandas currently supports this, but it's not clear if there are many (or any) use cases where this is helpful, and this clearly makes pandas more complex, in its internals, and in its usability (e.g. almost all users would expect `df['my_col']` to return a Series, but that's not necessarily true, since there can be many columns "my_col", and it could return a DataFrame).
I think it's useful to create a PDEP with advantages and actual use cases of having duplicate columns, and its cost in terms of problems and extra complexity. And make a decision on whether we want to continue supporting it. > > Cheers, > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhshadrach at gmail.com Fri Aug 26 15:16:01 2022 From: rhshadrach at gmail.com (Richard Shadrach) Date: Fri, 26 Aug 2022 15:16:01 -0400 Subject: [Pandas-dev] pandas ASVs Message-ID: I noticed a bit ago that the ASVs on the pandas webpage were no longer being updated. https://pandas.pydata.org/speed/pandas/ Tom shared with me his setup and gave me permissions to https://github.com/asv-runner (thanks Tom!). I now have a machine dedicated to running the ASVs and updating the site here: https://asv-runner.github.io/asv-collection/pandas/ I plan to be going through the regressions identified there and notifying PRs as necessary; manually at first but then looking into a more automated solution. I also plan to port Tom's work in asv-runner over to Docker. >From Marc, it sounds like there might be some changes coming to the webserver, so I think it's best to wait until that dust settles before getting the ASVs on the pandas webpage to auto-update. Best, Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Fri Aug 26 22:21:37 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Sat, 27 Aug 2022 09:21:37 +0700 Subject: [Pandas-dev] pandas ASVs In-Reply-To: References: Message-ID: Thanks Richard, really nice work on this front. It took longer than expected, but I expect we should sign the new partnership to get free hardware for pandas very soon now, probably early next week. Only the final signatures are now missing. With the new hosting, we will be moving our website as you say (the old server is very expensive for what it is, and the new is free). But also we can get dedicated hardware for benchmarks (and probably other things if needed). What I'd like to try is to set up a dedicated server as a github actions worker, and set up a job in our CI with cirun to run benchmarks. Then test two things: - Timing: see if it could be feasible to run our benchmarks at every commit like the rest of the CI. If this is possible, I think this would keep things very simple, and catch all significant regressions easily before they happen (also run for commits to main for the asv history) - Stability: run the benchmarks for the same exact version of pandas many times, and analyze the variance, understand well what's the noise in the results, and see if we can take actions to reduce it to a minimum (fine tunning OS settings...) I don't know if a dedicated server as a githib actions worker with cirun will be the best option, but I think it's worth trying the above, as it should keep things simple and effective in my opinion. But we can surely try other things and discuss other ideas, there may be better options. On Sat, Aug 27, 2022, 02:16 Richard Shadrach wrote: > I noticed a bit ago that the ASVs on the pandas webpage were no longer > being updated. > > https://pandas.pydata.org/speed/pandas/ > > Tom shared with me his setup and gave me permissions to > https://github.com/asv-runner (thanks Tom!). 
I now have a machine > dedicated to running the ASVs and updating the site here: > > https://asv-runner.github.io/asv-collection/pandas/ > > I plan to be going through the regressions identified there and > notifying PRs as necessary; manually at first but then looking into a more > automated solution. I also plan to port Tom's work in asv-runner over to > Docker. > > From Marc, it sounds like there might be some changes coming to the > webserver, so I think it's best to wait until that dust settles before > getting the ASVs on the pandas webpage to auto-update. > > Best, > Richard > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nsh531 at gmail.com Thu Aug 25 11:01:42 2022 From: nsh531 at gmail.com (=?UTF-8?B?16DXqteZINep15jXqNef?=) Date: Thu, 25 Aug 2022 18:01:42 +0300 Subject: [Pandas-dev] [CrowdStrike/falconpy] Why I recieved this output: (Discussion #760)+pandas DataFrame error In-Reply-To: References: Message-ID: 2022-08-25 17:53 GMT?+03:00?, ??? ???? : > Someone here is understanding at pandas? > > ?????? ??? ?????, 25 ??????? 2022, ??? Joshua Hiller < > notifications at github.com>: > >> This is pretty far out of scope, but reviewing the error message in your >> IDE it appears you are passing scalars to the Dataframe. Since there is >> no >> index, you're given an error. Googling about, several folks who hit this >> error pass the keyword index=[0] and provide a dummy index. Depending on >> your goals, this may not be the correct solution. >> >> ? >> Reply to this email directly, view it on GitHub >> , >> or unsubscribe >> >> . >> You are receiving this because you authored the thread.Message ID: >> >> > > > -- > > -- From nsh531 at gmail.com Thu Aug 25 11:04:43 2022 From: nsh531 at gmail.com (=?UTF-8?B?16DXqteZINep15jXqNef?=) Date: Thu, 25 Aug 2022 18:04:43 +0300 Subject: [Pandas-dev] Fwd: [CrowdStrike/falconpy] Why I recieved this output: (Discussion #760)+pandas DataFrame error In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: ??? ???? Date: Thu, 25 Aug 2022 18:01:42 +0300 Subject: Re: [CrowdStrike/falconpy] Why I recieved this output: (Discussion #760)+pandas DataFrame error To: CrowdStrike/falconpy , python-list at python.org Cc: CrowdStrike/falconpy , Author 2022-08-25 17:53 GMT?+03:00?, ??? ???? : > Someone here is understanding at pandas? > > ?????? ??? ?????, 25 ??????? 2022, ??? Joshua Hiller < > notifications at github.com>: > >> This is pretty far out of scope, but reviewing the error message in your >> IDE it appears you are passing scalars to the Dataframe. Since there is >> no >> index, you're given an error. Googling about, several folks who hit this >> error pass the keyword index=[0] and provide a dummy index. Depending on >> your goals, this may not be the correct solution. >> >> ? >> Reply to this email directly, view it on GitHub >> , >> or unsubscribe >> >> . >> You are receiving this because you authored the thread.Message ID: >> >> > > > -- > > -- -- From nsh531 at gmail.com Thu Aug 25 11:51:26 2022 From: nsh531 at gmail.com (=?UTF-8?B?16DXqteZINep15jXqNef?=) Date: Thu, 25 Aug 2022 18:51:26 +0300 Subject: [Pandas-dev] [CrowdStrike/falconpy] Why I recieved this output: (Discussion #760) In-Reply-To: References: Message-ID: ?????? ??? ?????, 25 ??????? 2022, ??? 
Joshua Hiller < notifications at github.com>: > Hi @NSH531 - > > If passing an index doesn't help, you can probably get more assistance > from the Pandas issue board located here: https://github.com/pandas-dev/ > pandas/issues > > ? > Reply to this email directly, view it on GitHub > , > or unsubscribe > > . > You are receiving this because you were mentioned.Message ID: > > -- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhshadrach at gmail.com Tue Aug 30 16:59:17 2022 From: rhshadrach at gmail.com (Richard Shadrach) Date: Tue, 30 Aug 2022 16:59:17 -0400 Subject: [Pandas-dev] pandas ASVs In-Reply-To: References: Message-ID: Thanks Marc! It would be 4-6 hours on the CI to run on each PR; we currently spend ~27 hours so it's not too huge of an increase. I haven't seen it get backed up since we got the increase in workers from GitHub (but also haven't paid too much attention). I can put up a PR trying this out assuming I hear no objections in a few weeks. I do agree this would be beneficial, but there are a few downsides, minor in my opinion. - As you indicated, there will be false positives. How detrimental they are remains to be seen. I suspect that many will be obvious (e.g. a PR touching groupby shows a regression in excel code). - You don't get to see history, in particular if a function has several minor degradations in performance that add up. - You don't get to see historical variability that can help in determining false positives, e.g. https://asv-runner.github.io/asv-collection/pandas/#arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex If we're able to add a GHA to run the benchmarks on every PR, then I think we could also run them on every commit to main and upload the results somewhere - effectively having a benchmark machine but run in GitHub. As I believe you're thinking, you'll want a consistent machine for this. If we could get that, it would eliminate the latter two bullet points. Best, Richard On Fri, Aug 26, 2022 at 10:21 PM Marc Garcia wrote: > Thanks Richard, really nice work on this front. > > It took longer than expected, but I expect we should sign the new > partnership to get free hardware for pandas very soon now, probably early > next week. Only the final signatures are now missing. > > With the new hosting, we will be moving our website as you say (the old > server is very expensive for what it is, and the new is free). But also we > can get dedicated hardware for benchmarks (and probably other things if > needed). > > What I'd like to try is to set up a dedicated server as a github actions > worker, and set up a job in our CI with cirun to run benchmarks. Then test > two things: > > - Timing: see if it could be feasible to run our benchmarks at every > commit like the rest of the CI. If this is possible, I think this would > keep things very simple, and catch all significant regressions easily > before they happen (also run for commits to main for the asv history) > > - Stability: run the benchmarks for the same exact version of pandas many > times, and analyze the variance, understand well what's the noise in the > results, and see if we can take actions to reduce it to a minimum (fine > tunning OS settings...) > > I don't know if a dedicated server as a githib actions worker with cirun > will be the best option, but I think it's worth trying the above, as it > should keep things simple and effective in my opinion. But we can surely > try other things and discuss other ideas, there may be better options. 
> > On Sat, Aug 27, 2022, 02:16 Richard Shadrach wrote: > >> I noticed a bit ago that the ASVs on the pandas webpage were no longer >> being updated. >> >> https://pandas.pydata.org/speed/pandas/ >> >> Tom shared with me his setup and gave me permissions to >> https://github.com/asv-runner (thanks Tom!). I now have a machine >> dedicated to running the ASVs and updating the site here: >> >> https://asv-runner.github.io/asv-collection/pandas/ >> >> I plan to be going through the regressions identified there and >> notifying PRs as necessary; manually at first but then looking into a more >> automated solution. I also plan to port Tom's work in asv-runner over to >> Docker. >> >> From Marc, it sounds like there might be some changes coming to the >> webserver, so I think it's best to wait until that dust settles before >> getting the ASVs on the pandas webpage to auto-update. >> >> Best, >> Richard >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Tue Aug 30 18:03:39 2022 From: garcia.marc at gmail.com (Marc Garcia) Date: Wed, 31 Aug 2022 00:03:39 +0200 Subject: [Pandas-dev] pandas ASVs In-Reply-To: References: Message-ID: I don't think I was very clear, but yes, I have in mind the two use cases: - For PRs compare the run before and after and flag when there are regressions - For commits to main, keep the history with consistent hardware One of the main variables to know how much hardware this requires is how many times each benchmark needs to run. At the moments for PRs we already run benchmarks, with just one call each, and no problem with out CI. If we can keep the number very slow with CPU isolation and other noise reduction techniques, I think the above should be feasible. It's not clear to me how many runs asv does by default. I still don't have the dedicated server I'm expecting for the benchmarks, but we should have it in the next few days. I'll ping you when I have it, so we can do some tests. Depending on the hardware we get, the timings will also change significantly (I think the new benchmarks server should be significantly faster than the old one, so job times should ve much lower). On Tue, Aug 30, 2022, 22:59 Richard Shadrach wrote: > Thanks Marc! It would be 4-6 hours on the CI to run on each PR; we > currently spend ~27 hours so it's not too huge of an increase. I haven't > seen it get backed up since we got the increase in workers from GitHub (but > also haven't paid too much attention). I can put up a PR trying this out > assuming I hear no objections in a few weeks. I do agree this would be > beneficial, but there are a few downsides, minor in my opinion. > > - As you indicated, there will be false positives. How detrimental they > are remains to be seen. I suspect that many will be obvious (e.g. a PR > touching groupby shows a regression in excel code). > - You don't get to see history, in particular if a function has several > minor degradations in performance that add up. > - You don't get to see historical variability that can help in > determining false positives, e.g. 
> https://asv-runner.github.io/asv-collection/pandas/#arithmetic.BinaryOpsMultiIndex.time_binary_op_multiindex > > If we're able to add a GHA to run the benchmarks on every PR, then I think > we could also run them on every commit to main and upload the results > somewhere - effectively having a benchmark machine but run in GitHub. As I > believe you're thinking, you'll want a consistent machine for this. If we > could get that, it would eliminate the latter two bullet points. > > Best, > Richard > > On Fri, Aug 26, 2022 at 10:21 PM Marc Garcia > wrote: > >> Thanks Richard, really nice work on this front. >> >> It took longer than expected, but I expect we should sign the new >> partnership to get free hardware for pandas very soon now, probably early >> next week. Only the final signatures are now missing. >> >> With the new hosting, we will be moving our website as you say (the old >> server is very expensive for what it is, and the new is free). But also we >> can get dedicated hardware for benchmarks (and probably other things if >> needed). >> >> What I'd like to try is to set up a dedicated server as a github actions >> worker, and set up a job in our CI with cirun to run benchmarks. Then test >> two things: >> >> - Timing: see if it could be feasible to run our benchmarks at every >> commit like the rest of the CI. If this is possible, I think this would >> keep things very simple, and catch all significant regressions easily >> before they happen (also run for commits to main for the asv history) >> >> - Stability: run the benchmarks for the same exact version of pandas many >> times, and analyze the variance, understand well what's the noise in the >> results, and see if we can take actions to reduce it to a minimum (fine >> tunning OS settings...) >> >> I don't know if a dedicated server as a githib actions worker with cirun >> will be the best option, but I think it's worth trying the above, as it >> should keep things simple and effective in my opinion. But we can surely >> try other things and discuss other ideas, there may be better options. >> >> On Sat, Aug 27, 2022, 02:16 Richard Shadrach >> wrote: >> >>> I noticed a bit ago that the ASVs on the pandas webpage were no longer >>> being updated. >>> >>> https://pandas.pydata.org/speed/pandas/ >>> >>> Tom shared with me his setup and gave me permissions to >>> https://github.com/asv-runner (thanks Tom!). I now have a machine >>> dedicated to running the ASVs and updating the site here: >>> >>> https://asv-runner.github.io/asv-collection/pandas/ >>> >>> I plan to be going through the regressions identified there and >>> notifying PRs as necessary; manually at first but then looking into a more >>> automated solution. I also plan to port Tom's work in asv-runner over to >>> Docker. >>> >>> From Marc, it sounds like there might be some changes coming to the >>> webserver, so I think it's best to wait until that dust settles before >>> getting the ASVs on the pandas webpage to auto-update. >>> >>> Best, >>> Richard >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL:
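For reference, the per-PR before/after comparison discussed in this thread is what asv's continuous command does locally: it benchmarks two revisions and reports the benchmarks whose timings changed by more than a given factor. A minimal sketch, assuming it is run from the asv_bench directory of a pandas checkout and that upstream/main points at the main pandas branch (the groupby filter is only an example):

# Compare the current branch (HEAD) against upstream/main and flag
# benchmarks that changed by more than 10%, restricted to groupby benchmarks.
asv continuous -f 1.1 upstream/main HEAD -b ^groupby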