From emailformattr at gmail.com Tue Jun 4 01:58:59 2019 From: emailformattr at gmail.com (Matthew Roeschke) Date: Mon, 3 Jun 2019 22:58:59 -0700 Subject: [Pandas-dev] Pandas dev sprint June 27-30 @ Nashville In-Reply-To: References: Message-ID: Wes, any tips where in Nashville we should book accommodations or generally where the co-working space would be? I'm looking to book my accommodations fairly soon. Thanks. On Thu, May 30, 2019 at 10:57 AM Wes McKinney wrote: > @Joris if you have a complete headcount for the meeting please let me > know so I can work on securing a space for us to work. Just to confirm > it's the 27th through the 30th inclusive, so 4 full days of workspace > required? > > Thanks > > On Sat, May 25, 2019 at 3:44 PM Chang She wrote: > > > > Oh this was more me just carving time out as a forcing function for > myself. I'll be +13 hours ahead, so certainly not expecting video > conferences. Code reviews would be appreciated. > > > > As for the read_parquet item, I will either attach to an existing issue > or open a new one so discussion can happen on GitHub. > > > > Thanks. > > > > On Saturday, May 25, 2019, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> > >> You are certainly welcome to sprint those days as well. I just can't > promise that we can do much to improve remote participation such as > video meetings (but there should be a bunch of core devs active, which > might give faster feedback on PRs if needed). > >> > >> On Tue, 14 May 2019 at 22:59, Chang She wrote: > >>> > >>> I'll be out of the country during those dates. Can I still join in > remotely? > >>> > >>> Here's what I'd be interested in working on if there's appetite for > these to be part of pandas and friends: > >>> > >>> 1. A stale PR on Series.explode I haven't had any time to finish up ( > https://github.com/pandas-dev/pandas/pull/24366).
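For context when reading this thread later: the Series.explode functionality from that PR did eventually ship in pandas 0.25. A minimal sketch of the released behavior (not the PR's internals), assuming pandas >= 0.25:

```python
import pandas as pd

# Series.explode turns each element of a list-like entry into its own
# row, repeating the original index label; empty list-likes become NaN.
s = pd.Series([[1, 2], [3], []])
exploded = s.explode()
# exploded has values 1, 2, 3, NaN with index 0, 0, 1, 2
```

Scalar entries are passed through unchanged, and the result has object dtype.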
> >> > >> I think there is still certainly interest in such a function (I have > regularly needed something like that myself). > >> > >>> > >>> 2. Open sourcing an improvement to the pandas-redshift connector that > speeds up the ingestion of medium amounts of data using a combination of > unload + read_csv + multiprocessing. > >> > >> > >> That reminds me: it might be good to mention the pandas-redshift > packages somewhere in the docs (in the ecosystem page or in the sql docs). > >> > >>> > >>> 3. A minor improvement to allow read_parquet to work with globs > directly. This makes it a lot easier for pandas to read parquet generated > by Spark. > >>> > >> Do you know if there is already an open issue about this? > >> > >> Best, > >> Joris > >> > >>> > >>> > >>> > >>> On Tue, May 14, 2019 at 4:50 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>>> > >>>> Dear all, > >>>> > >>>> We are planning to do a pandas sprint at the end of June in Nashville > (Tennessee, USA): June 27-30. We will be meeting with some of the core devs > (so not a sprint to jump-start newcomers in this case), but sending this to > the mailing list to invite other pandas (or related libraries) contributors. > >>>> The exact planning of the sprint still needs to be discussed, but we > will probably be hacking and discussing on pandas, extension arrays, next > versions of pandas, etc. > >>>> > >>>> So if you are interested, let me know! We want to keep the > number of participants somewhat limited, and also need to plan the location > and funding, so please state your interest before May 30. > >>>> If you would like to participate, but are not sure if you would fit at > such a sprint, don't hesitate to mail me personally.
> >>>> > >>>> Best, > >>>> Joris > >>>> > >>>> > >>>> _______________________________________________ > >>>> Pandas-dev mailing list > >>>> Pandas-dev at python.org > >>>> https://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -- Matthew Roeschke -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jun 4 08:50:35 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 4 Jun 2019 07:50:35 -0500 Subject: [Pandas-dev] Pandas dev sprint June 27-30 @ Nashville In-Reply-To: References: Message-ID: Yes, I'd recommend within reach of East Nashville or Downtown Nashville. The earlier you can book accommodations the better! I don't have an accurate headcount yet so I haven't been able to arrange the location to meet yet. On Tue, Jun 4, 2019 at 12:59 AM Matthew Roeschke wrote: > > Wes, any tips where in Nashville we should book accommodations or generally where the co-working space would be? I'm looking to book my accommodations fairly soon. > > Thanks. > > On Thu, May 30, 2019 at 10:57 AM Wes McKinney wrote: >> >> @Joris if you have a complete headcount for the meeting please let me >> know so I can work on securing a space for us to work. Just to confirm >> it's the 27th through the 30th inclusive, so 4 full days of workspace >> required? >> >> Thanks >> >> On Sat, May 25, 2019 at 3:44 PM Chang She wrote: >> > >> > Oh this was more me just carving time out as a forcing function for myself. I'll be +13 hours ahead, so certainly not expecting video conferences. Code reviews would be appreciated.
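As a side note for anyone wanting the glob behavior discussed in this thread before it lands in read_parquet itself: a small helper can expand the pattern and concatenate the per-file frames. This is only a hedged sketch, not the proposed patch; the helper name and its `reader` parameter are invented for illustration. It is demonstrated with `pd.read_csv` so it runs without a parquet engine, but passing `pd.read_parquet` covers the Spark part-file case:

```python
import glob
import os
import tempfile

import pandas as pd


def read_glob(pattern, reader=pd.read_csv, **kwargs):
    """Expand a glob pattern and concatenate the matching files.

    `reader` can be any per-file loader; with a parquet engine
    installed, pass pd.read_parquet to read e.g. a directory of
    part files written by Spark.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return pd.concat((reader(p, **kwargs) for p in paths), ignore_index=True)


# Demo with CSV part files so the sketch runs without a parquet engine.
tmpdir = tempfile.mkdtemp()
for i, body in enumerate(["a,b\n1,2\n", "a,b\n3,4\n"]):
    with open(os.path.join(tmpdir, f"part-{i}.csv"), "w") as f:
        f.write(body)

df = read_glob(os.path.join(tmpdir, "part-*.csv"))
```

`read_glob("data/part-*.parquet", reader=pd.read_parquet)` would then read a Spark output directory, assuming pyarrow or fastparquet is installed.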
>> > >> > As for the read_parquet item, I will either attach to an existing issue or open a new one so discussion can happen on github. >> > >> > Thanks. >> > >> > On Saturday, May 25, 2019, Joris Van den Bossche wrote: >> >> >> >> You are certainly welcome to sprint those days as well. I only can't promise that we can do any things to improve remote participation such as video meetings (but there should be a bunch of core devs active, which might give faster feedback on PRs if needed). >> >> >> >> Op di 14 mei 2019 om 22:59 schreef Chang She : >> >>> >> >>> I'll be out of the country during those dates. Can I still join in remotely? >> >>> >> >>> Here's what I'd be interested in working on if there's appetite for these to be part of pandas and friends: >> >>> >> >>> 1. A stale PR on Series.explode I haven't had any time to finish up (https://github.com/pandas-dev/pandas/pull/24366). >> >> >> >> >> >> I think there is still certainly interest in such a function (I have regularly needed something like that myself). >> >> >> >>> >> >>> 2. Open sourcing an improvement to the pandas-redshift connector that speeds up the ingestion of medium amounts of data using a combination of unload + read_csv + multiprocessing. >> >> >> >> >> >> That reminds me: it might be good to mention the pandas-redshift packages somewhere in the docs (in the ecosystem page or in the sql docs). >> >> >> >>> >> >>> 3. A minor improvement to allow read_parquet to work with globs directly. This makes it a lot easier for pandas to read parquet generated by Spark. >> >>> >> >> Do you know if there is already an open issue about this? >> >> >> >> Best, >> >> Joris >> >> >> >>> >> >>> >> >>> >> >>> On Tue, May 14, 2019 at 4:50 AM Joris Van den Bossche wrote: >> >>>> >> >>>> Dear all, >> >>>> >> >>>> We are planning to do a pandas sprint end of June in Nashville (Tennessee, USA): June 27-30. 
We will be meeting with some of the core devs (so not a sprint to jump-start newcomers in this case), but sending this to the mailing list to invite other pandas (or related libraries) contributors. >> >>>> The exact planning of the sprint still needs to be discussed, but we will probably be hacking and discussing on pandas, extension arrays, next versions of pandas, etc. >> >>>> >> >>>> So if you are interested, let me know something! We want to keep the number of participants somewhat limited, and also need to plan the location and funding, so please state your interest before May 30. >> >>>> If you would like to participate, but not sure if you would fit at such a sprint, don't hesitate to mail me personally. >> >>>> >> >>>> Best, >> >>>> Joris >> >>>> >> >>>> >> >>>> _______________________________________________ >> >>>> Pandas-dev mailing list >> >>>> Pandas-dev at python.org >> >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > -- > Matthew Roeschke > From jorisvandenbossche at gmail.com Fri Jun 7 12:53:24 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 7 Jun 2019 18:53:24 +0200 Subject: [Pandas-dev] Tidelift Message-ID: Hi all, We discussed this on the last dev chat, but putting it on the mailing list for those who were not present: we are planning to contact Tidelift to enter into a sponsor agreement for Pandas. The idea is to follow what NumPy (and recently also Scipy) did to have an agreement between Tidelift and NumFOCUS instead of an individual maintainer (see their announcement mail: https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html). 
A blog post with an overview of Tidelift: https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift. We haven't yet discussed what to do specifically with those funds; that is still to be decided. Cheers, Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From william.ayd at icloud.com Sat Jun 8 16:54:08 2019 From: william.ayd at icloud.com (William Ayd) Date: Sat, 8 Jun 2019 16:54:08 -0400 Subject: [Pandas-dev] Tidelift In-Reply-To: References: Message-ID: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> What is the minimum amount we are asking for? The $1,000 a month for NumPy seems rather low, and I thought previous emails had something in the range of $3k a month. I don't think we necessarily need, or would be that much improved by, $12k per year, so I would rather aim higher if we are going to do this. > On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche wrote: > > Hi all, > > We discussed this on the last dev chat, but putting it on the mailing list for those who were not present: we are planning to contact Tidelift to enter into a sponsor agreement for Pandas. > > The idea is to follow what NumPy (and recently also Scipy) did to have an agreement between Tidelift and NumFOCUS instead of an individual maintainer (see their announcement mail: https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html ). > Blog with overview about Tidelift: https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift . > > We didn't discuss yet what to do specifically with those funds, that should still be discussed in the future. > > Cheers, > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed...
URL: From andres.gutierrez.arcia at gmail.com Mon Jun 10 13:23:07 2019 From: andres.gutierrez.arcia at gmail.com (Andrés Gutiérrez) Date: Mon, 10 Jun 2019 11:23:07 -0600 Subject: [Pandas-dev] Panel on 0.25 Message-ID: Hi, what are the plans for Panel objects for the next release, pandas 0.25? Are you going to keep it or definitely drop it? If so, are you going to move it to a separate Python package? Thanks! -- *| Andrés Gutiérrez Arcia* *| andres.gutierrez.arcia at gmail.com * | https://github.com/andrsGutirrz | https://www.linkedin.com/in/andres-gutierrez-arcia -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Mon Jun 10 14:01:31 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 10 Jun 2019 13:01:31 -0500 Subject: [Pandas-dev] Pandas dev sprint June 27-30 @ Nashville In-Reply-To: References: Message-ID: To confirm, the location for the sprints is in Downtown Nashville near the state capitol building. Specific details about the location will be shared non-publicly with sprint participants. On Tue, Jun 4, 2019 at 7:50 AM Wes McKinney wrote: > > Yes, I'd recommend within reach of East Nashville or Downtown > Nashville. The earlier you can book accommodations the better! > > I don't have an accurate headcount yet so I haven't been able to > arrange the location to meet yet. > > On Tue, Jun 4, 2019 at 12:59 AM Matthew Roeschke > wrote: > > > > Wes, any tips where in Nashville we should book accommodations or > generally where the co-working space would be? I'm looking to book my > accommodations fairly soon. > > > > Thanks. > > > > On Thu, May 30, 2019 at 10:57 AM Wes McKinney wrote: > >> > >> @Joris if you have a complete headcount for the meeting please let me > >> know so I can work on securing a space for us to work. Just to confirm > >> it's the 27th through the 30th inclusive, so 4 full days of workspace > >> required?
> >> > >> Thanks > >> > >> On Sat, May 25, 2019 at 3:44 PM Chang She wrote: > >> > > >> > Oh this was more me just carving time out as a forcing function for myself. I?ll be +13 hour ahead so certainly not expecting video conferences. Code reviews would be appreciated. > >> > > >> > As for the read_parquet item, I will either attach to an existing issue or open a new one so discussion can happen on github. > >> > > >> > Thanks. > >> > > >> > On Saturday, May 25, 2019, Joris Van den Bossche wrote: > >> >> > >> >> You are certainly welcome to sprint those days as well. I only can't promise that we can do any things to improve remote participation such as video meetings (but there should be a bunch of core devs active, which might give faster feedback on PRs if needed). > >> >> > >> >> Op di 14 mei 2019 om 22:59 schreef Chang She : > >> >>> > >> >>> I'll be out of the country during those dates. Can I still join in remotely? > >> >>> > >> >>> Here's what I'd be interested in working on if there's appetite for these to be part of pandas and friends: > >> >>> > >> >>> 1. A stale PR on Series.explode I haven't had any time to finish up (https://github.com/pandas-dev/pandas/pull/24366). > >> >> > >> >> > >> >> I think there is still certainly interest in such a function (I have regularly needed something like that myself). > >> >> > >> >>> > >> >>> 2. Open sourcing an improvement to the pandas-redshift connector that speeds up the ingestion of medium amounts of data using a combination of unload + read_csv + multiprocessing. > >> >> > >> >> > >> >> That reminds me: it might be good to mention the pandas-redshift packages somewhere in the docs (in the ecosystem page or in the sql docs). > >> >> > >> >>> > >> >>> 3. A minor improvement to allow read_parquet to work with globs directly. This makes it a lot easier for pandas to read parquet generated by Spark. > >> >>> > >> >> Do you know if there is already an open issue about this? 
> >> >> > >> >> Best, > >> >> Joris > >> >> > >> >>> > >> >>> > >> >>> > >> >>> On Tue, May 14, 2019 at 4:50 AM Joris Van den Bossche wrote: > >> >>>> > >> >>>> Dear all, > >> >>>> > >> >>>> We are planning to do a pandas sprint end of June in Nashville (Tennessee, USA): June 27-30. We will be meeting with some of the core devs (so not a sprint to jump-start newcomers in this case), but sending this to the mailing list to invite other pandas (or related libraries) contributors. > >> >>>> The exact planning of the sprint still needs to be discussed, but we will probably be hacking and discussing on pandas, extension arrays, next versions of pandas, etc. > >> >>>> > >> >>>> So if you are interested, let me know something! We want to keep the number of participants somewhat limited, and also need to plan the location and funding, so please state your interest before May 30. > >> >>>> If you would like to participate, but not sure if you would fit at such a sprint, don't hesitate to mail me personally. > >> >>>> > >> >>>> Best, > >> >>>> Joris > >> >>>> > >> >>>> > >> >>>> _______________________________________________ > >> >>>> Pandas-dev mailing list > >> >>>> Pandas-dev at python.org > >> >>>> https://mail.python.org/mailman/listinfo/pandas-dev > >> > > >> > _______________________________________________ > >> > Pandas-dev mailing list > >> > Pandas-dev at python.org > >> > https://mail.python.org/mailman/listinfo/pandas-dev > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > > > > > -- > > Matthew Roeschke > > From tom.augspurger88 at gmail.com Mon Jun 10 14:03:15 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Mon, 10 Jun 2019 13:03:15 -0500 Subject: [Pandas-dev] Panel on 0.25 In-Reply-To: References: Message-ID: Panel will be removed in 0.25 (it's gone on master). No plans to move the functionality elsewhere. 
I'd recommend pinning to 0.24.x if you need Panel. On Mon, Jun 10, 2019 at 12:52 PM Andrés Gutiérrez < andres.gutierrez.arcia at gmail.com> wrote: > Hi, > > what are the plans for Panel objects for the next release, pandas 0.25? > Are you going to keep it or definitely drop it? If so, are you going to > move it to a separate Python package? > > Thanks! > > -- > *| Andrés Gutiérrez Arcia* > *| andres.gutierrez.arcia at gmail.com * > | https://github.com/andrsGutirrz > | https://www.linkedin.com/in/andres-gutierrez-arcia > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Mon Jun 10 14:08:08 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 10 Jun 2019 13:08:08 -0500 Subject: [Pandas-dev] Panel on 0.25 In-Reply-To: References: Message-ID: Note that if there are community members who wish to fork the Panel code and maintain it as a separate project, you are free to do so. The consensus of the core team has been for some time to no longer invest energy in maintaining this code. If you were going to go down this road, it might be worth questioning whether you'd be better off contributing to xarray instead, or building a composite data structure that uses xarray under the hood. On Mon, Jun 10, 2019 at 1:03 PM Tom Augspurger wrote: > > Panel will be removed in 0.25 (it's gone on master). > > No plans to move the functionality elsewhere. I'd recommend pinning to 0.24.x if you need Panel. > > On Mon, Jun 10, 2019 at 12:52 PM Andrés Gutiérrez wrote: >> >> Hi, >> >> what are the plans for Panel objects for the next release, pandas 0.25? >> Are you going to keep it or definitely drop it? If so, are you going to move it to a separate Python package? >> >> Thanks!
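One common Panel replacement, which also comes up later in this thread, is a DataFrame with a MultiIndex: concatenate one 2-D frame per former Panel "item" with keys. A hedged sketch of that pattern, with invented field/ticker names (xarray's Dataset is the other natural target):

```python
import pandas as pd

# Panel-like data: one DataFrame per "item" (names are made up here).
frames = {
    "open": pd.DataFrame([[0.0, 1.0], [2.0, 3.0]],
                         index=["2019-01-01", "2019-01-02"],
                         columns=["AAPL", "MSFT"]),
    "close": pd.DataFrame([[4.0, 5.0], [6.0, 7.0]],
                          index=["2019-01-01", "2019-01-02"],
                          columns=["AAPL", "MSFT"]),
}

# Concatenate into one DataFrame with a (field, date) MultiIndex.
panel_like = pd.concat(frames)
panel_like.index.names = ["field", "date"]

# Selecting one field recovers a 2-D frame, like panel["open"] used to.
open_frame = panel_like.xs("open", level="field")
```

The same data can also be kept with dates as the outer level, or moved to `xarray.Dataset` if true N-dimensional labeled selection is needed.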
>> >> -- >> | Andrés Gutiérrez Arcia >> | andres.gutierrez.arcia at gmail.com >> | https://github.com/andrsGutirrz >> | https://www.linkedin.com/in/andres-gutierrez-arcia >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From andres.gutierrez.arcia at gmail.com Mon Jun 10 14:26:12 2019 From: andres.gutierrez.arcia at gmail.com (Andrés Gutiérrez) Date: Mon, 10 Jun 2019 12:26:12 -0600 Subject: [Pandas-dev] Panel on 0.25 In-Reply-To: References: Message-ID: Thank you guys! I was asking because I'm dealing with legacy code that uses Panel, and there are a lot of warnings! I'm using a MultiIndex DataFrame to avoid Panel! Thanks for your help! On Mon, 10 Jun 2019 at 12:08, Wes McKinney (wesmckinn at gmail.com) wrote: > Note that if there are community members who wish to fork the Panel > code and maintain it as a separate project, you are free to do so. The > consensus of the core team has been for some time to no longer invest > energy in maintaining this code. > > If you were going to go down this road, it might be worth questioning > whether you'd be better off contributing to xarray instead, or > building a composite data structure that uses xarray under the hood > > On Mon, Jun 10, 2019 at 1:03 PM Tom Augspurger > wrote: > > > > Panel will be removed in 0.25 (it's gone on master). > > > > No plans to move the functionality elsewhere. I'd recommend pinning to > 0.24.x if you need Panel. > > > > On Mon, Jun 10, 2019 at 12:52 PM Andrés Gutiérrez < > andres.gutierrez.arcia at gmail.com> wrote: > >> > >> Hi, > >> > >> what are the plans for Panel objects for the next release, pandas 0.25? > >> Are you going to keep it or definitely drop it?
If so, are you going to > move it to a separate Python package? > >> > >> Thanks! > >> > >> -- > >> | Andrés Gutiérrez Arcia > >> | andres.gutierrez.arcia at gmail.com > >> | https://github.com/andrsGutirrz > >> | https://www.linkedin.com/in/andres-gutierrez-arcia > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > -- *| Andrés Gutiérrez Arcia* *| Student, UNA-CR* *| andres.gutierrez.arcia at gmail.com * | https://github.com/andrsGutirrz | https://www.linkedin.com/in/andres-gutierrez-arcia *| 61688613* -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Mon Jun 10 19:56:13 2019 From: jbrockmendel at gmail.com (Brock Mendel) Date: Mon, 10 Jun 2019 18:56:13 -0500 Subject: [Pandas-dev] Pandas dev sprint June 27-30 @ Nashville In-Reply-To: References: Message-ID: Is there an agenda document somewhere? On Mon, Jun 10, 2019 at 1:02 PM Wes McKinney wrote: > To confirm, the location for the sprints is in Downtown Nashville near > the state capitol building. Specific details about the location will > be shared non-publicly with sprint participants > > On Tue, Jun 4, 2019 at 7:50 AM Wes McKinney wrote: > > > > Yes, I'd recommend within reach of East Nashville or Downtown > > Nashville. The earlier you can book accommodations the better! > > > > I don't have an accurate headcount yet so I haven't been able to > > arrange the location to meet yet. > > > > On Tue, Jun 4, 2019 at 12:59 AM Matthew Roeschke > > wrote: > > > > > > Wes, any tips where in Nashville we should book accommodations or > generally where the co-working space would be? I'm looking to book my > accommodations fairly soon. > > > > > > Thanks.
> > > > > > On Thu, May 30, 2019 at 10:57 AM Wes McKinney > wrote: > > >> > > >> @Joris if you have a complete headcount for the meeting please let me > > >> know so I can work on securing a space for us to work. Just to confirm > > >> it's the 27th through the 30th inclusive, so 4 full days of workspace > > >> required? > > >> > > >> Thanks > > >> > > >> On Sat, May 25, 2019 at 3:44 PM Chang She wrote: > > >> > > > >> > Oh this was more me just carving time out as a forcing function for > myself. I?ll be +13 hour ahead so certainly not expecting video > conferences. Code reviews would be appreciated. > > >> > > > >> > As for the read_parquet item, I will either attach to an existing > issue or open a new one so discussion can happen on github. > > >> > > > >> > Thanks. > > >> > > > >> > On Saturday, May 25, 2019, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > > >> >> > > >> >> You are certainly welcome to sprint those days as well. I only > can't promise that we can do any things to improve remote participation > such as video meetings (but there should be a bunch of core devs active, > which might give faster feedback on PRs if needed). > > >> >> > > >> >> Op di 14 mei 2019 om 22:59 schreef Chang She : > > >> >>> > > >> >>> I'll be out of the country during those dates. Can I still join > in remotely? > > >> >>> > > >> >>> Here's what I'd be interested in working on if there's appetite > for these to be part of pandas and friends: > > >> >>> > > >> >>> 1. A stale PR on Series.explode I haven't had any time to finish > up (https://github.com/pandas-dev/pandas/pull/24366). > > >> >> > > >> >> > > >> >> I think there is still certainly interest in such a function (I > have regularly needed something like that myself). > > >> >> > > >> >>> > > >> >>> 2. Open sourcing an improvement to the pandas-redshift connector > that speeds up the ingestion of medium amounts of data using a combination > of unload + read_csv + multiprocessing. 
> > >> >> > > >> >> > > >> >> That reminds me: it might be good to mention the pandas-redshift > packages somewhere in the docs (in the ecosystem page or in the sql docs). > > >> >> > > >> >>> > > >> >>> 3. A minor improvement to allow read_parquet to work with globs > directly. This makes it a lot easier for pandas to read parquet generated > by Spark. > > >> >>> > > >> >> Do you know if there is already an open issue about this? > > >> >> > > >> >> Best, > > >> >> Joris > > >> >> > > >> >>> > > >> >>> > > >> >>> > > >> >>> On Tue, May 14, 2019 at 4:50 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > > >> >>>> > > >> >>>> Dear all, > > >> >>>> > > >> >>>> We are planning to do a pandas sprint end of June in Nashville > (Tennessee, USA): June 27-30. We will be meeting with some of the core devs > (so not a sprint to jump-start newcomers in this case), but sending this to > the mailing list to invite other pandas (or related libraries) contributors. > > >> >>>> The exact planning of the sprint still needs to be discussed, > but we will probably be hacking and discussing on pandas, extension arrays, > next versions of pandas, etc. > > >> >>>> > > >> >>>> So if you are interested, let me know something! We want to keep > the number of participants somewhat limited, and also need to plan the > location and funding, so please state your interest before May 30. > > >> >>>> If you would like to participate, but not sure if you would fit > at such a sprint, don't hesitate to mail me personally. 
> > >> >>>> > > >> >>>> Best, > > >> >>>> Joris > > >> >>>> > > >> >>>> > > >> >>>> _______________________________________________ > > >> >>>> Pandas-dev mailing list > > >> >>>> Pandas-dev at python.org > > >> >>>> https://mail.python.org/mailman/listinfo/pandas-dev > > >> > > > >> > _______________________________________________ > > >> > Pandas-dev mailing list > > >> > Pandas-dev at python.org > > >> > https://mail.python.org/mailman/listinfo/pandas-dev > > >> _______________________________________________ > > >> Pandas-dev mailing list > > >> Pandas-dev at python.org > > >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > > > > > > > > > -- > > > Matthew Roeschke > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Jun 11 04:15:30 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 11 Jun 2019 10:15:30 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: The current page about pandas ( https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a month (but I am not fully sure whether this is what is already available from their current subscribers, or whether it is a prospect). On Sat, 8 Jun 2019 at 22:54, William Ayd wrote: > What is the minimum amount we are asking for? The $1,000 a month for NumPy > seems rather low and I thought previous emails had something in the range > of $3k a month.
> > I don't think we necessarily need or would be that much improved by $12k > per year so would rather aim higher if we are going to do this > > On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > > Hi all, > > We discussed this on the last dev chat, but putting it on the mailing list > for those who were not present: we are planning to contact Tidelift to > enter into a sponsor agreement for Pandas. > > The idea is to follow what NumPy (and recently also Scipy) did to have an > agreement between Tidelift and NumFOCUS instead of an individual maintainer > (see their announcement mail: > https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html > ). > Blog with overview about Tidelift: > https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift > . > > We didn't discuss yet what to do specifically with those funds, that > should still be discussed in the future. > > Cheers, > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.gommers at gmail.com Tue Jun 11 04:44:38 2019 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 11 Jun 2019 10:44:38 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > The current page about pandas ( > https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a > month (but I am not fully sure this is what is already available from their > current subscribers, or if it is a prospect). > It's not just a prospect, that's what you should/will get. NumPy and SciPy get the listed amounts too. Agreed that the NumPy amount is not that much.
The amount gets determined automatically; it's some combination of customer interest, dependency analysis and size of the API surface. The current amounts are: NumPy: $1000 SciPy: $2500 Pandas: $3000 Matplotlib: n.a. Scikit-learn: $1500 Scikit-image: $50 Statsmodels: $50 So there's an element of randomness, but the results are not completely surprising I think. The four libraries that get order thousands of dollars are the ones that large corporations are going to have the highest interest in. Cheers, Ralf > > Op za 8 jun. 2019 om 22:54 schreef William Ayd : > >> What is the minimum amount we are asking for? The $1,000 a month for >> NumPy seems rather low and I thought previous emails had something in the >> range of $3k a month. >> >> I don?t think we necessarily need or would be that much improved by $12k >> per year so would rather aim higher if we are going to do this >> >> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >> Hi all, >> >> We discussed this on the last dev chat, but putting it on the mailing >> list for those who were not present: we are planning to contact Tidelift to >> enter into a sponsor agreement for Pandas. >> >> The idea is to follow what NumPy (and recently also Scipy) did to have an >> agreement between Tidelift and NumFOCUS instead of an individual maintainer >> (see their announcement mail: >> https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html >> ). >> Blog with overview about Tidelift: https://blog.tidelift >> .com/how-to-start-earning-money-for-your-open-source-project-with- >> tidelift. >> >> We didn't discuss yet what to do specifically with those funds, that >> should still be discussed in the future. 
From william.ayd at icloud.com Tue Jun 11 08:57:49 2019
From: william.ayd at icloud.com (William Ayd)
Date: Tue, 11 Jun 2019 08:57:49 -0400
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

Just some counterpoints to consider:

- $3,000 a month isn't really that much, and if it's just a number that a
well-funded company chose for us, chances are they are benefiting from it
way more than we are
- There is no such thing as free money; we have to consider how to account
for and actually manage it (perhaps mitigated somewhat by NumFOCUS)
- Advertising and ties to a corporate sponsorship may weaken the brand of
pandas; at that point we may lose some credibility as open source
volunteers
- We don't (AFAIK) have a plan on how to spend or allocate it

Not totally against it, but perhaps the last point above is the main
sticking one. Do we have any idea how much we'd actually pocket out of the
$3k they offer us and subsequently what we would do with it? Cover travel
expenses? Support PyData conferences? Scholarships?

- Will

> On Jun 11, 2019, at 4:44 AM, Ralf Gommers wrote:
>
> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche wrote:
>
>> The current page about pandas (https://tidelift.com/lifter/search/pypi/pandas)
>> mentions $3,000 a month (but I am not fully sure whether this is what is
>> already available from their current subscribers, or if it is a prospect).
>
> It's not just a prospect; that's what you should/will get. NumPy and SciPy
> get the listed amounts too.
From tom.augspurger88 at gmail.com Tue Jun 11 09:03:15 2019
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 11 Jun 2019 08:03:15 -0500
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev <
pandas-dev at python.org> wrote:

> Just some counterpoints to consider:
>
> - $3,000 a month isn't really that much, and if it's just a number that a
> well-funded company chose for us, chances are they are benefiting from it
> way more than we are
> - There is no such thing as free money; we have to consider how to account
> for and actually manage it (perhaps mitigated somewhat by NumFOCUS)
>
Perhaps Ralf can share how this has gone for NumPy. I imagine it's not
too much work on their end, thanks to NumFOCUS.

> - Advertising and ties to a corporate sponsorship may weaken the brand of
> pandas; at that point we may lose some credibility as open source
> volunteers
>
Anecdotally, I don't think that's how the community views Tidelift. My
perception (from Twitter, blogs / comments) is that it's been well received.

> - We don't (AFAIK) have a plan on how to spend or allocate it
>
> Not totally against it, but perhaps the last point above is the main
> sticking one. Do we have any idea how much we'd actually pocket out of the
> $3k they offer us and subsequently what we would do with it? Cover travel
> expenses? Support PyData conferences? Scholarships?
Agreed that we should set a purpose for this money (though I have no
objection to collecting it while we set that dedicated purpose).
From ralf.gommers at gmail.com Tue Jun 11 09:32:40 2019
From: ralf.gommers at gmail.com (Ralf Gommers)
Date: Tue, 11 Jun 2019 15:32:40 +0200
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger wrote:

> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev <
> pandas-dev at python.org> wrote:
>
>> Just some counterpoints to consider:
>>
>> - $3,000 a month isn't really that much, and if it's just a number that
>> a well-funded company chose for us, chances are they are benefiting from
>> it way more than we are
>>
"It's not really that much" is something I don't agree with.
It doesn't employ someone, but it's enough to pay for things like developer
meetups, hiring an extra GSoC student if a good one happens to come along,
paying a web dev for a full redesign of the project website, etc. Each of
those things is in the $5,000 - $15,000 range, and it's _very_ nice to be
able to do them without having to look for funding first.

Tidelift is a small (now ~25 employees) company, by the way, and they have
a real understanding of the open source sustainability problem and seem
dedicated to helping fix it.

>> - There is no such thing as free money; we have to consider how to
>> account for and actually manage it (perhaps mitigated somewhat by NumFOCUS)
>>
> Perhaps Ralf can share how this has gone for NumPy. I imagine it's not
> too much work on their end, thanks to NumFOCUS.
>
NumFOCUS handles receiving the money and the associated admin. As the
project, you'll be responsible for the setup and ongoing tasks. For NumPy
and SciPy I have done those tasks. It's a fairly minimal amount of work:
https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. The
main one was dealing with GitHub not recognizing our license, and you don't
have that issue for Pandas (it's reported correctly as BSD-3 in the UI at
https://github.com/pandas-dev/pandas).

So it's probably a day of work for one person to get familiar with the
interface, check dependencies, release streams, paste in release notes,
etc., and then ongoing maybe one or a couple of hours a month. So far it's
been a much more effective way of spending time than, for example, grant
writing.

>> - Advertising and ties to a corporate sponsorship may weaken the brand
>> of pandas; at that point we may lose some credibility as open source
>> volunteers
>>
> Anecdotally, I don't think that's how the community views Tidelift. My
> perception (from Twitter, blogs / comments) is that it's been well
> received.
>
Agree, the feedback I've seen is all quite positive.
>> - We don't (AFAIK) have a plan on how to spend or allocate it
>>
>> Not totally against it, but perhaps the last point above is the main
>> sticking one. Do we have any idea how much we'd actually pocket out of
>> the $3k they offer us and subsequently what we would do with it? Cover
>> travel expenses? Support PyData conferences? Scholarships?
>>
> Agreed that we should set a purpose for this money (though I have no
> objection to collecting it while we set that dedicated purpose).
>
For NumPy and SciPy we haven't earmarked the funds yet. It's nice to build
up a buffer first. One thing I'm thinking of is that we're participating in
Google Season of Docs and are getting more high-quality applicants than
Google will accept, so we could pay one or two tech writers from the funds.
Our website and high-level docs (tutorial, restructuring of all docs to
guide users better) sure could use it. :)

My abstract advice would be: pay for things that require money (like a dev
meeting) or that don't get done for free. Don't pay for writing code unless
the case is extremely compelling, because that'll be a drop in the bucket.

Cheers,
Ralf
From william.ayd at icloud.com Tue Jun 11 09:37:36 2019
From: william.ayd at icloud.com (William Ayd)
Date: Tue, 11 Jun 2019 09:37:36 -0400
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

Great, thanks for the clarification, Tom and Ralf. All points make sense,
and it sounds like Tidelift is doing good things then. I'm on board if
everyone else is.

- Will
From jorisvandenbossche at gmail.com Tue Jun 11 09:41:40 2019
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 11 Jun 2019 15:41:40 +0200
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

On Tue, Jun 11, 2019 at 15:31, Ralf Gommers wrote:

>> Anecdotally, I don't think that's how the community views Tidelift. My
>> perception (from Twitter, blogs / comments) is that it's been well
>> received.
>>
> Agree, the feedback I've seen is all quite positive.
>
Additionally, I don't think there is any "advertisement" involved, at least
not in the classical sense of adding ads for third-party companies in a
sidebar on our website for which we get money. Of course we will need to
mention Tidelift in some way, e.g. in our sponsors / institutional partners
section, but we already do that for some other companies as well (that
employ core devs).

>>> - We don't (AFAIK) have a plan on how to spend or allocate it
>>>
>> Agreed that we should set a purpose for this money (though I have no
>> objection to collecting it while we set that dedicated purpose).
>>
Indeed, we need to discuss this, but I don't think we already need to know
*exactly* what we want to do with it before setting up a contract with
Tidelift. It's good to already start discussing it now, but maybe in a
separate thread?
From wesmckinn at gmail.com Tue Jun 11 10:09:23 2019
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 11 Jun 2019 09:09:23 -0500
Subject: [Pandas-dev] Tidelift
In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com>
Message-ID:

Personally, I would recommend putting most of the money in your own
pockets. The whole idea of Tidelift (as I understand it) is for the
individuals doing work that is of importance to project users (to whom
Tidelift is providing indemnification and "insurance" against defects) to
get paid for their labor. So I think the most honest way to use the money
is to put it in your respective bank accounts. If you're getting a little
bit of money to spend on yourself, doesn't that make doing the maintenance
work a bit less thankless?

If you don't pay yourselves, I think it actually "breaks" Tidelift's pitch
to customers, which is that open source projects need to have a higher
fraction of compensated maintenance and support work than they do now.

How you allocate the money to each other is something you can debate
privately.

On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche wrote:
Each of those things is in the $5,000 - $15,000 range, and it's _very_ nice to be able to do them without having to look for funding first.
>>
>> Tidelift is a small (now ~25 employees) company, by the way, and they have a real understanding of the open source sustainability issues and seem dedicated to helping fix them.
>>
>>>> - There is no such thing as free money; we have to consider how to account for and actually manage it (perhaps mitigated somewhat by NumFOCUS)
>>>
>>> Perhaps Ralf can share how this has gone for NumPy. I imagine it's not too much work on their end, thanks to NumFOCUS.
>>
>> NumFOCUS handles receiving the money and the associated admin. As the project, you'll be responsible for the setup and ongoing tasks. For NumPy and SciPy I have done those tasks. It's a fairly minimal amount of work: https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. The main one was dealing with GitHub not recognizing our license, and you don't have that issue for Pandas (it's reported correctly as BSD-3 in the UI at https://github.com/pandas-dev/pandas).
>>
>> So it's probably a day of work for one person, to get familiar with the interface, check dependencies and release streams, paste in release notes, etc. And then ongoing maybe one or a couple of hours a month. So far it's been a much more effective way of spending time than, for example, grant writing.
>>
>>>> - Advertising and ties to a corporate sponsorship may weaken the brand of pandas; at that point we may lose some credibility as open source volunteers
>>>
>>> Anecdotally, I don't think that's how the community views Tidelift. My perception (from Twitter, blogs / comments) is that it's been well received.
>>
>> Agree, the feedback I've seen is all quite positive.
>
> Additionally, I don't think there is any "advertisement" involved, at least not in the classical sense of adding ads for third-party companies in a sidebar of our website for which we get money.
Of course we will need to mention Tidelift in some way, e.g. in our sponsors / institutional partners section, but we already do that for some other companies as well (that employ core devs).
>
>>>> - We don't (AFAIK) have a plan on how to spend or allocate it
>>>>
>>>> Not totally against it, but perhaps the last point above is the main sticking one. Do we have any idea how much we'd actually pocket out of the $3k they offer us, and subsequently what we would do with it? Cover travel expenses? Support PyData conferences? Scholarships?
>>>
>>> Agreed that we should set a purpose for this money (though I have no objection to collecting while we set that dedicated purpose).
>
> Indeed we need to discuss this, but I don't think we need to know *exactly* what we want to do with it before setting up a contract with Tidelift. It's good to already start discussing it now, but maybe in a separate thread?
>
>> For NumPy and SciPy we haven't earmarked the funds yet. It's nice to build up a buffer first. One thing I'm thinking of is that we're participating in Google Season of Docs, and are getting more high quality applicants than Google will accept. So we could pay one or two tech writers from the funds. Our website and high level docs (tutorial, restructuring of all docs to guide users better) sure could use it :)
>>
>> My abstract advice would be: pay for things that require money (like a dev meeting) or don't get done for free. Don't pay for writing code unless the case is extremely compelling, because that'll be a drop in the bucket.
>>
>> Cheers,
>> Ralf
>
>>>> - Will
>>>>
>>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers wrote:
>>>>
>>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche wrote:
>>>>>
>>>>> The current page about pandas (https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a month (but I am not fully sure whether this is what is already available from their current subscribers, or if it is a prospect).
>>>>
>>>> It's not just a prospect; that's what you should/will get. NumPy and SciPy get the listed amounts too.
>>>>
>>>> Agreed that the NumPy amount is not that much. The amount gets determined automatically; it's some combination of customer interest, dependency analysis, and size of the API surface.
>>>>
>>>> The current amounts are:
>>>> NumPy: $1000
>>>> SciPy: $2500
>>>> Pandas: $3000
>>>> Matplotlib: n.a.
>>>> Scikit-learn: $1500
>>>> Scikit-image: $50
>>>> Statsmodels: $50
>>>>
>>>> So there's an element of randomness, but the results are not completely surprising, I think. The four libraries that get on the order of thousands of dollars are the ones that large corporations are going to have the highest interest in.
>>>>
>>>> Cheers,
>>>> Ralf
>>>>
>>>>> On Sat, Jun 8, 2019 at 22:54, William Ayd wrote:
>>>>>>
>>>>>> What is the minimum amount we are asking for? The $1,000 a month for NumPy seems rather low, and I thought previous emails had something in the range of $3k a month.
>>>>>>
>>>>>> I don't think we necessarily need or would be that much improved by $12k per year, so would rather aim higher if we are going to do this.
>>>>>>
>>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche wrote:
>>>>>> [...]

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Tue Jun 11 10:12:20 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 11 Jun 2019 09:12:20 -0500 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID:

> How you allocate the money to each other is something you can debate privately

On this, I'm sure that you could set up a lightweight virtual "timesheet" so you can put yourselves "on the clock" when you're doing project maintenance work (there are many of these online, I just read about
https://www.clockspot.com/ recently) to make time reporting a bit more accurate

On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney wrote:
> [...]

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From andy.terrel at gmail.com Tue Jun 11 10:56:04 2019 From: andy.terrel at gmail.com (Andy Ray Terrel) Date: Tue, 11 Jun 2019 09:56:04 -0500 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID:

While the original lifter agreement was an individual contract, in our negotiations with Tidelift, NumFOCUS has explicitly sought a model that allows the project to split the money how they prefer.
This was always Tidelift's intention; it was just faster and easier to scale by focusing on paying individuals.

I do like the idea of paying for maintenance work; I would recommend we set up folks as contractors with NumFOCUS rather than just pocketing money. It will give a lot more legal protection. Then, if some folks don't want to take the cash, they can donate their time and be recognized as in-kind donations, which might have some tax deductions.

It is something I would volunteer to help manage, in order to learn how other projects might use the same techniques.

-- Andy

On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney wrote:
> [...]

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From william.ayd at icloud.com Tue Jun 11 11:54:40 2019 From: william.ayd at icloud.com (William Ayd) Date: Tue, 11 Jun 2019 11:54:40 -0400 Subject: [Pandas-dev] Pandas dev sprint June 27-30 @ Nashville In-Reply-To: References: Message-ID:

Doesn't sound like it, so I added a new section to our dev notes on Google Drive if it helps: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#

> On Jun 10, 2019, at 7:56 PM, Brock Mendel wrote:
>
> Is there an agenda document somewhere?
>
> On Mon, Jun 10, 2019 at 1:02 PM Wes McKinney wrote:
> To confirm, the location for the sprints is in Downtown Nashville near
> the state capitol building.
> Specific details about the location will be shared non-publicly with sprint participants.
>
> On Tue, Jun 4, 2019 at 7:50 AM Wes McKinney wrote:
> > Yes, I'd recommend within reach of East Nashville or Downtown
> > Nashville. The earlier you can book accommodations the better!
> >
> > I don't have an accurate headcount yet, so I haven't been able to
> > arrange the location yet.
> >
> > On Tue, Jun 4, 2019 at 12:59 AM Matthew Roeschke wrote:
> > > [...]
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ralf.gommers at gmail.com Tue Jun 11 12:18:19 2019 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 11 Jun 2019 18:18:19 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID:

On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel wrote:
> While the original lifter agreement was an individual contract, in our
> negotiations with Tidelift, NumFOCUS has explicitly sought a model that
> allows the project to split the money how they prefer. This was always
> Tidelift's intention; it was just faster and easier to scale by focusing on
> paying individuals.

+1, the project deciding for themselves is the intent and a good principle.
> I do like the idea of paying for maintenance work, I would recommend we set > up folks as contractors with NumFOCUS rather than just pocketing money. It > will give a lot more legal protection. Then if some folks don't want to > take the cash, they can donate their time and be recognized as in-kind > donations, which might have some tax deductions. > Keep in mind that this has a lot of potential issues. Examples: 1. Who decides who gets paid, and how? The pandas repo has 1500+ contributors. Lots of potential for friction over small amounts of $. 2. Many people have employment contracts, those typically forbid contracting on the side. So inherently unfair to distribute only to those who are in a position to accept the money. 3. You're now introducing lots of extra paperwork and admin, both directly and indirectly (who wants to deal with the extra complications when filing your taxes?). 4. It may create other weird social dynamics. E.g. if money is now directly coupled to a commit bit, that makes the "who do we give commit rights and when" a potentially more loaded question. And, dividing it into N chunks, the funding becomes nice beer money and a thank you for volunteering. Could be exactly what you'd prefer as a team. But that's imho more in line with the current version of Patreon or GitHub Sponsors rather than with what Tidelift is aiming for. I'd like the idea of "paying for maintenance" if there were enough money to employ people. But realistically, that will take many years. The Tidelift slogan on this is unrealistic for a project like Pandas where maintenance effort is many FTEs; it's perhaps feasible for your typical Javascript library that's popular but small enough for one person maintaining it part-time. > It is something I would volunteer to help manage in order to learn how > other projects might use the same techniques. 
> > -- Andy > > On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney wrote: >> >> > How you allocate the money to each other is something you can debate >> privately >> >> On this, I'm sure that you could set up a lightweight virtual >> "timesheet" so you can put yourselves "on the clock" when you're doing >> project maintenance work (there are many of these online, I just read >> about https://www.clockspot.com/ recently) to make time reporting a >> bit more accurate >> >> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney wrote: >> > >> > Personally, I would recommend putting most of the money in your own >> > pockets. The whole idea of Tidelift (as I understand it) is for the >> > individuals doing work that is of importance to project users (to whom >> > Tidelift is providing indemnification and "insurance" against defects) >> > Actually that's only partially true. Tidelift is paying for very specific things that allow them to do aggregated reporting on licensing, dependencies, security vulnerabilities, release streams & release docs, etc. - basically the stuff that helps large corporations do due diligence and management of a large software stack. It is explicitly out of scope to work on bugs or enhancements in the NumFOCUS-Tidelift agreement (and working on particular technical items was never their intention). So "insurance against defects" isn't part of this, except in a very abstract sense of making the project healthier and therefore reducing the risk of it being abandoned or a lot more buggy on the many-year time scale. Cheers, Ralf > to get paid for their labor. So I think the most honest way to use the >> > money is to put it in your respective bank accounts. If you're getting >> > a little bit of money to spend on yourself, doesn't that make doing >> > the maintenance work a bit less thankless? 
If you don't pay >> > yourselves, I think it actually "breaks" Tidelift's pitch to customers >> > which is that open source projects need to have a higher fraction of >> > compensated maintenance and support work than they do now. >> > >> > How you allocate the money to each other is something you can debate >> privately >> > >> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche >> > wrote: >> > > >> > > >> > > >> > > On Tue, Jun 11, 2019 at 15:31, Ralf Gommers < >> ralf.gommers at gmail.com> wrote: >> > >> >> > >> >> > >> >> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> > >>> >> > >>> >> > >>> >> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev < >> pandas-dev at python.org> wrote: >> > >>>> >> > >>>> Just some counterpoints to consider: >> > >>>> >> > >>>> - $ 3,000 a month isn't really that much, and if it's just a >> number that a well-funded company chose for us chances are they are >> benefiting from it way more than we are >> > >> >> > >> >> > >> "it's not really that much" is something I don't agree with. It >> doesn't employ someone, but it's enough to pay for things like developer >> meetups, hiring an extra GSoC student if a good one happens to come along, >> paying a web dev for a full redesign of the project website, etc. Each of >> those things is in the $5,000 - $15,000 range, and it's _very_ nice to be >> able to do them without having to look for funding first. >> > >> >> > >> Tidelift is a small (now ~25 employees) company by the way, and they >> have a real understanding of the open source sustainability issues and seem >> dedicated to helping fix it. >> > >> >> > >>>> - There is no such thing as free money; we have to consider how to >> account for and actually manage it (perhaps mitigated somewhat by NumFocus) >> > >>> >> > >>> >> > >>> Perhaps Ralf can share how this has gone for NumPy. I imagine it's >> not too much work on their end, thanks to NumFOCUS. 
>> > >> >> > >> NumFOCUS handles receiving the money and associated admin. As the >> project you'll be responsible for the setup and ongoing tasks. For NumPy >> and SciPy I have done those tasks. It's a fairly minimal amount of work: >> https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. The >> main one was dealing with GitHub not recognizing our license, and you don't >> have that issue for Pandas (it's reported correctly as BSD-3 in the UI at >> https://github.com/pandas-dev/pandas). >> > >> >> > >> So it's probably a day of work for one person, to get familiar with >> the interface, check dependencies, release streams, paste in release notes, >> etc. And then ongoing maybe one or a couple of hours a month. So far it's >> been a much more effective way of spending time than, for example, grant >> writing. >> > >> >> > >>> >> > >>>> >> > >>>> - Advertising and ties to a corporate sponsorship may weaken the >> brand of pandas; at that point we may lose some credibility as open >> source volunteers >> > >>> >> > >>> >> > >>> Anecdotally, I don't think that's how the community views Tidelift. >> My perception (from Twitter, blogs / comments) is that it's been well >> received. >> > >> >> > >> >> > >> Agree, the feedback I've seen is all quite positive. >> > > >> > > >> > > Additionally, I don't think there is any "advertisement" involved, at >> least not in the classical sense of adding ads for third-party companies >> in a sidebar to our website for which we get money. Of course we will need >> to mention Tidelift in some way, e.g. in our sponsors / institutional >> partners section, but we already do that for some other companies as well >> (that employ core devs). >> > > >> > >> >> > >> >> > >>> >> > >>>> >> > >>>> - We don't (AFAIK) have a plan on how to spend or allocate it >> > >>>> >> > >>>> Not totally against it but perhaps the last point above is the >> main sticking one. 
Do we have any idea how much we'd actually pocket out of >> the $ 3k they offer us and subsequently what we would do with it? Cover >> travel expenses? Support PyData conferences? Scholarships? >> > >>> >> > >>> >> > >>> Agreed that we should set a purpose for this money (though, I have >> no objection to collecting while we set that dedicated purpose). >> > >> >> > >> >> > > Indeed we need to discuss this, but I don't think we already need to >> know *exactly* what we want to do with it before setting up a contract with >> Tidelift. It's good to already start discussing it now, but maybe in >> a separate thread? >> > > >> > >> >> > >> For NumPy and SciPy we haven't earmarked the funds yet. It's nice to >> build up a buffer first. One thing I'm thinking of is that we're >> participating in Google Season of Docs, and are getting more high quality >> applicants than Google will accept. So we could pay one or two tech writers >> from the funds. Our website and high level docs (tutorial, restructuring of >> all docs to guide users better) sure could use it :) >> > >> >> > >> My abstract advice would be: pay for things that require money (like >> a dev meeting) or don't get done for free. Don't pay for writing code >> unless the case is extremely compelling, because that'll be a drop in the >> bucket. >> > >> >> > >> Cheers, >> > >> Ralf >> > >> >> > >> >> > >>> >> > >>>> >> > >>>> - Will >> > >>>> >> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers >> wrote: >> > >>>> >> > >>>> >> > >>>> >> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> > >>>>> >> > >>>>> The current page about pandas ( >> https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a >> month (but I am not fully sure this is what is already available from their >> current subscribers, or if it is a prospect). >> > >>>> >> > >>>> >> > >>>> It's not just a prospect, that's what you should/will get. 
NumPy >> and SciPy get the listed amounts too. >> > >>>> >> > >>>> Agreed that the NumPy amount is not that much. The amount gets >> determined automatically; it's some combination of customer interest, >> dependency analysis and size of the API surface. >> > >>>> >> > >>>> The current amounts are: >> > >>>> NumPy: $1000 >> > >>>> SciPy: $2500 >> > >>>> Pandas: $3000 >> > >>>> Matplotlib: n.a. >> > >>>> Scikit-learn: $1500 >> > >>>> Scikit-image: $50 >> > >>>> Statsmodels: $50 >> > >>>> >> > >>>> So there's an element of randomness, but the results are not >> completely surprising I think. The four libraries that get on the order of thousands >> of dollars are the ones that large corporations are going to have the >> highest interest in. >> > >>>> >> > >>>> Cheers, >> > >>>> Ralf >> > >>>> >> > >>>>> >> > >>>>> >> > >>>>> On Sat, Jun 8, 2019 at 22:54, William Ayd < >> william.ayd at icloud.com> wrote: >> > >>>>>> >> > >>>>>> What is the minimum amount we are asking for? The $1,000 a month >> for NumPy seems rather low and I thought previous emails had something in >> the range of $3k a month. >> > >>>>>> >> > >>>>>> I don't think we necessarily need or would be that much improved >> by $12k per year so would rather aim higher if we are going to do this >> > >>>>>> >> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> > >>>>>> >> > >>>>>> Hi all, >> > >>>>>> >> > >>>>>> We discussed this on the last dev chat, but putting it on the >> mailing list for those who were not present: we are planning to contact >> Tidelift to enter into a sponsor agreement for Pandas. >> > >>>>>> >> > >>>>>> The idea is to follow what NumPy (and recently also Scipy) did >> to have an agreement between Tidelift and NumFOCUS instead of an individual >> maintainer (see their announcement mail: >> https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html >> ). 
>> > >>>>>> Blog with overview about Tidelift: >> https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift >> . >> > >>>>>> >> > >>>>>> We didn't discuss yet what to do specifically with those funds, >> that should still be discussed in the future. >> > >>>>>> >> > >>>>>> Cheers, >> > >>>>>> Joris >> > >>>>>> _______________________________________________ >> > >>>>>> Pandas-dev mailing list >> > >>>>>> Pandas-dev at python.org >> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> > >>>>>> >> > >>>>>> >> > >>>>> _______________________________________________ >> > >>>>> Pandas-dev mailing list >> > >>>>> Pandas-dev at python.org >> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> > >>>> >> > >>>> >> > >>>> _______________________________________________ >> > >>>> Pandas-dev mailing list >> > >>>> Pandas-dev at python.org >> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> >> > >> _______________________________________________ >> > >> Pandas-dev mailing list >> > >> Pandas-dev at python.org >> > >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > >> > > _______________________________________________ >> > > Pandas-dev mailing list >> > > Pandas-dev at python.org >> > > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wesmckinn at gmail.com Tue Jun 11 13:50:47 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 11 Jun 2019 12:50:47 -0500 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: hi, On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers wrote: > > > > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel wrote: >> >> While the original lifter agreement was an individual contract, in our negotiations with Tidelift, NumFOCUS has explicitly sought a model that allows the project to split the money how they prefer. This was always Tidelift's intention, it was just faster and easier to scale to focus on paying individuals. > > > +1 the project decides for themselves is the intent and a good principle. > >> >> I do like the idea of paying for maintenance work, I would recommend we set up folks as contractors with NumFOCUS rather than just pocketing money. It will give a lot more legal protection. Then if some folks don't want to take the cash, they can donate their time and be recognized as in-kind donations, which might have some tax deductions. > > > Keep in mind that this has a lot of potential issues. Examples: > 1. Who decides who gets paid, and how? The pandas repo has 1500+ contributors. Lots of potential for friction over small amounts of $. More or less the _entire_ point of Tidelift is to incentivize people to do more maintenance work. I think it's worth at least attempting to use this money for its intended economic purpose. The maintainers are, as a first approximation, the ~10-15 active core members listed on https://github.com/pandas-dev/pandas-governance IMHO those are the people that should get paid (going forward) -- if contributors are more motivated to become core team members / maintainers as a result of the Tidelift money, then it has had the desired outcome. > 2. Many people have employment contracts, those typically forbid contracting on the side. 
So inherently unfair to distribute only to those who are in a position to accept the money. This is true -- at least Jeff and maybe others fall into this category. In such cases their "cut" of the maintenance funds can go into the communal fund to pay for other stuff > 3. You're now introducing lots of extra paperwork and admin, both directly and indirectly (who wants to deal with the extra complications when filing your taxes?). Hopefully we're talking just a 1099 from NumFOCUS with a single number to type in, but I'm the wrong person to judge since my taxes are more complicated than most people's =) > 4. It may create other weird social dynamics. E.g. if money is now directly coupled to a commit bit, that makes the "who do we give commit rights and when" a potentially more loaded question. I think this is where the honest self-reporting of time spent comes in. The goal is to increase the average number of maintainer hours per month/year. It's sort of like a crypto-mining pool, but for open source software maintenance =) Obviously maintainers are accountable to the rest of the core team to behave with integrity (professionalism, honesty, etc.) or they can be voted to be removed if they are found to be dishonest. > > And, dividing it into N chunks, the funding becomes nice beer money and a thank you for volunteering. Could be exactly what you'd prefer as a team. But that's imho more in line with the current version of Patreon or GitHub Sponsors rather than with what Tidelift is aiming for. > > I'd like the idea of "paying for maintenance" if there were enough money to employ people. But realistically, that will take many years. The Tidelift slogan on this is unrealistic for a project like Pandas where maintenance effort is many FTEs; it's perhaps feasible for your typical Javascript library that's popular but small enough for one person maintaining it part-time. 
> >> It is something I would volunteer to help manage in order to learn how other projects might use the same techniques. >> >> -- Andy >> >> On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney wrote: >>> >>> > How you allocate the money to each other is something you can debate privately >>> >>> On this, I'm sure that you could set up a lightweight virtual >>> "timesheet" so you can put yourselves "on the clock" when you're doing >>> project maintenance work (there are many of these online, I just read >>> about https://www.clockspot.com/ recently) to make time reporting a >>> bit more accurate >>> >>> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney wrote: >>> > >>> > Personally, I would recommend putting most of the money in your own >>> > pockets. The whole idea of Tidelift (as I understand it) is for the >>> > individuals doing work that is of importance to project users (to whom >>> > Tidelift is providing indemnification and "insurance" against defects) > > Actually that's only partially true. Tidelift is paying for very specific things that allow them to do aggregated reporting on licensing, dependencies, security vulnerabilities, release streams & release docs, etc. - basically the stuff that helps large corporations do due diligence and management of a large software stack. > > It is explicitly out of scope to work on bugs or enhancements in the NumFOCUS-Tidelift agreement (and working on particular technical items was never their intention). So "insurance against defects" isn't part of this, except in a very abstract sense of making the project healthier and therefore reducing the risk of it being abandoned or a lot more buggy on the many-year time scale. > > Cheers, > Ralf > > >>> > to get paid for their labor. So I think the most honest way to use the >>> > money is to put it in your respective bank accounts. If you're getting >>> > a little bit of money to spend on yourself, doesn't that make doing >>> > the maintenance work a bit less thankless? 
If you don't pay >>> > yourselves, I think it actually "breaks" Tidelift's pitch to customers >>> > which is that open source projects need to have a higher fraction of >>> > compensated maintenance and support work than they do now. >>> > >>> > How you allocate the money to each other is something you can debate privately >>> > >>> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche >>> > wrote: >>> > > >>> > > >>> > > >>> > > On Tue, Jun 11, 2019 at 15:31, Ralf Gommers wrote: >>> > >> >>> > >> >>> > >> >>> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger wrote: >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev wrote: >>> > >>>> >>> > >>>> Just some counterpoints to consider: >>> > >>>> >>> > >>>> - $ 3,000 a month isn't really that much, and if it's just a number that a well-funded company chose for us chances are they are benefiting from it way more than we are >>> > >> >>> > >> >>> > >> "it's not really that much" is something I don't agree with. It doesn't employ someone, but it's enough to pay for things like developer meetups, hiring an extra GSoC student if a good one happens to come along, paying a web dev for a full redesign of the project website, etc. Each of those things is in the $5,000 - $15,000 range, and it's _very_ nice to be able to do them without having to look for funding first. >>> > >> >>> > >> Tidelift is a small (now ~25 employees) company by the way, and they have a real understanding of the open source sustainability issues and seem dedicated to helping fix it. >>> > >> >>> > >>>> - There is no such thing as free money; we have to consider how to account for and actually manage it (perhaps mitigated somewhat by NumFocus) >>> > >>> >>> > >>> >>> > >>> Perhaps Ralf can share how this has gone for NumPy. I imagine it's not too much work on their end, thanks to NumFOCUS. >>> > >> >>> > >> >>> > >> NumFOCUS handles receiving the money and associated admin. 
As the project you'll be responsible for the setup and ongoing tasks. For NumPy and SciPy I have done those tasks. It's a fairly minimal amount of work: https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. The main one was dealing with GitHub not recognizing our license, and you don't have that issue for Pandas (it's reported correctly as BSD-3 in the UI at https://github.com/pandas-dev/pandas). >>> > >> >>> > >> So it's probably a day of work for one person, to get familiar with the interface, check dependencies, release streams, paste in release notes, etc. And then ongoing maybe one or a couple of hours a month. So far it's been a much more effective way of spending time than, for example, grant writing. >>> > >> >>> > >>> >>> > >>>> >>> > >>>> - Advertising and ties to a corporate sponsorship may weaken the brand of pandas; at that point we may lose some credibility as open source volunteers >>> > >>> >>> > >>> >>> > >>> Anecdotally, I don't think that's how the community views Tidelift. My perception (from Twitter, blogs / comments) is that it's been well received. >>> > >> >>> > >> >>> > >> Agree, the feedback I've seen is all quite positive. >>> > > >>> > > >>> > > Additionally, I don't think there is any "advertisement" involved, at least not in the classical sense of adding ads for third-party companies in a sidebar to our website for which we get money. Of course we will need to mention Tidelift in some way, e.g. in our sponsors / institutional partners section, but we already do that for some other companies as well (that employ core devs). >>> > > >>> > >> >>> > >> >>> > >>> >>> > >>>> >>> > >>>> - We don't (AFAIK) have a plan on how to spend or allocate it >>> > >>>> >>> > >>>> Not totally against it but perhaps the last point above is the main sticking one. Do we have any idea how much we'd actually pocket out of the $ 3k they offer us and subsequently what we would do with it? Cover travel expenses? Support PyData conferences? 
Scholarships? >>> > >>> >>> > >>> >>> > >>> Agreed that we should set a purpose for this money (though, I have no objection to collecting while we set that dedicated purpose). >>> > >> >>> > >> >>> > > Indeed we need to discuss this, but I don't think we already need to know *exactly* what we want to do with it before setting up a contract with Tidelift. It's good to already start discussing it now, but maybe in a separate thread? >>> > > >>> > >> >>> > >> For NumPy and SciPy we haven't earmarked the funds yet. It's nice to build up a buffer first. One thing I'm thinking of is that we're participating in Google Season of Docs, and are getting more high quality applicants than Google will accept. So we could pay one or two tech writers from the funds. Our website and high level docs (tutorial, restructuring of all docs to guide users better) sure could use it :) >>> > >> >>> > >> My abstract advice would be: pay for things that require money (like a dev meeting) or don't get done for free. Don't pay for writing code unless the case is extremely compelling, because that'll be a drop in the bucket. >>> > >> >>> > >> Cheers, >>> > >> Ralf >>> > >> >>> > >> >>> > >>> >>> > >>>> >>> > >>>> - Will >>> > >>>> >>> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers wrote: >>> > >>>> >>> > >>>> >>> > >>>> >>> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche wrote: >>> > >>>>> >>> > >>>>> The current page about pandas (https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a month (but I am not fully sure this is what is already available from their current subscribers, or if it is a prospect). >>> > >>>> >>> > >>>> >>> > >>>> It's not just a prospect, that's what you should/will get. NumPy and SciPy get the listed amounts too. >>> > >>>> >>> > >>>> Agreed that the NumPy amount is not that much. The amount gets determined automatically; it's some combination of customer interest, dependency analysis and size of the API surface. 
>>> > >>>> >>> > >>>> The current amounts are: >>> > >>>> NumPy: $1000 >>> > >>>> SciPy: $2500 >>> > >>>> Pandas: $3000 >>> > >>>> Matplotlib: n.a. >>> > >>>> Scikit-learn: $1500 >>> > >>>> Scikit-image: $50 >>> > >>>> Statsmodels: $50 >>> > >>>> >>> > >>>> So there's an element of randomness, but the results are not completely surprising I think. The four libraries that get on the order of thousands of dollars are the ones that large corporations are going to have the highest interest in. >>> > >>>> >>> > >>>> Cheers, >>> > >>>> Ralf >>> > >>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> On Sat, Jun 8, 2019 at 22:54, William Ayd wrote: >>> > >>>>>> >>> > >>>>>> What is the minimum amount we are asking for? The $1,000 a month for NumPy seems rather low and I thought previous emails had something in the range of $3k a month. >>> > >>>>>> >>> > >>>>>> I don't think we necessarily need or would be that much improved by $12k per year so would rather aim higher if we are going to do this >>> > >>>>>> >>> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche wrote: >>> > >>>>>> >>> > >>>>>> Hi all, >>> > >>>>>> >>> > >>>>>> We discussed this on the last dev chat, but putting it on the mailing list for those who were not present: we are planning to contact Tidelift to enter into a sponsor agreement for Pandas. >>> > >>>>>> >>> > >>>>>> The idea is to follow what NumPy (and recently also Scipy) did to have an agreement between Tidelift and NumFOCUS instead of an individual maintainer (see their announcement mail: https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html). >>> > >>>>>> Blog with overview about Tidelift: https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift. >>> > >>>>>> >>> > >>>>>> We didn't discuss yet what to do specifically with those funds, that should still be discussed in the future. 
>>> > >>>>>> >>> > >>>>>> Cheers, >>> > >>>>>> Joris >>> > >>>>>> _______________________________________________ >>> > >>>>>> Pandas-dev mailing list >>> > >>>>>> Pandas-dev at python.org >>> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>> > >>>>>> >>> > >>>>>> >>> > >>>>> _______________________________________________ >>> > >>>>> Pandas-dev mailing list >>> > >>>>> Pandas-dev at python.org >>> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>> > >>>> >>> > >>>> >>> > >>>> _______________________________________________ >>> > >>>> Pandas-dev mailing list >>> > >>>> Pandas-dev at python.org >>> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>> > >> >>> > >> _______________________________________________ >>> > >> Pandas-dev mailing list >>> > >> Pandas-dev at python.org >>> > >> https://mail.python.org/mailman/listinfo/pandas-dev >>> > > >>> > > _______________________________________________ >>> > > Pandas-dev mailing list >>> > > Pandas-dev at python.org >>> > > https://mail.python.org/mailman/listinfo/pandas-dev >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev From ralf.gommers at gmail.com Tue Jun 11 15:19:13 2019 From: ralf.gommers at gmail.com (Ralf Gommers) Date: Tue, 11 Jun 2019 21:19:13 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: On Tue, Jun 11, 2019 at 7:51 PM Wes McKinney wrote: > hi, > > On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers > wrote: > > > > > > > > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel > wrote: > >> > >> While the original lifter agreement was an individual contract, in our > negotiations with Tidelift, NumFOCUS has explicitly sought a model that > allows the project to split the money how they prefer. 
This was always > Tidelift's intention, it was just faster and easier to scale to focus on > paying individuals. > > > > +1 the project decides for themselves is the intent and a good principle. > > > >> > >> I do like the idea of paying for maintenance work, I would recommend we > set up folks as contractors with NumFOCUS rather than just pocketing money. > It will give a lot more legal protection. Then if some folks don't want to > take the cash, they can donate their time and be recognized as in-kind > donations, which might have some tax deductions. > > > > > > Keep in mind that this has a lot of potential issues. Examples: > > 1. Who decides who gets paid, and how? The pandas repo has 1500+ > contributors. Lots of potential for friction over small amounts of $. > > More or less the _entire_ point of Tidelift is to incentivize people > to do more maintenance work. I think it's worth at least attempting to > use this money for its intended economic purpose. > You do realize that the other topics I suggested using money for are also maintenance work, right? Even if your sole goal is "get more maintenance work done", spending funds with purpose is likely to be more effective than just putting it in your own pockets. > The maintainers are, as a first approximation, the ~10-15 active core > members listed on > > https://github.com/pandas-dev/pandas-governance > > IMHO those are the people that should get paid (going forward) -- if > contributors are more motivated to become core team members / > maintainers as a result of the Tidelift money, then it has had the > desired outcome. > > > 2. Many people have employment contracts, those typically forbid > contracting on the side. So inherently unfair to distribute only to those > who are in a position to accept the money. > > This is true -- at least Jeff and maybe others fall into this > category. 
In such cases their "cut" of the maintenance funds can go > into the communal fund to pay for other stuff > That does not really seem fair. There must be better options. E.g. for anyone in such a position, you could use the money to pay for travel and hotel if they go to a dev meeting or conference. People can't accept money as income in many cases, but everyone is usually able to get cost reimbursements or accept a free ticket. And you don't have to pay income tax over that, so the $ goes a lot further. > > 3. You're now introducing lots of extra paperwork and admin, both > directly and indirectly (who wants to deal with the extra complications > when filing your taxes?). > > Hopefully we're talking just a 1099 from NumFOCUS with a single number > to type in, but I'm the wrong person to judge since my taxes are more > complicated than most people's =) > There's a lot more to it than typing in a single number if you go, e.g., from 1 to 2 sources of income. Also, NumFOCUS won't be able to give any kind of paperwork for people outside the US. It'll all be up to them to do correct tax reporting/withholding. --- tl;dr this is a complicated topic, it's worth thinking about and making informed choices that maximize the benefits and minimize the costs rather than a simple "just put it in your pockets". Now I'm not a Pandas dev, I just helped with getting the original NumFOCUS-Tidelift agreement in place and wanted to share my experiences with Tidelift. So I'll bow out here. Cheers, Ralf > > 4. It may create other weird social dynamics. E.g. if money is now > directly coupled to a commit bit, that makes the "who do we give commit > rights and when" a potentially more loaded question. > > I think this is where the honest self-reporting of time spent comes > in. The goal is to increase the average number of maintainer hours per > month/year. 
It's sort of like a crypto-mining pool, but for open > source software maintenance =) Obviously maintainers are accountable > to the rest of the core team to behave with integrity > (professionalism, honesty, etc.) or they can be voted to be removed if > they are found to be dishonest. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andy.terrel at gmail.com Tue Jun 11 15:25:45 2019 From: andy.terrel at gmail.com (Andy Ray Terrel) Date: Tue, 11 Jun 2019 14:25:45 -0500 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: On Tue, Jun 11, 2019 at 12:51 PM Wes McKinney wrote: > hi, > > On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers > wrote: > > > > > > > > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel > wrote: > >> > >> While the original lifter agreement was an individual contract, in our > negotiations with Tidelift, NumFOCUS has explicitly sought a model that > allows the project to split the money how they prefer. This was always > Tidelift's intention, it was just faster and easier to scale to focus on > paying individuals. > > > > > > +1 the project decides for themselves is the intent and a good principle. > > > >> > >> I do like the idea of paying for maintence work, I would recommend we > set up folks as contractors with NumFOCUS rather than just pocketing money. > It will give a lot more legal protection. Then if some folks don't want to > take the cash you they can donate their time and be recognized as in-kind > donations, which might have some tax deductions. > > > > > > Keep in mind that this has a lot of potential issues. Examples: > > 1. Who decides who gets paid, and how? The pandas repo has 1500+ > contributors. Lots of potential for friction over small amount of $. > > More or less the _entire_ point of Tidelift is to incentivize people > to do more maintenance work. 
I think it's worth at least attempting to > use this money for its intended economic purpose. > > The maintainers are, as a first approximation, the ~10-15 active core > members listed on > > https://github.com/pandas-dev/pandas-governance > > IMHO those are the people that should get paid (going forward) -- if > contributors are more motivated to become core team members / > maintainers as a result of the Tidelift money, then it has had the > desired outcome. > I would suggest leaving the decision to the project core team with the project Numfocus committee to be the overseer of the implementation. > > > 2. Many people have employment contracts, those typically forbid > contracting on the side. So inherently unfair to distribute only to those > who are in a position to accept the money. > > This is true -- at least Jeff and maybe others fall into this > category. In such cases their "cut" of the maintenance funds can go > into the communal fund to pay for other stuff > > Yes such accommodation will need to be worked out. > > 3. You're now introducing lots of extra paperwork and admin, both > directly and indirectly (who wants to deal with the extra complications > when filing your taxes?). > > Hopefully we're talking just a 1099 from NumFOCUS with a single number > to type in, but I'm the wrong person to judge since my taxes are more > complicated than most people's =) > Generally it is done that way for US based folks and for folks out of the US we tend to let them handle their own taxes. We would need to work that out. Additionally, as in all dealings with businesses, we do the extra paperwork for the other benefits such as limiting the liability of a maintainer. > > > 4. It may create other weird social dynamics. E.g. if money is now > directly coupled to a commit bit, that makes the "who do we give commit > rights and when" a potentially more loaded question. > > I think this is where the honest self-reporting of time spent comes > in. 
The goal is to increase the average number of maintainer hours per > month/year. It's sort of like a crypto-mining pool, but for open > source software maintenance =) Obviously maintainers are accountable > to the rest of the core team to behave with integrity > (professionalism, honesty, etc.) or they can be voted to be removed if > they are found to be dishonest. > > > > And, dividing it into N chunks, the funding becomes nice beer money and a > thank you for volunteering. Could be exactly what you'd prefer as a team. > But that's imho more in line with the current version of Patreon or GitHub > Sponsors rather than with what Tidelift is aiming for. > > > > I'd like the idea of "paying for maintenance" if there were enough money > to employ people. But realistically, that will take many years. The > Tidelift slogan on this is unrealistic for a project like Pandas where > maintenance effort is many FTEs; it's perhaps feasible for your typical > Javascript library that's popular but small enough for one person > maintaining it part-time. > > > >> > >> It is something I would volunteer to help manage in order to learn how > other projects might use the same techniques. > >> > >> -- Andy > >> > >> On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney > wrote: > >>> > >>> > How you allocate the money to each other is something you can debate > privately > >>> > >>> On this, I'm sure that you could set up a lightweight virtual > >>> "timesheet" so you can put yourselves "on the clock" when you're doing > >>> project maintenance work (there are many of these online, I just read > >>> about https://www.clockspot.com/ recently) to make time reporting a > >>> bit more accurate > >>> > >>> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney > wrote: > >>> > > >>> > Personally, I would recommend putting most of the money in your own > >>> > pockets.
The whole idea of Tidelift (as I understand it) is for the > >>> > individuals doing work that is of importance to project users (to > whom > >>> > Tidelift is providing indemnification and "insurance" against > defects) > > > > > > Actually that's only partially true. Tidelift is paying for very > specific things, that allow them to do aggregated reporting on licensing, > dependencies, security vulnerabilities, release streams & release docs, > etc. - basically the stuff that helps large corporations do due diligence > and management of a large software stack. > > > > It is explicitly out of scope to work on bugs or enhancements in the > NumFOCUS-Tidelift agreement (and working on particular technical items was > never their intention). So "insurance against defects" isn't part of this, > except in a very abstract sense of making the project healthier and > therefore reducing the risk of it being abandoned or a lot more buggy on > the many-year time scale. > > > > Cheers, > > Ralf > > > > > >>> > to get paid for their labor. So I think the most honest way to use > the > >>> > money is to put it in your respective bank accounts. If you've > getting > >>> > a little bit of money to spend on yourself, doesn't that make doing > >>> > the maintenance work a bit less thankless? If you don't pay > >>> > yourselves, I think it actually "breaks" Tidelift's pitch to > customers > >>> > which is that open source projects need to have a higher fraction of > >>> > compensated maintenance and support work than they do now. > >>> > > >>> > How you allocate the money to each other is something you can debate > privately > >>> > > >>> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche > >>> > wrote: > >>> > > > >>> > > > >>> > > > >>> > > Op di 11 jun. 
2019 om 15:31 schreef Ralf Gommers < ralf.gommers at gmail.com>: > >>> > >> > >>> > >> > >>> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger < tom.augspurger88 at gmail.com> wrote: > >>> > >>> > >>> > >>> > >>> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev < pandas-dev at python.org> wrote: > >>> > >>>> > >>> > >>>> Just some counterpoints to consider: > >>> > >>>> > >>> > >>>> - $ 3,000 a month isn't really that much, and if it's just a > number that a well-funded company chose for us chances are they are > benefiting from it way more than we are > >>> > >> > >>> > >> > >>> > >> "it's not really that much" is something I don't agree with. It > doesn't employ someone, but it's enough to pay for things like developer > meetups, hiring an extra GSoC student if a good one happens to come along, > paying a web dev for a full redesign of the project website, etc. Each of > those things is in the $5,000 - $15,000 range, and it's _very_ nice to be > able to do them without having to look for funding first. > >>> > >> > >>> > >> Tidelift is a small (now ~25 employees) company by the way, and > they have a real understanding of the open source sustainability issues and > seem dedicated to helping fix it. > >>> > >> > >>> > >>>> - There is no such thing as free money; we have to consider how > to account for and actually manage it (perhaps mitigated somewhat by > NumFocus) > >>> > >>> > >>> > >>> > >>> > >>> Perhaps Ralf can share how this has gone for NumPy. I imagine > it's not too much work on their end, thanks to NumFOCUS. > >>> > >> > >>> > >> > >>> > >> NumFOCUS handles receiving the money and associated admin. As the > project you'll be responsible for the setup and ongoing tasks. For NumPy > and SciPy I have done those tasks. It's a fairly minimal amount of work: > https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed.
The > main one was dealing with GitHub not recognizing our license, and you don't > have that issue for Pandas (it's reported correctly as BSD-3 in the UI at > https://github.com/pandas-dev/pandas). > >>> > >> > >>> > >> So it's probably a day of work for one person, to get familiar > with the interface, check dependencies, release streams, paste in release > notes, etc. And then ongoing maybe one or a couple of hours a month. So far > it's been a much more effective way of spending time than, for example, > grant writing. > >>> > >> > >>> > >>> > >>> > >>>> > >>> > >>>> - Advertising and ties to a corporate sponsorship may weaken > the brand of pandas; at that point we may lose some credibility as open > source volunteers > >>> > >>> > >>> > >>> > >>> > >>> Anecdotally, I don't think that's how the community views > Tidelift. My perception (from Twitter, blogs / comments) is that it's been > well received. > >>> > >> > >>> > >> > >>> > >> Agree, the feedback I've seen is all quite positive. > >>> > > > >>> > > > >>> > > Additionally, I don't think there is any "advertisement" involved, > at least not in the classical sense of adding ads for third-party > companies in a sidebar to our website for which we get money. Of course we > will need to mention Tidelift in some way, e.g. in our sponsors / > institutional partners section, but we already do that for some other > companies as well (that employ core devs). > >>> > > > >>> > >> > >>> > >> > >>> > >>> > >>> > >>>> > >>> > >>>> - We don't (AFAIK) have a plan on how to spend or allocate it > >>> > >>>> > >>> > >>>> Not totally against it but perhaps the last point above is the > main sticking one. Do we have any idea how much we'd actually pocket out of > the $ 3k they offer us and subsequently what we would do with it? Cover > travel expenses? Support PyData conferences? Scholarships?
> >>> > >>> > >>> > >>> Agreed that we should set a purpose for this money (though, I > have no objection to collecting while we set that dedicated purpose). > >>> > >> > >>> > >> > >>> > > Indeed we need to discuss this, but I don't think we already need > to know *exactly* what we want to do with it before setting up a contract > with Tidelift. It's good for me to already start discussing it now, but > maybe in a separate thread? > >>> > > > >>> > >> > >>> > >> For NumPy and SciPy we haven't earmarked the funds yet. It's nice > to build up a buffer first. One thing I'm thinking of is that we're > participating in Google Season of Docs, and are getting more high quality > applicants than Google will accept. So we could pay one or two tech writers > from the funds. Our website and high level docs (tutorial, restructuring of > all docs to guide users better) sure could use it:) > >>> > >> > >>> > >> My abstract advice would be: pay for things that require money > (like a dev meeting) or don't get done for free. Don't pay for writing code > unless the case is extremely compelling, because that'll be a drop in the > bucket. > >>> > >> > >>> > >> Cheers, > >>> > >> Ralf > >>> > >> > >>> > >> > >>> > >>> > >>> > >>>> > >>> > >>>> - Will > >>> > >>>> > >>> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers < > ralf.gommers at gmail.com> wrote: > >>> > >>>> > >>> > >>>> > >>> > >>>> > >>> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>> > >>>>> > >>> > >>>>> The current page about pandas ( > https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a > month (but I am not fully sure this is what is already available from their > current subscribers, or if it is a prospect). > >>> > >>>> > >>> > >>>> > >>> > >>>> It's not just a prospect, that's what you should/will get. > NumPy and SciPy get the listed amounts too. > >>> > >>>> > >>> > >>>> Agreed that the NumPy amount is not that much.
The amount gets > determined automatically; it's some combination of customer interest, > dependency analysis and size of the API surface. > >>> > >>>> > >>> > >>>> The current amounts are: > >>> > >>>> NumPy: $1000 > >>> > >>>> SciPy: $2500 > >>> > >>>> Pandas: $3000 > >>> > >>>> Matplotlib: n.a. > >>> > >>>> Scikit-learn: $1500 > >>> > >>>> Scikit-image: $50 > >>> > >>>> Statsmodels: $50 > >>> > >>>> > >>> > >>>> So there's an element of randomness, but the results are not > completely surprising I think. The four libraries that get order thousands > of dollars are the ones that large corporations are going to have the > highest interest in. > >>> > >>>> > >>> > >>>> Cheers, > >>> > >>>> Ralf > >>> > >>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> Op za 8 jun. 2019 om 22:54 schreef William Ayd < > william.ayd at icloud.com>: > >>> > >>>>>> > >>> > >>>>>> What is the minimum amount we are asking for? The $1,000 a > month for NumPy seems rather low and I thought previous emails had > something in the range of $3k a month. > >>> > >>>>>> > >>> > >>>>>> I don?t think we necessarily need or would be that much > improved by $12k per year so would rather aim higher if we are going to do > this > >>> > >>>>>> > >>> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>> > >>>>>> > >>> > >>>>>> Hi all, > >>> > >>>>>> > >>> > >>>>>> We discussed this on the last dev chat, but putting it on the > mailing list for those who were not present: we are planning to contact > Tidelift to enter into a sponsor agreement for Pandas. > >>> > >>>>>> > >>> > >>>>>> The idea is to follow what NumPy (and recently also Scipy) > did to have an agreement between Tidelift and NumFOCUS instead of an > individual maintainer (see their announcement mail: > https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html > ). 
> >>> > >>>>>> Blog with overview about Tidelift: > https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift > . > >>> > >>>>>> > >>> > >>>>>> We didn't discuss yet what to do specifically with those > funds, that should still be discussed in the future. > >>> > >>>>>> > >>> > >>>>>> Cheers, > >>> > >>>>>> Joris > >>> > >>>>>> _______________________________________________ > >>> > >>>>>> Pandas-dev mailing list > >>> > >>>>>> Pandas-dev at python.org > >>> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > >>>>>> > >>> > >>>>>> > >>> > >>>>> _______________________________________________ > >>> > >>>>> Pandas-dev mailing list > >>> > >>>>> Pandas-dev at python.org > >>> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > >>>> > >>> > >>>> > >>> > >>>> _______________________________________________ > >>> > >>>> Pandas-dev mailing list > >>> > >>>> Pandas-dev at python.org > >>> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > >> > >>> > >> _______________________________________________ > >>> > >> Pandas-dev mailing list > >>> > >> Pandas-dev at python.org > >>> > >> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > > > >>> > > _______________________________________________ > >>> > > Pandas-dev mailing list > >>> > > Pandas-dev at python.org > >>> > > https://mail.python.org/mailman/listinfo/pandas-dev > >>> _______________________________________________ > >>> Pandas-dev mailing list > >>> Pandas-dev at python.org > >>> https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jbrockmendel at gmail.com Tue Jun 11 16:38:18 2019 From: jbrockmendel at gmail.com (Brock Mendel) Date: Tue, 11 Jun 2019 15:38:18 -0500 Subject: [Pandas-dev] Arithmetic Proposal Message-ID: I've been working on arithmetic/comparison bugs and more recently on performance problems caused by fixing some of those bugs. After trying less-invasive approaches, I've concluded a fairly big fix is called for. This is an RFC for that proposed fix.

------

In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making DataFrame arithmetic ops operate column-by-column, dispatching to the Series implementations. This led to a significant performance hit for operations on DataFrames with many columns (#24990, #26061).

To restore the lost performance, we need to have these operations take place at the Block level. To prevent DataFrame behavior from diverging from Series behavior (again), we need to retain a single shared implementation. This is a proposal for how to meet these two needs.

Proposal:
- Allow EA to support 2D arrays
- Use PandasArray to back Block subclasses currently backed by ndarray
- Implement arithmetic and comparison ops directly on PandasArray, then have Series, DataFrame, and Index ops pass through to the PandasArray implementations.

Fixes:
- Performance degradation in DataFrame ops (#24990, #26061)
- The last remaining inconsistencies between Index and Series ops (#19322, #18824)
- Most of the xfailing arithmetic tests
- #22120: Transposing dataframe loses dtype and ExtensionArray
- #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no reshape
- #23925 DataFrame Quantile Broken with Datetime Data

Other Upsides:
- Series constructor could dispatch to pd.array, de-duplicating a lot of code.
- Easier to move to Arrow backend if Blocks are numpy-naive.
- Make EA closer to a drop-in replacement for np.ndarray, necessary if we want e.g. xarray to find them directly useful (#24716, #24583)
- Block/BlockManager simplifications, see below.

Downsides:
- Existing constructors assume 1D
- Existing downstream authors assume 1D
- Reduction ops (of which there aren't many) don't have axis kwarg ATM
  - But for PandasArray they just pass through to nanops, which already have+test the axis kwargs
  - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one implementing the reductions and am OK with this extra complication.

Block Simplifications:
- Blocks have three main attributes: values, mgr_locs, and ndim
- ndim is _usually_ the same as values.ndim, the exceptions being for cases where type(values) is restricted to 1D
- Without these restrictions, we can get rid of:
  - Block.ndim, associated kludgy ndim-checking code
  - numerous can-this-be-reshaped/transposed checks and special cases in Block and BlockManager code (which are buggy anyway, e.g. #23925)
- With ndim gone, we can then get rid of mgr_locs!
  - The blocks themselves never use mgr_locs except when passing to their own constructors.
  - mgr_locs makes _much_ more sense as an attribute of the BlockManager
- With mgr_locs gone, Block becomes just a thin wrapper around an EA

Implementation Strategy:
- Remove the 1D restriction
  - Fairly small tweak: an EA subclass must define `shape` instead of `__len__`; other attrs are defined in terms of shape.
  - Define `transpose`, `T`, `reshape`, and `ravel`
- With this done, several tasks can proceed in parallel:
  - simplifications in core.internals, as special-cases for 1D-only can be removed
  - implement and test arithmetic ops on PandasArray
  - back Blocks with PandasArray
  - back Index (and numeric subclasses) with PandasArray
  - Change DataFrame, Series, Index ops to pass through to underlying Blocks/PandasArrays
-------------- next part -------------- An HTML attachment was scrubbed...
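The "remove the 1D restriction" step in the proposal can be sketched with a toy class. This is only a rough illustration, not pandas' actual ExtensionArray or PandasArray implementation: the class name `NDBackedArray` and its internals are hypothetical. It shows the shape-first idea being proposed — the subclass defines `shape`, the other attributes (`__len__`, `ndim`, `T`, `reshape`, `ravel`) derive from it, and arithmetic is implemented once so Series, DataFrame, and Index wrappers could all dispatch to the same code path:

```python
import numpy as np

class NDBackedArray:
    """Toy stand-in for a 2D-capable, ndarray-backed extension array."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    # Per the proposal: the subclass defines `shape`; everything else
    # (len, ndim, transpose, reshape, ravel) is derived from it.
    @property
    def shape(self):
        return self._ndarray.shape

    def __len__(self):
        return self.shape[0]

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def T(self):
        return type(self)(self._ndarray.T)

    def reshape(self, *shape):
        return type(self)(self._ndarray.reshape(*shape))

    def ravel(self):
        return type(self)(self._ndarray.ravel())

    # Arithmetic implemented exactly once, at the array level. A Series,
    # DataFrame, or Index wrapper would pass through to this method, so
    # their behaviors cannot diverge.
    def __add__(self, other):
        if isinstance(other, NDBackedArray):
            other = other._ndarray
        return type(self)(self._ndarray + other)

arr = NDBackedArray([[1, 2, 3], [4, 5, 6]])  # a whole 2D "block" at once
out = arr + 10                               # one op for all columns
print(out.shape)     # (2, 3)
print(len(out))      # 2
print(out.T.shape)   # (3, 2)
```

Because `__add__` operates on the full 2D array, a DataFrame-like wrapper could run arithmetic block-by-block instead of column-by-column, which is the performance point of the proposal.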
URL: From wesmckinn at gmail.com Tue Jun 11 17:09:36 2019 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 11 Jun 2019 16:09:36 -0500 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: On Tue, Jun 11, 2019 at 2:26 PM Andy Ray Terrel wrote: > > > > On Tue, Jun 11, 2019 at 12:51 PM Wes McKinney wrote: >> >> hi, >> >> On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers wrote: >> > >> > >> > >> > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel wrote: >> >> >> >> While the original lifter agreement was an individual contract, in our negotiations with Tidelift, NumFOCUS has explicitly sought a model that allows the project to split the money how they prefer. This was always Tidelift's intention, it was just faster and easier to scale to focus on paying individuals. >> > >> > >> > +1 the project decides for themselves is the intent and a good principle. >> > >> >> >> >> I do like the idea of paying for maintence work, I would recommend we set up folks as contractors with NumFOCUS rather than just pocketing money. It will give a lot more legal protection. Then if some folks don't want to take the cash you they can donate their time and be recognized as in-kind donations, which might have some tax deductions. >> > >> > >> > Keep in mind that this has a lot of potential issues. Examples: >> > 1. Who decides who gets paid, and how? The pandas repo has 1500+ contributors. Lots of potential for friction over small amount of $. >> >> More or less the _entire_ point of Tidelift is to incentivize people >> to do more maintenance work. I think it's worth at least attempting to >> use this money for its intended economic purpose. 
>> >> The maintainers are, as a first approximation, the ~10-15 active core >> members listed on >> >> https://github.com/pandas-dev/pandas-governance >> >> IMHO those are the people that should get paid (going forward) -- if >> contributors are more motivated to become core team members / >> maintainers as a result of the Tidelift money, then it has had the >> desired outcome. > > > I would suggest leaving the decision to the project core team with the project Numfocus committee to be the overseer of the implementation. > Yes, of course, that's the governance that we have in place. I am just stressing that we should try to honor the intent of the asset that is being purchased by Tidelift customers. Tidelift is telling their customers that the money they are paying is going to end up in the pockets of the project maintainers https://tidelift.com/about/lifter If the pandas core team wishes to deny themselves the income (which, divided up, isn't going to be a life-changing amount of money) that's their prerogative -- I just wanted to be clear about where I stand on it, and there's nothing immoral about wanting to be compensated for one's time (given how much volunteered time has already gone uncompensated). One risk to be aware of is that if a high profile project like pandas take's TL's money and none of the maintainers pay themselves with it, then the monthly number may not have as much of a chance of increasing (since current or prospective TL customers may observe that the subscription dollars aren't being used in the way that is being pitched). >> >> >> > 2. Many people have employment contracts, those typically forbid contracting on the side. So inherently unfair to distribute only to those who are in a position to accept the money. >> >> This is true -- at least Jeff and maybe others fall into this >> category. 
In such cases their "cut" of the maintenance funds can go >> into the communal fund to pay for other stuff >> > > Yes such accommodation will need to be worked out. > >> >> > 3. You're now introducing lots of extra paperwork and admin, both directly and indirectly (who wants to deal with the extra complications when filing your taxes?). >> >> Hopefully we're talking just a 1099 from NumFOCUS with a single number >> to type in, but I'm the wrong person to judge since my taxes are more >> complicated than most people's =) > > > Generally it is done that way for US based folks and for folks out of the US we tend to let them handle their own taxes. We would need to work that out. > > Additionally, as in all dealings with businesses, we do the extra paperwork for the other benefits such as limiting the liability of a maintainer. > >> >> >> > 4. It may create other weird social dynamics. E.g. if money is now directly coupled to a commit bit, that makes the "who do we give commit rights and when" a potentially more loaded question. >> >> I think this is where the honest self-reporting of time spent comes >> in. The goal is to increase the average number of maintainer hours per >> month/year. It's sort of like a crypto-mining pool, but for open >> source software maintenance =) Obviously maintainers are accountable >> to the rest of the core team to behave with integrity >> (professionalism, honesty, etc.) or they can be voted to be removed if >> they are found to be dishonest. > > >> >> > >> >> > And, dividing it into N chunks, the funding becomes nice beer money and a thank you for volunteering. Could be exactly what you'd prefer as a team. But that's imho more in line with the current version of Patreon or GitHub Sponsors rather then with what Tidelift is aiming for. >> > >> > I'd like the idea of "paying for maintenance" if there were enough money to employ people. But realistically, that will take many years. 
The Tidelift slogan on this is unrealistic for a project like Pandas where maintenance effort is many FTEs; it's perhaps feasible for your typical Javascript library that's popular but small enough for one person maintaining it part-time. >> > >> >> >> >> It is something I would volunteer to help manage in order to learn how other projects might use the same techniques. >> >> >> >> -- Andy >> >> >> >> On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney wrote: >> >>> >> >>> > How you allocate the money to each other is something you can debate privately >> >>> >> >>> On this, I'm sure that you could set up a lightweight virtual >> >>> "timesheet" so you can put yourselves "on the clock" when you're doing >> >>> project maintenance work (there are many of these online, I just read >> >>> about https://www.clockspot.com/ recently) to make time reporting a >> >>> bit more accurate >> >>> >> >>> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney wrote: >> >>> > >> >>> > Personally, I would recommend putting most of the money in your own >> >>> > pockets. The whole idea of Tidelift (as I understand it) is for the >> >>> > individuals doing work that is of importance to project users (to whom >> >>> > Tidelift is providing indemnification and "insurance" against defects) >> > >> > >> > Actually that's only partially true. Tidelift is paying for very specific things, that allow them to do aggregated reporting on licensing, dependencies, security vulnerabilities, release streams & release docs, etc. - basically the stuff that helps large corporations do due diligence and management of a large software stack. >> > >> > It is explicitly out of scope to work on bugs or enhancements in the NumFOCUS-Tidelift agreement (and working on particular technical items was never their intention). 
So "insurance against defects" isn't part of this, except in a very abstract sense of making the project healthier and therefore reducing the risk of it being abandoned or a lot more buggy on the many-year time scale. >> > >> > Cheers, >> > Ralf >> > >> > >> >>> > to get paid for their labor. So I think the most honest way to use the >> >>> > money is to put it in your respective bank accounts. If you've getting >> >>> > a little bit of money to spend on yourself, doesn't that make doing >> >>> > the maintenance work a bit less thankless? If you don't pay >> >>> > yourselves, I think it actually "breaks" Tidelift's pitch to customers >> >>> > which is that open source projects need to have a higher fraction of >> >>> > compensated maintenance and support work than they do now. >> >>> > >> >>> > How you allocate the money to each other is something you can debate privately >> >>> > >> >>> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche >> >>> > wrote: >> >>> > > >> >>> > > >> >>> > > >> >>> > > Op di 11 jun. 2019 om 15:31 schreef Ralf Gommers : >> >>> > >> >> >>> > >> >> >>> > >> >> >>> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger wrote: >> >>> > >>> >> >>> > >>> >> >>> > >>> >> >>> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev wrote: >> >>> > >>>> >> >>> > >>>> Just some counterpoints to consider: >> >>> > >>>> >> >>> > >>>> - $ 3,000 a month isn?t really that much, and if it?s just a number that a well-funded company chose for us chances are they are benefiting from it way more than we are >> >>> > >> >> >>> > >> >> >>> > >> "it's not really that much" is something I don't agree with. It doesn't employ someone, but it's enough to pay for things like developer meetups, hiring an extra GSoC student if a good one happens to come along, paying a web dev for a full redesign of the project website, etc. 
Each of those things is in the $5,000 - %15,000 range, and it's _very_ nice to be able to do them without having to look for funding first. >> >>> > >> >> >>> > >> Tidelift is a small (now ~25 employees) company by the way, and they have a real understanding of the open source sustainability issues and seem dedicated to helping fix it. >> >>> > >> >> >>> > >>>> - There is no such thing as free money; we have to consider how to account for and actually manage it (perhaps mitigated somewhat by NumFocus) >> >>> > >>> >> >>> > >>> >> >>> > >>> Perhaps Ralph can share how this has gone for NumPy. I imagine it's not too work on their end, thanks to NumFOCUS. >> >>> > >> >> >>> > >> >> >>> > >> NumFOCUS handles receiving the money and associated admin. As the project you'll be responsible for the setup and ongoing tasks. For NumPy and SciPy I have done those tasks. It's a fairly minimal amount of work: https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. The main one was dealing with GitHub not recognizing our license, and you don't have that issue for Pandas (it's reported correctly as BSD-3 in the UI at https://github.com/pandas-dev/pandas). >> >>> > >> >> >>> > >> So it's probably a day of work for one person, to get familiar with the interface, check dependencies, release streams, paste in release notes, etc. And then ongoing maybe one or a couple of hours a month. So far it's been a much more effective way of spending time than, for example, grant writing. >> >>> > >> >> >>> > >>> >> >>> > >>>> >> >>> > >>>> - Advertising and ties to a corporate sponsorship may weaken the brand of pandas; at that point we may lose some creditability as open source volunteers >> >>> > >>> >> >>> > >>> >> >>> > >>> Anecdotally, I don't think that's how the community views Tidelift. My perception (from Twitter, blogs / comments) is that it's been well received. >> >>> > >> >> >>> > >> >> >>> > >> Agree, the feedback I've seen is all quite positive. 
>> >>> > > >> >>> > > Additionally, I don't think there is any "advertisement" involved, at least not in the classical sense of adding ads for third-party companies in a sidebar to our website for which we get money. Of course we will need to mention Tidelift in some way, e.g. in our sponsors / institutional partners section, but we already do that for some other companies as well (that employ core devs). >> >>> > > >> >>> > >> >> >>> > >> >> >>> > >>> >> >>> > >>>> >> >>> > >>>> - We don't (AFAIK) have a plan on how to spend or allocate it >> >>> > >>>> >> >>> > >>>> Not totally against it but perhaps the last point above is the main sticking one. Do we have any idea how much we'd actually pocket out of the $ 3k they offer us and subsequently what we would do with it? Cover travel expenses? Support PyData conferences? Scholarships? >> >>> > >>> >> >>> > >>> >> >>> > >>> Agreed that we should set a purpose for this money (though, I have no objection to collecting while we set that dedicated purpose). >> >>> > >> >> >>> > >> >> >>> > > Indeed we need to discuss this, but I don't think we already need to know *exactly* what we want to do with it before setting up a contract with Tidelift. It's good for me to already start discussing it now, but maybe in a separate thread? >> >>> > > >> >>> > >> >> >>> > >> For NumPy and SciPy we haven't earmarked the funds yet. It's nice to build up a buffer first. One thing I'm thinking of is that we're participating in Google Season of Docs, and are getting more high quality applicants than Google will accept. So we could pay one or two tech writers from the funds. Our website and high level docs (tutorial, restructuring of all docs to guide users better) sure could use it :) >> >>> > >> >> >>> > >> My abstract advice would be: pay for things that require money (like a dev meeting) or don't get done for free. Don't pay for writing code unless the case is extremely compelling, because that'll be a drop in the bucket. 
>> >>> > >> >> >>> > >> Cheers, >> >>> > >> Ralf >> >>> > >> >> >>> > >> >> >>> > >>> >> >>> > >>>> >> >>> > >>>> - Will >> >>> > >>>> >> >>> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers wrote: >> >>> > >>>> >> >>> > >>>> >> >>> > >>>> >> >>> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche wrote: >> >>> > >>>>> >> >>> > >>>>> The current page about pandas (https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 a month (but I am not fully sure this is what is already available from their current subscribers, or if it is a prospect). >> >>> > >>>> >> >>> > >>>> >> >>> > >>>> It's not just a prospect, that's what you should/will get. NumPy and SciPy get the listed amounts too. >> >>> > >>>> >> >>> > >>>> Agreed that the NumPy amount is not that much. The amount gets determined automatically; it's some combination of customer interest, dependency analysis and size of the API surface. >> >>> > >>>> >> >>> > >>>> The current amounts are: >> >>> > >>>> NumPy: $1000 >> >>> > >>>> SciPy: $2500 >> >>> > >>>> Pandas: $3000 >> >>> > >>>> Matplotlib: n.a. >> >>> > >>>> Scikit-learn: $1500 >> >>> > >>>> Scikit-image: $50 >> >>> > >>>> Statsmodels: $50 >> >>> > >>>> >> >>> > >>>> So there's an element of randomness, but the results are not completely surprising I think. The four libraries that get on the order of thousands of dollars are the ones that large corporations are going to have the highest interest in. >> >>> > >>>> >> >>> > >>>> Cheers, >> >>> > >>>> Ralf >> >>> > >>>> >> >>> > >>>>> >> >>> > >>>>> >> >>> > >>>>> Op za 8 jun. 2019 om 22:54 schreef William Ayd : >> >>> > >>>>>> >> >>> > >>>>>> What is the minimum amount we are asking for? The $1,000 a month for NumPy seems rather low and I thought previous emails had something in the range of $3k a month. 
>> >>> > >>>>>> >> >>> > >>>>>> I don't think we necessarily need or would be that much improved by $12k per year so would rather aim higher if we are going to do this >> >>> > >>>>>> >> >>> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche wrote: >> >>> > >>>>>> >> >>> > >>>>>> Hi all, >> >>> > >>>>>> >> >>> > >>>>>> We discussed this on the last dev chat, but putting it on the mailing list for those who were not present: we are planning to contact Tidelift to enter into a sponsor agreement for Pandas. >> >>> > >>>>>> >> >>> > >>>>>> The idea is to follow what NumPy (and recently also Scipy) did to have an agreement between Tidelift and NumFOCUS instead of an individual maintainer (see their announcement mail: https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html). >> >>> > >>>>>> Blog with overview about Tidelift: https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift. >> >>> > >>>>>> >> >>> > >>>>>> We didn't discuss yet what to do specifically with those funds, that should still be discussed in the future. 
>> >>> > >>>>>> >> >>> > >>>>>> Cheers, >> >>> > >>>>>> Joris >> >>> > >>>>>> _______________________________________________ >> >>> > >>>>>> Pandas-dev mailing list >> >>> > >>>>>> Pandas-dev at python.org >> >>> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> > >>>>>> >> >>> > >>>>>> >> >>> > >>>>> _______________________________________________ >> >>> > >>>>> Pandas-dev mailing list >> >>> > >>>>> Pandas-dev at python.org >> >>> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> > >>>> >> >>> > >>>> >> >>> > >>>> _______________________________________________ >> >>> > >>>> Pandas-dev mailing list >> >>> > >>>> Pandas-dev at python.org >> >>> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> > >> >> >>> > >> _______________________________________________ >> >>> > >> Pandas-dev mailing list >> >>> > >> Pandas-dev at python.org >> >>> > >> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> > > >> >>> > > _______________________________________________ >> >>> > > Pandas-dev mailing list >> >>> > > Pandas-dev at python.org >> >>> > > https://mail.python.org/mailman/listinfo/pandas-dev >> >>> _______________________________________________ >> >>> Pandas-dev mailing list >> >>> Pandas-dev at python.org >> >>> https://mail.python.org/mailman/listinfo/pandas-dev From jorisvandenbossche at gmail.com Tue Jun 11 18:17:50 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 12 Jun 2019 00:17:50 +0200 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: Hi Brock, Thanks a lot for starting this discussion and the detailed proposal! I will try to look at it in more detail tomorrow, but one general remark: from time to time, we talked about "getting rid of the BlockManager" or "simplifying the BlockManager" (although I am not sure if there is any specific github issue about it, might be from in-person discussions). 
One of the interpretations of that (or at least how I understood those discussions) was to get away from the 2D block-based internals, and go to a simpler "table as collection of 1D arrays" model. This would also enable a simplification of the internals / BlockManager and many of the other items you mention. So I think we should at least compare a more detailed version of what I described above against your proposal. Because if we want to go in that direction long term, I am not sure extensive work on the current 2D blocks-based BlockManager is worth our time. Joris Op di 11 jun. 2019 om 22:38 schreef Brock Mendel : > I've been working on arithmetic/comparison bugs and more recently on > performance problems caused by fixing some of those bugs. After trying > less-invasive approaches, I've concluded a fairly big fix is called for. > This is an RFC for that proposed fix. > > ------ > In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making > DataFrame arithmetic ops operate column-by-column, dispatching to the > Series implementations. This led to a significant performance hit for > operations on DataFrames with many columns (#24990, #26061). > > To restore the lost performance, we need to have these operations take > place > at the Block level. To prevent DataFrame behavior from diverging from > Series > behavior (again), we need to retain a single shared implementation. > > This is a proposal for how to meet these two needs. > > Proposal: > - Allow EA to support 2D arrays > - Use PandasArray to back Block subclasses currently backed by ndarray > - Implement arithmetic and comparison ops directly on PandasArray, then > have Series, DataFrame, and Index ops pass through to the PandasArray > implementations. 
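The performance issue the proposal describes can be illustrated with a rough sketch (hypothetical code, not pandas internals): operating column-by-column pays one Python-level call and one small allocation per column, while a block-level op is a single vectorized call over the whole 2D buffer. For a frame with thousands of columns, the per-column overhead dominates.

```python
import numpy as np

# Hypothetical illustration, not pandas internals: the same addition done
# column-by-column (one call and one allocation per column, as in the
# 0.24.0 column-wise dispatch) versus once over the whole 2D block.
def add_column_by_column(block, value):
    # one NumPy call per column, then stack the results back together
    return np.column_stack([block[:, i] + value for i in range(block.shape[1])])

def add_block_level(block, value):
    # one NumPy call over the full 2D block
    return block + value

block = np.arange(12.0).reshape(3, 4)
assert np.array_equal(add_column_by_column(block, 1.0), add_block_level(block, 1.0))
```

Both produce the same result; the difference is only in how many times the Python/NumPy dispatch machinery runs.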
> > Fixes: > - Performance degradation in DataFrame ops (#24990, #26061) > - The last remaining inconsistencies between Index and Series ops (#19322, > #18824) > - Most of the xfailing arithmetic tests > - #22120: Transposing dataframe loses dtype and ExtensionArray > - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no > reshape > - #23925 DataFrame Quantile Broken with Datetime Data > > Other Upsides: > - Series constructor could dispatch to pd.array, de-duplicating a lot of > code. > - Easier to move to Arrow backend if Blocks are numpy-naive. > - Make EA closer to a drop-in replacement for np.ndarray, necessary if we > want e.g. xarray to find them directly useful (#24716, #24583) > - Block/BlockManager simplifications, see below. > > Downsides: > - Existing constructors assume 1D > - Existing downstream authors assume 1D > - Reduction ops (of which there aren't many) don't have axis kwarg ATM > - But for PandasArray they just pass through to nanops, which already > have+test the axis kwargs > - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one > implementing the reductions and am OK with this extra complication. > > Block Simplifications: > - Blocks have three main attributes: values, mgr_locs, and ndim > - ndim is _usually_ the same as values.ndim, the exceptions being for > cases where type(values) is restricted to 1D > - Without these restrictions, we can get rid of: > - Block.ndim, associated kludgy ndim-checking code > - numerous can-this-be-reshaped/transposed checks and special cases in > Block and BlockManager code (which are buggy anyway, e.g. #23925) > - With ndim gone, we can then get rid of mgr_locs! > - The blocks themselves never use mgr_locs except when passing to their > own constructors. 
> - mgr_locs makes _much_ more sense as an attribute of the BlockManager > - With mgr_locs gone, Block becomes just a thin wrapper around an EA > > Implementation Strategy: > - Remove the 1D restriction > - Fairly small tweak, EA subclass must define `shape` instead of > `__len__`; other attrs define in terms of shape. > - Define `transpose`, `T`, `reshape`, and `ravel` > - With this done, several tasks can proceed in parallel: > - simplifications in core.internals, as special-cases for 1D-only can > be removed > - implement and test arithmetic ops on PandasArray > - back Blocks with PandasArray > - back Index (and numeric subclasses) with PandasArray > - Change DataFrame, Series, Index ops to pass through to underlying > Blocks/PandasArrays > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jun 11 18:25:12 2019 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 11 Jun 2019 18:25:12 -0400 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: Indeed, it's worth considering if perhaps it would be OK to have a performance regression for very wide dataframes instead. With regards to xarray, 2D extension arrays are interesting but still not particularly helpful. We would still need a wrapper to make them fully N-D, which we need for our data model. On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi Brock, > > Thanks a lot for starting this discussion and the detailed proposal! > > I will try to look at it in more detail tomorrow, but one general remark: > from time to time, we talked about "getting rid of the BlockManager" or > "simplifying the BlockManager" (although I am not sure if there is any > specific github issue about it, might be from in-person discussions). 
One > of the interpretations of that (or at least how I understood those > discussions) was to get away of the 2D block based internals, and go to a > simpler "table as collection of 1D arrays" model. This would also enable a > simplication of the internals / BlockManager and many of the other items > you mention. > > So I think we should at least compare a more detailed version of what I > described above against your proposal. As if we would want to go in that > direction long term, I am not sure extensive work on the current 2D > blocks-based BlockManager is worth our time. > > Joris > > Op di 11 jun. 2019 om 22:38 schreef Brock Mendel : > >> I've been working on arithmetic/comparison bugs and more recently on >> performance problems caused by fixing some of those bugs. After trying >> less-invasive approaches, I've concluded a fairly big fix is called for. >> This is an RFC for that proposed fix. >> >> ------ >> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making >> DataFrame arithmetic ops operate column-by-column, dispatching to the >> Series implementations. This led to a significant performance hit for >> operations on DataFrames with many columns (#24990, #26061). >> >> To restore the lost performance, we need to have these operations take >> place >> at the Block level. To prevent DataFrame behavior from diverging from >> Series >> behavior (again), we need to retain a single shared implementation. >> >> This is a proposal for how meet these two needs. >> >> Proposal: >> - Allow EA to support 2D arrays >> - Use PandasArray to back Block subclasses currently backed by ndarray >> - Implement arithmetic and comparison ops directly on PandasArray, then >> have Series, DataFrame, and Index ops pass through to the PandasArray >> implementations. 
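The "remove the 1D restriction" step in the implementation strategy above, where attributes are defined in terms of `shape` rather than `__len__`, could look roughly like this (a hypothetical minimal container for illustration, not the actual `PandasArray` implementation):

```python
import numpy as np

# Hypothetical sketch: an ndarray-backed array whose attributes all derive
# from `shape`, so the same class can hold 1D or 2D data. Not pandas code.
class NDArrayBacked:
    def __init__(self, values):
        self._ndarray = np.asarray(values)

    @property
    def shape(self):
        return self._ndarray.shape

    @property
    def ndim(self):
        return len(self.shape)

    def __len__(self):
        # derived from shape: length of the first axis
        return self.shape[0]

    @property
    def T(self):
        return type(self)(self._ndarray.T)

    def reshape(self, *shape):
        return type(self)(self._ndarray.reshape(*shape))

    def ravel(self):
        return type(self)(self._ndarray.ravel())

arr = NDArrayBacked([[1, 2, 3], [4, 5, 6]])
assert arr.shape == (2, 3) and arr.ndim == 2
assert len(arr) == 2
assert arr.T.shape == (3, 2)
assert arr.ravel().shape == (6,)
```

With `transpose`/`reshape`/`ravel` in place, a 2D-aware container like this is what would let Blocks hold such arrays directly.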
>> >> Fixes: >> - Performance degradation in DataFrame ops (#24990, #26061) >> - The last remaining inconsistencies between Index and Series ops >> (#19322, #18824) >> - Most of the xfailing arithmetic tests >> - #22120: Transposing dataframe loses dtype and ExtensionArray >> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no >> reshape >> - #23925 DataFrame Quantile Broken with Datetime Data >> >> Other Upsides: >> - Series constructor could dispatch to pd.array, de-duplicating a lot of >> code. >> - Easier to move to Arrow backend if Blocks are numpy-naive. >> - Make EA closer to a drop-in replacement for np.ndarray, necessary if we >> want e.g. xarray to find them directly useful (#24716, #24583) >> - Block/BlockManager simplifications, see below. >> >> Downsides: >> - Existing constructors assume 1D >> - Existing downstream authors assume 1D >> - Reduction ops (of which there aren't many) don't have axis kwarg ATM >> - But for PandasArray they just pass through to nanops, which already >> have+test the axis kwargs >> - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one >> implementing the reductions and am OK with this extra complication. >> >> Block Simplifications: >> - Blocks have three main attributes: values, mgr_locs, and ndim >> - ndim is _usually_ the same as values.ndim, the exceptions being for >> cases where type(values) is restricted to 1D >> - Without these restrictions, we can get rid of: >> - Block.ndim, associated kludgy ndim-checking code >> - numerous can-this-be-reshaped/transposed checks and special cases in >> Block and BlockManager code (which are buggy anyway, e.g. #23925) >> - With ndim gone, we can then get rid of mgr_locs! >> - The blocks themselves never use mgr_locs except when passing to >> their own constructors. 
>> - mgr_locs makes _much_ more sense as an attribute of the BlockManager >> - With mgr_locs gone, Block becomes just a thin wrapper around an EA >> >> Implementation Strategy: >> - Remove the 1D restriction >> - Fairly small tweak, EA subclass must define `shape` instead of >> `__len__`; other attrs define in terms of shape. >> - Define `transpose`, `T`, `reshape`, and `ravel` >> - With this done, several tasks can proceed in parallel: >> - simplifications in core.internals, as special-cases for 1D-only can >> be removed >> - implement and test arithmetic ops on PandasArray >> - back Blocks with PandasArray >> - back Index (and numeric subclasses) with PandasArray >> - Change DataFrame, Series, Index ops to pass through to underlying >> Blocks/PandasArrays >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue Jun 11 21:34:06 2019 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 11 Jun 2019 21:34:06 -0400 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: > One risk to be aware of is that if a high profile > project like pandas takes TL's money and none of the maintainers pay > themselves with it, then the monthly number may not have as much of a > chance of increasing (since current or prospective TL customers may > observe that the subscription dollars aren't being used in the way > that is being pitched). I actually see the exact opposite here. A project of pandas stature that decides to better the project is a pretty respectable goal. 
I believe we would be within the letter and, more importantly, the spirit of Tidelift for the pandas project itself to take this burden & receive the income. Having the project itself with the combined force of multiple maintainers actually would be much more comforting (from the customer's perspective) than a single maintainer (who may not always be there). Furthermore, we could use these funds for the combined benefit of the project, mainly I think for gatherings like the upcoming sprints. I am not sure many of you know, but pandas has not actively solicited *any* monies, and only received 2 largish contributions over the years, which are the majority of our current funds. The Tidelift agreement looks to provide a stream of income which we currently do not have. With an income stream we have options; without we don't. We can always decide to remunerate maintainers who contribute to this effort, though this should be a separate discussion. Jeff On Tue, Jun 11, 2019 at 5:10 PM Wes McKinney wrote: > On Tue, Jun 11, 2019 at 2:26 PM Andy Ray Terrel > wrote: > > > > > > > > On Tue, Jun 11, 2019 at 12:51 PM Wes McKinney > wrote: > >> > >> hi, > >> > >> On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers > wrote: > >> > > >> > > >> > > >> > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel < > andy.terrel at gmail.com> wrote: > >> >> > >> >> While the original lifter agreement was an individual contract, in > our negotiations with Tidelift, NumFOCUS has explicitly sought a model that > allows the project to split the money how they prefer. This was always > Tidelift's intention, it was just faster and easier to scale to focus on > paying individuals. > >> > > >> > > >> > +1 the project decides for themselves is the intent and a good > principle. > >> > > >> >> > >> >> I do like the idea of paying for maintenance work; I would recommend > we set up folks as contractors with NumFOCUS rather than just pocketing > money. It will give a lot more legal protection. 
Then if some folks don't > want to take the cash, they can donate their time and be recognized as > in-kind donations, which might have some tax deductions. > >> > > >> > Keep in mind that this has a lot of potential issues. Examples: > >> > 1. Who decides who gets paid, and how? The pandas repo has 1500+ > contributors. Lots of potential for friction over small amounts of $. > >> > >> More or less the _entire_ point of Tidelift is to incentivize people > >> to do more maintenance work. I think it's worth at least attempting to > >> use this money for its intended economic purpose. > >> > >> The maintainers are, as a first approximation, the ~10-15 active core > >> members listed on > >> > >> https://github.com/pandas-dev/pandas-governance > >> > >> IMHO those are the people that should get paid (going forward) -- if > >> contributors are more motivated to become core team members / > >> maintainers as a result of the Tidelift money, then it has had the > >> desired outcome. > > > > > > I would suggest leaving the decision to the project core team with the > project Numfocus committee to be the overseer of the implementation. > > > > Yes, of course, that's the governance that we have in place. I am just > stressing that we should try to honor the intent of the asset that is > being purchased by Tidelift customers. Tidelift is telling their > customers that the money they are paying is going to end up in the > pockets of the project maintainers > > https://tidelift.com/about/lifter > > If the pandas core team wishes to deny themselves the income (which, > divided up, isn't going to be a life-changing amount of money) that's > their prerogative -- I just wanted to be clear about where I stand on > it, and there's nothing immoral about wanting to be compensated for > one's time (given how much volunteered time has already gone > uncompensated). 
One risk to be aware of is that if a high profile > project like pandas takes TL's money and none of the maintainers pay > themselves with it, then the monthly number may not have as much of a > chance of increasing (since current or prospective TL customers may > observe that the subscription dollars aren't being used in the way > that is being pitched). > > >> > >> > >> > 2. Many people have employment contracts, those typically forbid > contracting on the side. So inherently unfair to distribute only to those > who are in a position to accept the money. > >> > >> This is true -- at least Jeff and maybe others fall into this > >> category. In such cases their "cut" of the maintenance funds can go > >> into the communal fund to pay for other stuff > >> > > > > Yes such accommodation will need to be worked out. > > > >> > >> > 3. You're now introducing lots of extra paperwork and admin, both > directly and indirectly (who wants to deal with the extra complications > when filing your taxes?). > >> > >> Hopefully we're talking just a 1099 from NumFOCUS with a single number > >> to type in, but I'm the wrong person to judge since my taxes are more > >> complicated than most people's =) > > > > > > Generally it is done that way for US based folks and for folks out of > the US we tend to let them handle their own taxes. We would need to work > that out. > > > > Additionally, as in all dealings with businesses, we do the extra > paperwork for the other benefits such as limiting the liability of a > maintainer. > > > >> > >> > >> > 4. It may create other weird social dynamics. E.g. if money is now > directly coupled to a commit bit, that makes the "who do we give commit > rights and when" a potentially more loaded question. > >> > >> I think this is where the honest self-reporting of time spent comes > >> in. The goal is to increase the average number of maintainer hours per > >> month/year. 
It's sort of like a crypto-mining pool, but for open > >> source software maintenance =) Obviously maintainers are accountable > >> to the rest of the core team to behave with integrity > >> (professionalism, honesty, etc.) or they can be voted to be removed if > >> they are found to be dishonest. > > > > > >> > >> > > >> > >> > And, dividing it into N chunks, the funding becomes nice beer money > and a thank you for volunteering. Could be exactly what you'd prefer as a > team. But that's imho more in line with the current version of Patreon or > GitHub Sponsors rather than with what Tidelift is aiming for. > >> > > >> > I'd like the idea of "paying for maintenance" if there were enough > money to employ people. But realistically, that will take many years. The > Tidelift slogan on this is unrealistic for a project like Pandas where > maintenance effort is many FTEs; it's perhaps feasible for your typical > Javascript library that's popular but small enough for one person > maintaining it part-time. > >> > > >> >> > >> >> It is something I would volunteer to help manage in order to learn > how other projects might use the same techniques. > >> >> > >> >> -- Andy > >> >> > >> >> On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney > wrote: > >> >>> > >> >>> > How you allocate the money to each other is something you can > debate privately > >> >>> > >> >>> On this, I'm sure that you could set up a lightweight virtual > >> >>> "timesheet" so you can put yourselves "on the clock" when you're > doing > >> >>> project maintenance work (there are many of these online, I just > read > >> >>> about https://www.clockspot.com/ recently) to make time reporting a > >> >>> bit more accurate > >> >>> > >> >>> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney > wrote: > >> >>> > > >> >>> > Personally, I would recommend putting most of the money in your > own > >> >>> > pockets. 
The whole idea of Tidelift (as I understand it) is for > the > >> >>> > individuals doing work that is of importance to project users (to > whom > >> >>> > Tidelift is providing indemnification and "insurance" against > defects) > >> > > >> > > >> > Actually that's only partially true. Tidelift is paying for very > specific things, that allow them to do aggregated reporting on licensing, > dependencies, security vulnerabilities, release streams & release docs, > etc. - basically the stuff that helps large corporations do due diligence > and management of a large software stack. > >> > > >> > It is explicitly out of scope to work on bugs or enhancements in the > NumFOCUS-Tidelift agreement (and working on particular technical items was > never their intention). So "insurance against defects" isn't part of this, > except in a very abstract sense of making the project healthier and > therefore reducing the risk of it being abandoned or a lot more buggy on > the many-year time scale. > >> > > >> > Cheers, > >> > Ralf > >> > > >> > > >> >>> > to get paid for their labor. So I think the most honest way to > use the > >> >>> > money is to put it in your respective bank accounts. If you've > getting > >> >>> > a little bit of money to spend on yourself, doesn't that make > doing > >> >>> > the maintenance work a bit less thankless? If you don't pay > >> >>> > yourselves, I think it actually "breaks" Tidelift's pitch to > customers > >> >>> > which is that open source projects need to have a higher fraction > of > >> >>> > compensated maintenance and support work than they do now. > >> >>> > > >> >>> > How you allocate the money to each other is something you can > debate privately > >> >>> > > >> >>> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche > >> >>> > wrote: > >> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > Op di 11 jun. 
2019 om 15:31 schreef Ralf Gommers < > ralf.gommers at gmail.com>: > >> >>> > >> > >> >>> > >> > >> >>> > >> > >> >>> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger < > tom.augspurger88 at gmail.com> wrote: > >> >>> > >>> > >> >>> > >>> > >> >>> > >>> > >> >>> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via Pandas-dev < > pandas-dev at python.org> wrote: > >> >>> > >>>> > >> >>> > >>>> Just some counterpoints to consider: > >> >>> > >>>> > >> >>> > >>>> - $ 3,000 a month isn?t really that much, and if it?s just a > number that a well-funded company chose for us chances are they are > benefiting from it way more than we are > >> >>> > >> > >> >>> > >> > >> >>> > >> "it's not really that much" is something I don't agree with. > It doesn't employ someone, but it's enough to pay for things like developer > meetups, hiring an extra GSoC student if a good one happens to come along, > paying a web dev for a full redesign of the project website, etc. Each of > those things is in the $5,000 - %15,000 range, and it's _very_ nice to be > able to do them without having to look for funding first. > >> >>> > >> > >> >>> > >> Tidelift is a small (now ~25 employees) company by the way, > and they have a real understanding of the open source sustainability issues > and seem dedicated to helping fix it. > >> >>> > >> > >> >>> > >>>> - There is no such thing as free money; we have to consider > how to account for and actually manage it (perhaps mitigated somewhat by > NumFocus) > >> >>> > >>> > >> >>> > >>> > >> >>> > >>> Perhaps Ralph can share how this has gone for NumPy. I > imagine it's not too work on their end, thanks to NumFOCUS. > >> >>> > >> > >> >>> > >> > >> >>> > >> NumFOCUS handles receiving the money and associated admin. As > the project you'll be responsible for the setup and ongoing tasks. For > NumPy and SciPy I have done those tasks. It's a fairly minimal amount of > work: https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed. 
> The main one was dealing with GitHub not recognizing our license, and you > don't have that issue for Pandas (it's reported correctly as BSD-3 in the > UI at https://github.com/pandas-dev/pandas). > >> >>> > >> > >> >>> > >> So it's probably a day of work for one person, to get familiar > with the interface, check dependencies, release streams, paste in release > notes, etc. And then ongoing maybe one or a couple of hours a month. So far > it's been a much more effective way of spending time than, for example, > grant writing. > >> >>> > >> > >> >>> > >>> > >> >>> > >>>> > >> >>> > >>>> - Advertising and ties to a corporate sponsorship may weaken > the brand of pandas; at that point we may lose some creditability as open > source volunteers > >> >>> > >>> > >> >>> > >>> > >> >>> > >>> Anecdotally, I don't think that's how the community views > Tidelift. My perception (from Twitter, blogs / comments) is that it's been > well received. > >> >>> > >> > >> >>> > >> > >> >>> > >> Agree, the feedback I've seen is all quite positive. > >> >>> > > > >> >>> > > > >> >>> > > Additionally, I don't think there is any "advertisement" > involved, at least not in the classical sense of adding adds for > third-party companies in a side bar to our website for which we get money. > Of course we will need to mention Tidelift in some way, e.g. in our > sponsors / institutional partners section, but we already do that for some > other companies as well (that employ core devs). > >> >>> > > > >> >>> > >> > >> >>> > >> > >> >>> > >>> > >> >>> > >>>> > >> >>> > >>>> - We don?t (AFAIK) have a plan on how to spend or allocate it > >> >>> > >>>> > >> >>> > >>>> Not totally against it but perhaps the last point above is > the main sticking one. Do we have any idea how much we?d actually pocket > out of the $ 3k they offer us and subsequently what we would do with it? > Cover travel expenses? Support PyData conferences? Scholarships? 
> >> >>> > >>> > >> >>> > >>> > >> >>> > >>> Agreed that we should set a purpose for this money (though, I > have no objection to collecting while we set that dedicated purpose). > >> >>> > >> > >> >>> > >> > >> >>> > > Indeed we need to discuss this, but I don't think we already > need to know *exactly* what we want to do with it before setting up a > contract with Tidelift. It's good for me to alraedy start discussing it > now, but maybe in a separate thread? > >> >>> > > > >> >>> > >> > >> >>> > >> For NumPy and SciPy we haven't earmarked the funds yet. It's > nice to build up a buffer first. One thing I'm thinking of is that we're > participating in Google Season of Docs, and are getting more high quality > applicants than Google will accept. So we could pay one or two tech writers > from the funds. Our website and high level docs (tutorial, restructuring of > all docs to guide users better) sure could use it:) > >> >>> > >> > >> >>> > >> My abstract advice would be: pay for things that require money > (like a dev meeting) or don't get done for free. Don't pay for writing code > unless the case is extremely compelling, because that'll be a drop in the > bucket. > >> >>> > >> > >> >>> > >> Cheers, > >> >>> > >> Ralf > >> >>> > >> > >> >>> > >> > >> >>> > >>> > >> >>> > >>>> > >> >>> > >>>> - Will > >> >>> > >>>> > >> >>> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers < > ralf.gommers at gmail.com> wrote: > >> >>> > >>>> > >> >>> > >>>> > >> >>> > >>>> > >> >>> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> >>> > >>>>> > >> >>> > >>>>> The current page about pandas ( > https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 dollar a > month (but I am not fully sure this is what is already available from their > current subscribers, or if it is a prospect). > >> >>> > >>>> > >> >>> > >>>> > >> >>> > >>>> It's not just a prospect, that's what you should/will get. 
> NumPy and SciPy get the listed amounts too. > >> >>> > >>>> > >> >>> > >>>> Agreed that the NumPy amount is not that much. The amount > gets determined automatically; it's some combination of customer interest, > dependency analysis and size of the API surface. > >> >>> > >>>> > >> >>> > >>>> The current amounts are: > >> >>> > >>>> NumPy: $1000 > >> >>> > >>>> SciPy: $2500 > >> >>> > >>>> Pandas: $3000 > >> >>> > >>>> Matplotlib: n.a. > >> >>> > >>>> Scikit-learn: $1500 > >> >>> > >>>> Scikit-image: $50 > >> >>> > >>>> Statsmodels: $50 > >> >>> > >>>> > >> >>> > >>>> So there's an element of randomness, but the results are not > completely surprising I think. The four libraries that get on the order of thousands > of dollars are the ones that large corporations are going to have the > highest interest in. > >> >>> > >>>> > >> >>> > >>>> Cheers, > >> >>> > >>>> Ralf > >> >>> > >>>> > >> >>> > >>>>> > >> >>> > >>>>> > >> >>> > >>>>> Op za 8 jun. 2019 om 22:54 schreef William Ayd < > william.ayd at icloud.com>: > >> >>> > >>>>>> > >> >>> > >>>>>> What is the minimum amount we are asking for? The $1,000 a > month for NumPy seems rather low and I thought previous emails had > something in the range of $3k a month. > >> >>> > >>>>>> > >> >>> > >>>>>> I don't think we necessarily need or would be that much > improved by $12k per year, so would rather aim higher if we are going to do > this. > >> >>> > >>>>>> > >> >>> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> >>> > >>>>>> > >> >>> > >>>>>> Hi all, > >> >>> > >>>>>> > >> >>> > >>>>>> We discussed this on the last dev chat, but putting it on > the mailing list for those who were not present: we are planning to contact > Tidelift to enter into a sponsor agreement for Pandas.
> >> >>> > >>>>>> > >> >>> > >>>>>> The idea is to follow what NumPy (and recently also SciPy) > did to have an agreement between Tidelift and NumFOCUS instead of an > individual maintainer (see their announcement mail: > https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html > ). > >> >>> > >>>>>> Blog with overview about Tidelift: > https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift > . > >> >>> > >>>>>> > >> >>> > >>>>>> We didn't yet discuss what to do specifically with those > funds; that should still be discussed in the future. > >> >>> > >>>>>> > >> >>> > >>>>>> Cheers, > >> >>> > >>>>>> Joris > >> >>> > >>>>>> _______________________________________________ > >> >>> > >>>>>> Pandas-dev mailing list > >> >>> > >>>>>> Pandas-dev at python.org > >> >>> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > >>>>>> > >> >>> > >>>>>> > >> >>> > >>>>> _______________________________________________ > >> >>> > >>>>> Pandas-dev mailing list > >> >>> > >>>>> Pandas-dev at python.org > >> >>> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > >>>> > >> >>> > >>>> > >> >>> > >>>> _______________________________________________ > >> >>> > >>>> Pandas-dev mailing list > >> >>> > >>>> Pandas-dev at python.org > >> >>> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > >> > >> >>> > >> _______________________________________________ > >> >>> > >> Pandas-dev mailing list > >> >>> > >> Pandas-dev at python.org > >> >>> > >> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > > > >> >>> > > _______________________________________________ > >> >>> > > Pandas-dev mailing list > >> >>> > > Pandas-dev at python.org > >> >>> > > https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> _______________________________________________ > >> >>> Pandas-dev mailing list > >> >>> Pandas-dev at python.org > >> >>> 
https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Tue Jun 11 22:08:00 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 11 Jun 2019 21:08:00 -0500 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: One general question, motivated by Joris' same concern about the future simplified BlockManager: why do block-based, rather than column-based, ops require 2D Extension Arrays? You say > by making DataFrame arithmetic ops operate column-by-column, dispatching to > the Series implementations. Could we instead dispatch both Series and DataFrame ops to Block ops (which then do the op on the ndarray or dispatch to the EA)? If I understand your proposal correctly, then you still have the general DataFrame -> Block -> Array nesting doll. It seems like that should work equally well with our current mix of 2-D and 1-D blocks. So while I agree that Blocks being backed by a maybe 1D / maybe 2D array causes no end of headaches, I don't see why block-based ops need 2D EAs (though I'm not especially familiar with this area; I could easily be missing something basic). - Tom On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer wrote: > Indeed, it's worth considering if perhaps it would be OK to have a > performance regression for very wide dataframes instead. > > With regards to xarray, 2D extension arrays are interesting but still not > particularly helpful. We would still need a wrapper to make them fully N-D, > which we need for our data model. > > On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi Brock, >> >> Thanks a lot for starting this discussion and the detailed proposal! 
>> >> I will try to look at it in more detail tomorrow, but one general remark: >> from time to time, we talked about "getting rid of the BlockManager" or >> "simplifying the BlockManager" (although I am not sure if there is any >> specific github issue about it, might be from in-person discussions). One >> of the interpretations of that (or at least how I understood those >> discussions) was to get away from the 2D block-based internals, and go to a >> simpler "table as collection of 1D arrays" model. This would also enable a >> simplification of the internals / BlockManager and many of the other items >> you mention. >> >> So I think we should at least compare a more detailed version of what I >> described above against your proposal. If we would want to go in that >> direction long term, I am not sure extensive work on the current 2D >> blocks-based BlockManager is worth our time. >> >> Joris >> >> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel > >: >> >>> I've been working on arithmetic/comparison bugs and more recently on >>> performance problems caused by fixing some of those bugs. After trying >>> less-invasive approaches, I've concluded a fairly big fix is called for. >>> This is an RFC for that proposed fix. >>> >>> ------ >>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by >>> making DataFrame arithmetic ops operate column-by-column, dispatching to >>> the Series implementations. This led to a significant performance hit for >>> operations on DataFrames with many columns (#24990, #26061). >>> >>> To restore the lost performance, we need to have these operations take >>> place >>> at the Block level. To prevent DataFrame behavior from diverging from >>> Series >>> behavior (again), we need to retain a single shared implementation. >>> >>> This is a proposal for how to meet these two needs. 
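As a toy illustration of the performance trade-off described in the quoted proposal (the helper functions here are hypothetical sketches, not pandas internals), contrast per-column dispatch with a single operation on the underlying 2D ndarray:

```python
import numpy as np
import pandas as pd

# Toy sketch, not pandas internals: contrast the two dispatch strategies
# on a wide DataFrame.
df = pd.DataFrame(np.random.randn(100, 1000))

def columnwise_add(frame, other):
    # Column-by-column: one Series op (with all its per-op overhead)
    # for each of the 1000 columns.
    return pd.DataFrame({col: frame[col] + other for col in frame.columns})

def blockwise_add(frame, other):
    # Block-level: a single vectorized op on the underlying 2D ndarray.
    return pd.DataFrame(frame.to_numpy() + other,
                        index=frame.index, columns=frame.columns)

# Both give the same result; the block-level version does one numpy
# operation instead of one per column, which is the gap discussed in
# #24990 / #26061.
assert columnwise_add(df, 1).equals(blockwise_add(df, 1))
```

Timing the two with `%timeit` on a wide frame shows the per-column overhead the thread is concerned about.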
>>> >>> Proposal: >>> - Allow EA to support 2D arrays >>> - Use PandasArray to back Block subclasses currently backed by ndarray >>> - Implement arithmetic and comparison ops directly on PandasArray, then >>> have Series, DataFrame, and Index ops pass through to the PandasArray >>> implementations. >>> >>> Fixes: >>> - Performance degradation in DataFrame ops (#24990, #26061) >>> - The last remaining inconsistencies between Index and Series ops >>> (#19322, #18824) >>> - Most of the xfailing arithmetic tests >>> - #22120: Transposing dataframe loses dtype and ExtensionArray >>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has >>> no reshape >>> - #23925 DataFrame Quantile Broken with Datetime Data >>> >>> Other Upsides: >>> - Series constructor could dispatch to pd.array, de-duplicating a lot of >>> code. >>> - Easier to move to Arrow backend if Blocks are numpy-naive. >>> - Make EA closer to a drop-in replacement for np.ndarray, necessary if >>> we want e.g. xarray to find them directly useful (#24716, #24583) >>> - Block/BlockManager simplifications, see below. >>> >>> Downsides: >>> - Existing constructors assume 1D >>> - Existing downstream authors assume 1D >>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM >>> - But for PandasArray they just pass through to nanops, which already >>> have+test the axis kwargs >>> - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one >>> implementing the reductions and am OK with this extra complication. >>> >>> Block Simplifications: >>> - Blocks have three main attributes: values, mgr_locs, and ndim >>> - ndim is _usually_ the same as values.ndim, the exceptions being for >>> cases where type(values) is restricted to 1D >>> - Without these restrictions, we can get rid of: >>> - Block.ndim, associated kludgy ndim-checking code >>> - numerous can-this-be-reshaped/transposed checks and special cases >>> in Block and BlockManager code (which are buggy anyway, e.g. 
#23925) >>> - With ndim gone, we can then get rid of mgr_locs! >>> - The blocks themselves never use mgr_locs except when passing to >>> their own constructors. >>> - mgr_locs makes _much_ more sense as an attribute of the BlockManager >>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA >>> >>> Implementation Strategy: >>> - Remove the 1D restriction >>> - Fairly small tweak, EA subclass must define `shape` instead of >>> `__len__`; other attrs defined in terms of shape. >>> - Define `transpose`, `T`, `reshape`, and `ravel` >>> - With this done, several tasks can proceed in parallel: >>> - simplifications in core.internals, as special-cases for 1D-only can >>> be removed >>> - implement and test arithmetic ops on PandasArray >>> - back Blocks with PandasArray >>> - back Index (and numeric subclasses) with PandasArray >>> - Change DataFrame, Series, Index ops to pass through to underlying >>> Blocks/PandasArrays >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From william.ayd at icloud.com Wed Jun 12 08:53:22 2019 From: william.ayd at icloud.com (William Ayd) Date: Wed, 12 Jun 2019 08:53:22 -0400 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: I'm wary to expand operations done at the Block level. As a core developer for over a year now, I've done zero work with blocks and I think they definitely come at an extra development / maintenance cost. 
I think wide DataFrames are the exception rather than the norm so it's probably not worth code to eke out a 10% performance boost for those (I'm taking that figure from one of your comments in #24990). - Will > On Jun 11, 2019, at 10:08 PM, Tom Augspurger > wrote: > > One general question, motivated by Joris' same concern about the future > simplified BlockManager: why do block-based, rather than column-based, ops > require 2D Extension Arrays? You say > > > by making DataFrame arithmetic ops operate column-by-column, dispatching to > > the Series implementations. > > Could we instead dispatch both Series and DataFrame ops to Block ops (which then > do the op on the ndarray or dispatch to the EA)? If I understand your proposal > correctly, then you still have the general DataFrame -> Block -> Array nesting > doll. It seems like that should work equally well with our current mix of 2-D > and 1-D blocks. > > So while I agree that Blocks being backed by a maybe 1D / maybe 2D array causes > no end of headaches, I don't see why block-based ops need 2D EAs (though I'm not > especially familiar with this area; I could easily be missing something basic). > > - Tom > > On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer > wrote: > Indeed, it's worth considering if perhaps it would be OK to have a performance regression for very wide dataframes instead. > > With regards to xarray, 2D extension arrays are interesting but still not particularly helpful. We would still need a wrapper to make them fully N-D, which we need for our data model. > > On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche > wrote: > Hi Brock, > > Thanks a lot for starting this discussion and the detailed proposal! > > I will try to look at it in more detail tomorrow, but one general remark: from time to time, we talked about "getting rid of the BlockManager" or "simplifying the BlockManager" (although I am not sure if there is any specific github issue about it, might be from in-person discussions). 
One of the interpretations of that (or at least how I understood those discussions) was to get away from the 2D block-based internals, and go to a simpler "table as collection of 1D arrays" model. This would also enable a simplification of the internals / BlockManager and many of the other items you mention. > > So I think we should at least compare a more detailed version of what I described above against your proposal. If we would want to go in that direction long term, I am not sure extensive work on the current 2D blocks-based BlockManager is worth our time. > > Joris > > Op di 11 jun. 2019 om 22:38 schreef Brock Mendel >: > I've been working on arithmetic/comparison bugs and more recently on performance problems caused by fixing some of those bugs. After trying less-invasive approaches, I've concluded a fairly big fix is called for. This is an RFC for that proposed fix. > > ------ > In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making DataFrame arithmetic ops operate column-by-column, dispatching to the Series implementations. This led to a significant performance hit for operations on DataFrames with many columns (#24990, #26061). > > To restore the lost performance, we need to have these operations take place > at the Block level. To prevent DataFrame behavior from diverging from Series > behavior (again), we need to retain a single shared implementation. > > This is a proposal for how to meet these two needs. > > Proposal: > - Allow EA to support 2D arrays > - Use PandasArray to back Block subclasses currently backed by ndarray > - Implement arithmetic and comparison ops directly on PandasArray, then have Series, DataFrame, and Index ops pass through to the PandasArray implementations. 
> > Fixes: > - Performance degradation in DataFrame ops (#24990, #26061) > - The last remaining inconsistencies between Index and Series ops (#19322, #18824) > - Most of the xfailing arithmetic tests > - #22120: Transposing dataframe loses dtype and ExtensionArray > - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no reshape > - #23925 DataFrame Quantile Broken with Datetime Data > > Other Upsides: > - Series constructor could dispatch to pd.array, de-duplicating a lot of code. > - Easier to move to Arrow backend if Blocks are numpy-naive. > - Make EA closer to a drop-in replacement for np.ndarray, necessary if we want e.g. xarray to find them directly useful (#24716, #24583) > - Block/BlockManager simplifications, see below. > > Downsides: > - Existing constructors assume 1D > - Existing downstream authors assume 1D > - Reduction ops (of which there aren't many) don't have axis kwarg ATM > - But for PandasArray they just pass through to nanops, which already have+test the axis kwargs > - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one implementing the reductions and am OK with this extra complication. > > Block Simplifications: > - Blocks have three main attributes: values, mgr_locs, and ndim > - ndim is _usually_ the same as values.ndim, the exceptions being for cases where type(values) is restricted to 1D > - Without these restrictions, we can get rid of: > - Block.ndim, associated kludgy ndim-checking code > - numerous can-this-be-reshaped/transposed checks and special cases in Block and BlockManager code (which are buggy anyway, e.g. #23925) > - With ndim gone, we can then get rid of mgr_locs! > - The blocks themselves never use mgr_locs except when passing to their own constructors. 
> - mgr_locs makes _much_ more sense as an attribute of the BlockManager > - With mgr_locs gone, Block becomes just a thin wrapper around an EA > > Implementation Strategy: > - Remove the 1D restriction > - Fairly small tweak, EA subclass must define `shape` instead of `__len__`; other attrs define in terms of shape. > - Define `transpose`, `T`, `reshape`, and `ravel` > - With this done, several tasks can proceed in parallel: > - simplifications in core.internals, as special-cases for 1D-only can be removed > - implement and test arithmetic ops on PandasArray > - back Blocks with PandasArray > - back Index (and numeric subclasses) with PandasArray > - Change DataFrame, Series, Index ops to pass through to underlying Blocks/PandasArrays > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Wed Jun 12 10:46:37 2019 From: jbrockmendel at gmail.com (Brock Mendel) Date: Wed, 12 Jun 2019 09:46:37 -0500 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: TL;DR: > So while I agree that Blocks being backed by a maybe 1D / maybe 2D array causes no end of headaches For readers who don't find the performance issue compelling, the bugs and complexity this addresses should be compelling. 
-------- > Could we instead dispatch both Series and DataFrame ops to Block ops (which then do the op on the ndarray or dispatch to the EA)? @TomAugspurger Yes, though as mentioned in the OP, my attempts so far to make this work have failed. This suggestion boils down to effectively implementing these ops on Block, which is the opposite of the direction we want to be taking the Block classes. In terms of Separation of Concerns it makes much more sense for the array-like operations to be defined on a dedicated array class, in this case PandasArray. Moreover, implementing them on PandasArray gives us "for free" consistency between Series/DataFrame, Index, and PandasArray ops, whereas implementing them on Block gives only Series/DataFrame consistency. > 10% performance boost for those (I'm taking that figure from one of your comments in #24990). @WillAyd that comment referred to the cost of instantiating the DataFrame, not the arithmetic op. Earlier in that same comment I refer to the arithmetic op as being 10x slower, not 10% slower. > I've done zero work with blocks and I think they definitely come at an extra development / maintenance cost. I've done a bunch of work with blocks, mostly trying to get code _out_ of them. Ignore the entire performance issue: allowing EA to be 2D (heck, even restricted to (1, N) and (N, 1) would be enough!) would let us rip out so much (buggy) code I'll shed tears of joy. On Wed, Jun 12, 2019 at 7:53 AM William Ayd via Pandas-dev < pandas-dev at python.org> wrote: > I'm wary to expand operations done at the Block level. As a core developer > for over a year now, I've done zero work with blocks and I think they > definitely come at an extra development / maintenance cost. > > I think wide DataFrames are the exception rather than the norm so it's > probably not worth code to eke out a 10% performance boost for those (I'm > taking that figure from one of your comments in #24990). 
> > - Will > > On Jun 11, 2019, at 10:08 PM, Tom Augspurger > wrote: > > One general question, motivated by Joris' same concern about the future > simplified BlockManager: why does block-based, rather than column-based, > ops > require 2D Extension Arrays? You say > > > by making DataFrame arithmetic ops operate column-by-column, dispatching > to > > the Series implementations. > > Could we instead dispatch both Series and DataFrame ops to Block ops > (which then > do the op on the ndarray or dispatch to the EA)? If I understand your > proposal > correctly, then you still have the general DataFrame -> Block -> Array > nesting > doll. It seems like that should work equally well with our current mix of > 2-D > and 1-D blocks. > > So while I agree that Blocks being backed by a maybe 1D / maybe 2D array > causes > no end of headaches, I don't see why block-based ops need 2D EAs (though > I'm not > especially familiar with this area; I could easily be missing something > basic). > > - Tom > > On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer wrote: > >> Indeed, it's worth considering if perhaps it would be OK to have a >> performance regression for very wide dataframes instead. >> >> With regards to xarray, 2D extension arrays are interesting but still not >> particularly helpful. We would still need a wrapper to make them fully N-D, >> which we need for our data model. >> >> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi Brock, >>> >>> Thanks a lot for starting this discussion and the detailed proposal! >>> >>> I will try to look at it in more detail tomorrow, but one general >>> remark: from time to time, we talked about "getting rid of the >>> BlockManager" or "simplifying the BlockManager" (although I am not sure if >>> there is any specific github issue about it, might be from in-person >>> discussions). 
One of the interpretations of that (or at least how I >>> understood those discussions) was to get away of the 2D block based >>> internals, and go to a simpler "table as collection of 1D arrays" model. >>> This would also enable a simplication of the internals / BlockManager and >>> many of the other items you mention. >>> >>> So I think we should at least compare a more detailed version of what I >>> described above against your proposal. As if we would want to go in that >>> direction long term, I am not sure extensive work on the current 2D >>> blocks-based BlockManager is worth our time. >>> >>> Joris >>> >>> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel >> >: >>> >>>> I've been working on arithmetic/comparison bugs and more recently on >>>> performance problems caused by fixing some of those bugs. After trying >>>> less-invasive approaches, I've concluded a fairly big fix is called for. >>>> This is an RFC for that proposed fix. >>>> >>>> ------ >>>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by >>>> making DataFrame arithmetic ops operate column-by-column, dispatching to >>>> the Series implementations. This led to a significant performance hit for >>>> operations on DataFrames with many columns (#24990, #26061). >>>> >>>> To restore the lost performance, we need to have these operations take >>>> place >>>> at the Block level. To prevent DataFrame behavior from diverging from >>>> Series >>>> behavior (again), we need to retain a single shared implementation. >>>> >>>> This is a proposal for how meet these two needs. >>>> >>>> Proposal: >>>> - Allow EA to support 2D arrays >>>> - Use PandasArray to back Block subclasses currently backed by ndarray >>>> - Implement arithmetic and comparison ops directly on PandasArray, then >>>> have Series, DataFrame, and Index ops pass through to the PandasArray >>>> implementations. 
>>>> >>>> Fixes: >>>> - Performance degradation in DataFrame ops (#24990, #26061) >>>> - The last remaining inconsistencies between Index and Series ops >>>> (#19322, #18824) >>>> - Most of the xfailing arithmetic tests >>>> - #22120: Transposing dataframe loses dtype and ExtensionArray >>>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has >>>> no reshape >>>> - #23925 DataFrame Quantile Broken with Datetime Data >>>> >>>> Other Upsides: >>>> - Series constructor could dispatch to pd.array, de-duplicating a lot >>>> of code. >>>> - Easier to move to Arrow backend if Blocks are numpy-naive. >>>> - Make EA closer to a drop-in replacement for np.ndarray, necessary if >>>> we want e.g. xarray to find them directly useful (#24716, #24583) >>>> - Block/BlockManager simplifications, see below. >>>> >>>> Downsides: >>>> - Existing constructors assume 1D >>>> - Existing downstream authors assume 1D >>>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM >>>> - But for PandasArray they just pass through to nanops, which >>>> already have+test the axis kwargs >>>> - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one >>>> implementing the reductions and am OK with this extra complication. >>>> >>>> Block Simplifications: >>>> - Blocks have three main attributes: values, mgr_locs, and ndim >>>> - ndim is _usually_ the same as values.ndim, the exceptions being for >>>> cases where type(values) is restricted to 1D >>>> - Without these restrictions, we can get rid of: >>>> - Block.ndim, associated kludgy ndim-checking code >>>> - numerous can-this-be-reshaped/transposed checks and special cases >>>> in Block and BlockManager code (which are buggy anyway, e.g. #23925) >>>> - With ndim gone, we can then get rid of mgr_locs! >>>> - The blocks themselves never use mgr_locs except when passing to >>>> their own constructors. 
>>>> - mgr_locs makes _much_ more sense as an attribute of the >>>> BlockManager >>>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA >>>> >>>> Implementation Strategy: >>>> - Remove the 1D restriction >>>> - Fairly small tweak, EA subclass must define `shape` instead of >>>> `__len__`; other attrs define in terms of shape. >>>> - Define `transpose`, `T`, `reshape`, and `ravel` >>>> - With this done, several tasks can proceed in parallel: >>>> - simplifications in core.internals, as special-cases for 1D-only >>>> can be removed >>>> - implement and test arithmetic ops on PandasArray >>>> - back Blocks with PandasArray >>>> - back Index (and numeric subclasses) with PandasArray >>>> - Change DataFrame, Series, Index ops to pass through to underlying >>>> Blocks/PandasArrays >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom.augspurger88 at gmail.com Wed Jun 12 11:56:02 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Wed, 12 Jun 2019 10:56:02 -0500 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: On Wed, Jun 12, 2019 at 9:46 AM Brock Mendel wrote: > TL;DR: > > > So while I agree that Blocks being backed by a maybe 1D / maybe 2D array > causes no end of headaches > > For readers who don't find the performance issue compelling, the bugs and > complexity this addresses should be compelling. > > -------- > > > Could we instead dispatch both Series and DataFrame ops to Block ops > (which then > do the op on the ndarray or dispatch to the EA)? > > @TomAugspurger Yes, though as mentioned in the OP, my attempts so far to > make this work have failed. > > This suggestion boils down to effectively implementing these ops on Block, > which is the opposite of the direction we want to be taking the Block > classes. In terms of Separation of Concerns it makes much more sense for > the array-like operations to be defined on a dedicated array class, in this > case PandasArray. > I think we're in agreement here. Moreover, implementing them on PandasArray gives us "for free" consistency > between Series/DataFrame, Index, and PandasArray ops, whereas implementing > them on Block gives only Series/DataFrame consistency. > > > 10% performance boost for those (I?m taking that figure from one of your > comments in #24990). > > @WillAyd that comment referred to the cost of instantiating the DataFrame, > not the arithmetic op. Earlier in that same comment I refer to the > arithmetic op as being 10x slower, not 10% slower. > > > I?ve done zero work with blocks and I think they definitely come at an > extra development / maintenance cost. > > I've done a bunch of work with blocks, mostly trying to get code _out_ of > them. Ignore the entire performance issue: allowing EA to be 2D (heck, > even restricted to (1, N) and (N, 1) would be enough!) 
would let us rip out > so much (buggy) code I'll shed tears of joy. > > Stepping back a bit, I see two potential issues we'd like to solve 1. The current structure of - Container (dataframe, series, index) -> - Block (DataFrame / Series only) -> - Array (ndarray or EA) is bad for two reasons: first, Indexes don't have Blocks; this argues for putting more functionality on the Array, to share code between all the containers; second, Array can be an ndarray or an EA. They're different enough that EA isn't a drop-in replacement for ndarray. 2. Arrays being either 1D or 2D causes many issues. A few questions Q1: Do those two issues accurately capture your concerns as well? Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be 2D internally (and Series / Index would squeeze before data gets back to the user)? Otherwise, I don't see how we get the internal simplification. Q3: What do you think about a simple, private PandasArray-like thing that *is* allowed to be 2D, and itself wraps a 2D ndarray? That solves my problem 1, but doesn't address problem 2. Tom > On Wed, Jun 12, 2019 at 7:53 AM William Ayd via Pandas-dev < > pandas-dev at python.org> wrote: > >> I?m wary to expand operations done at the Block level. As a core >> developer for over a year now, I?ve done zero work with blocks and I think >> they definitely come at an extra development / maintenance cost. >> >> I think wide DataFrames are the exception rather than the norm so it?s >> probably not worth code to eek out a 10% performance boost for those (I?m >> taking that figure from one of your comments in #24990). >> >> - Will >> >> On Jun 11, 2019, at 10:08 PM, Tom Augspurger >> wrote: >> >> One general question, motivated by Joris' same concern about the future >> simplified BlockManager: why does block-based, rather than column-based, >> ops >> require 2D Extension Arrays? 
You say >> >> > by making DataFrame arithmetic ops operate column-by-column, >> dispatching to >> > the Series implementations. >> >> Could we instead dispatch both Series and DataFrame ops to Block ops >> (which then >> do the op on the ndarray or dispatch to the EA)? If I understand your >> proposal >> correctly, then you still have the general DataFrame -> Block -> Array >> nesting >> doll. It seems like that should work equally well with our current mix of >> 2-D >> and 1-D blocks. >> >> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array >> causes >> no end of headaches, I don't see why block-based ops need 2D EAs (though >> I'm not >> especially familiar with this area; I could easily be missing something >> basic). >> >> - Tom >> >> On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer wrote: >> >>> Indeed, it's worth considering if perhaps it would be OK to have a >>> performance regression for very wide dataframes instead. >>> >>> With regards to xarray, 2D extension arrays are interesting but still >>> not particularly helpful. We would still need a wrapper to make them fully >>> N-D, which we need for our data model. >>> >>> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> Hi Brock, >>>> >>>> Thanks a lot for starting this discussion and the detailed proposal! >>>> >>>> I will try to look at it in more detail tomorrow, but one general >>>> remark: from time to time, we talked about "getting rid of the >>>> BlockManager" or "simplifying the BlockManager" (although I am not sure if >>>> there is any specific github issue about it, might be from in-person >>>> discussions). One of the interpretations of that (or at least how I >>>> understood those discussions) was to get away from the 2D block-based >>>> internals, and go to a simpler "table as collection of 1D arrays" model.
>>>> This would also enable a simplification of the internals / BlockManager and >>>> many of the other items you mention. >>>> >>>> So I think we should at least compare a more detailed version of what I >>>> described above against your proposal. Because if we want to go in that >>>> direction long term, I am not sure extensive work on the current 2D >>>> blocks-based BlockManager is worth our time. >>>> >>>> Joris >>>> >>>> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel < >>>> jbrockmendel at gmail.com>: >>>> >>>>> I've been working on arithmetic/comparison bugs and more recently on >>>>> performance problems caused by fixing some of those bugs. After trying >>>>> less-invasive approaches, I've concluded a fairly big fix is called for. >>>>> This is an RFC for that proposed fix. >>>>> >>>>> ------ >>>>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by >>>>> making DataFrame arithmetic ops operate column-by-column, dispatching to >>>>> the Series implementations. This led to a significant performance hit for >>>>> operations on DataFrames with many columns (#24990, #26061). >>>>> >>>>> To restore the lost performance, we need to have these operations take >>>>> place >>>>> at the Block level. To prevent DataFrame behavior from diverging from >>>>> Series >>>>> behavior (again), we need to retain a single shared implementation. >>>>> >>>>> This is a proposal for how to meet these two needs. >>>>> >>>>> Proposal: >>>>> - Allow EA to support 2D arrays >>>>> - Use PandasArray to back Block subclasses currently backed by ndarray >>>>> - Implement arithmetic and comparison ops directly on PandasArray, >>>>> then have Series, DataFrame, and Index ops pass through to the PandasArray >>>>> implementations.
>>>>> >>>>> Fixes: >>>>> - Performance degradation in DataFrame ops (#24990, #26061) >>>>> - The last remaining inconsistencies between Index and Series ops >>>>> (#19322, #18824) >>>>> - Most of the xfailing arithmetic tests >>>>> - #22120: Transposing dataframe loses dtype and ExtensionArray >>>>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has >>>>> no reshape >>>>> - #23925 DataFrame Quantile Broken with Datetime Data >>>>> >>>>> Other Upsides: >>>>> - Series constructor could dispatch to pd.array, de-duplicating a lot >>>>> of code. >>>>> - Easier to move to Arrow backend if Blocks are numpy-naive. >>>>> - Make EA closer to a drop-in replacement for np.ndarray, necessary if >>>>> we want e.g. xarray to find them directly useful (#24716, #24583) >>>>> - Block/BlockManager simplifications, see below. >>>>> >>>>> Downsides: >>>>> - Existing constructors assume 1D >>>>> - Existing downstream authors assume 1D >>>>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM >>>>> - But for PandasArray they just pass through to nanops, which >>>>> already have+test the axis kwargs >>>>> - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one >>>>> implementing the reductions and am OK with this extra complication. >>>>> >>>>> Block Simplifications: >>>>> - Blocks have three main attributes: values, mgr_locs, and ndim >>>>> - ndim is _usually_ the same as values.ndim, the exceptions being for >>>>> cases where type(values) is restricted to 1D >>>>> - Without these restrictions, we can get rid of: >>>>> - Block.ndim, associated kludgy ndim-checking code >>>>> - numerous can-this-be-reshaped/transposed checks and special cases >>>>> in Block and BlockManager code (which are buggy anyway, e.g. #23925) >>>>> - With ndim gone, we can then get rid of mgr_locs! >>>>> - The blocks themselves never use mgr_locs except when passing to >>>>> their own constructors. 
>>>>> - mgr_locs makes _much_ more sense as an attribute of the >>>>> BlockManager >>>>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA >>>>> >>>>> Implementation Strategy: >>>>> - Remove the 1D restriction >>>>> - Fairly small tweak, EA subclass must define `shape` instead of >>>>> `__len__`; other attrs define in terms of shape. >>>>> - Define `transpose`, `T`, `reshape`, and `ravel` >>>>> - With this done, several tasks can proceed in parallel: >>>>> - simplifications in core.internals, as special-cases for 1D-only >>>>> can be removed >>>>> - implement and test arithmetic ops on PandasArray >>>>> - back Blocks with PandasArray >>>>> - back Index (and numeric subclasses) with PandasArray >>>>> - Change DataFrame, Series, Index ops to pass through to underlying >>>>> Blocks/PandasArrays >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... 
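The performance claim that runs through this thread — per-column dispatch on a wide frame being roughly an order of magnitude slower than one vectorized op over a consolidated 2D block — can be illustrated outside pandas with plain numpy. This is a toy sketch under stated assumptions, not the actual pandas code path; the function names are invented for illustration, and the real slowdown (the "10x" from #24990) also involves Series construction overhead not modeled here.

```python
import numpy as np

def op_blockwise(values):
    # One vectorized call over the whole consolidated 2D block.
    return values + 1

def op_columnwise(values):
    # Column-by-column, mimicking dispatch to per-column (Series-like)
    # implementations; the per-column Python-level overhead dominates
    # when there are many short columns.
    return np.column_stack([values[:, i] + 1 for i in range(values.shape[1])])

# A "wide" frame: 10 rows, 1000 columns of a single dtype.
values = np.arange(10_000, dtype=np.float64).reshape(10, 1_000)

# Both paths produce the same result; they differ only in how much
# Python-level work happens per column.
assert np.array_equal(op_blockwise(values), op_columnwise(values))
```

Timing the two functions (e.g. with `timeit`) on such a wide array shows the per-column loop paying a fixed Python cost per column, which is the shape of the regression the proposal aims to undo.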
URL: From jeffreback at gmail.com Wed Jun 12 12:17:44 2019 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 12 Jun 2019 12:17:44 -0400 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: The reason we have Blocks in the first place is that performance on same-dtype data is much better with 2D containers. I don't think we can match that if we hold things as 1D and dispatch ops via Python. So longer term the solution is to use a table of 1D arrays and use pyarrow (as holder, with kernels to operate), or hold as 1D and use something like numba to perform the kernel operations. So we either need to keep the Blocks around as a way to hold things and do block operations (i.e. we actually hold things as 2D), or change to holding 1D as indicated above. Now we currently have a hybrid approach: numpy-array-backed blocks are 2D, while EA arrays are 1D. I fully agree this hybrid approach is, and has been, the cause of many issues. Our contract on EA is 1D and I agree that changing this is not a good idea (at least publicly). So here's another proposal (a bit half-baked, but...): You *could* build a single-dtype container that actually holds the 1D arrays themselves. Then you could put EA arrays and numpy arrays on the same footing. Meaning each 'Block' would be exactly the same.
- This would make operations the *same* across all 'Blocks', reducing complexity - We could simply take views on 2D numpy arrays to actually avoid a performance penalty of copying (as we construct from a 2D numpy array a lot); this causes some aggregation ops to be much slower than if we actually copy, but that has a cost too - Ops could be defined on EA & Pandas Arrays; these can then operate array-by-array (within a Block), or using numba we could implement the ops in a way that we can get a pretty big speedup for a particular kernel Jeff On Wed, Jun 12, 2019 at 11:56 AM Tom Augspurger wrote: > > > On Wed, Jun 12, 2019 at 9:46 AM Brock Mendel > wrote: > >> TL;DR: >> >> > So while I agree that Blocks being backed by a maybe 1D / maybe 2D >> array causes no end of headaches >> >> For readers who don't find the performance issue compelling, the bugs and >> complexity this addresses should be compelling. >> >> -------- >> >> > Could we instead dispatch both Series and DataFrame ops to Block ops >> (which then >> do the op on the ndarray or dispatch to the EA)? >> >> @TomAugspurger Yes, though as mentioned in the OP, my attempts so far to >> make this work have failed. >> >> This suggestion boils down to effectively implementing these ops on >> Block, which is the opposite of the direction we want to be taking the >> Block classes. In terms of Separation of Concerns it makes much more sense >> for the array-like operations to be defined on a dedicated array class, in >> this case PandasArray. >> > > I think we're in agreement here. > > Moreover, implementing them on PandasArray gives us "for free" consistency >> between Series/DataFrame, Index, and PandasArray ops, whereas implementing >> them on Block gives only Series/DataFrame consistency. >> >> > 10% performance boost for those (I'm taking that figure from one of >> your comments in #24990). >> >> @WillAyd that comment referred to the cost of instantiating the >> DataFrame, not the arithmetic op.
Earlier in that same comment I refer to >> the arithmetic op as being 10x slower, not 10% slower. >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jorisvandenbossche at gmail.com Wed Jun 12 16:41:49 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 12 Jun 2019 22:41:49 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: I find the idea of paying people (the core devs) for their maintenance work intriguing, but also see a lot of obstacles (the ones Ralf mentioned, what part of our volunteer time is considered maintenance work, who are active maintainers, do we tie certain expectations to rewarding money, ...?). I am not sure how it would work out in practice. In any case something to discuss, as we should also discuss other possible ways that we might want to spend the money if not paying the current maintainers. Just to be clear: even if we go with a NumFOCUS - Tidelift agreement, we can decide to pay ourselves instead of letting Tidelift directly do that. So that discussion can be done separately from deciding on going forward with NumFOCUS making a contract with Tidelift (I don't think I heard any pushback on that aspect itself, but just wanted to state this explicitly). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From gfyoung17 at gmail.com Wed Jun 12 16:43:44 2019 From: gfyoung17 at gmail.com (G Young) Date: Wed, 12 Jun 2019 13:43:44 -0700 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: I second Joris' point here. 
Logistically, it seems easier to determine the details of where the $$$ goes AFTER it comes through from Tidelift to NUMFOCUS On Wed, Jun 12, 2019 at 1:42 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > I find the idea of paying people (the core devs) for their maintenance > work intriguing, but also see a lot of obstacles (the ones Ralf mentioned, > what part of our volunteer time is considered maintenance work, who are > active maintainers, do we tie certain expectations to rewarding money, > ...?). I am not sure how it would work out in practice. In any case > something to discuss, as we should also discuss other possible ways that we > might want to spend the money if not paying the current maintainers. > > Just to be clear: even if we go with a NumFOCUS - Tidelift agreement, we > can decide to pay ourselves instead of letting Tidelift directly do that. > So that discussion can be done separately from deciding on going forward > with NumFOCUS making a contract with Tidelift (I don't think I heard any > pushback on that aspect itself, but just wanted to state this explicitly). > > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed Jun 12 16:55:52 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 12 Jun 2019 22:55:52 +0200 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: Op wo 12 jun. 2019 om 18:18 schreef Jeff Reback : > ... > So here's another proposal (a bit half-baked but....): > > You *could* build a single dtyped container that actually holds the 1D > arrays themselves). Then you could put EA arrays and numpy arrays on the > same footing. Meaning each > 'Block' would be exactly the same. 
> > - This would make operations the *same* across all 'Blocks', reducing > complexity > - We could simply take views on 2D numpy arrays to actually avoid a > performance penalty of copying (as we construct from a 2D numpy array > a lot); this causes some aggregation ops to be much slower than if we > actually copy, but that has a cost too > - Ops could be defined on EA & Pandas Arrays; these can then operate > array-by-array (within a Block), or using numba we could implement the ops > in a way that we can get a pretty big speedup for a particular kernel > How would this proposal avoid the above-mentioned performance implication of doing ops column-by-column? In general, I think we should try to do a few basic benchmarks on what the performance impact would be for some typical use cases when all ops are done column-by-column / all columns are stored as separate blocks (Jeff had a branch at some point that made this optional), to have a better idea of the (dis)advantages of the different proposals. Brock, can you give your thoughts about the idea of having _only_ 1D Blocks? That could also solve a lot of the complexity of the internals (no 1D vs 2D) and have many of the advantages you mentioned in the first email, but I don't think you really answered that aspect. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From garcia.marc at gmail.com Thu Jun 13 10:18:51 2019 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 13 Jun 2019 15:18:51 +0100 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: From my side, happy to receive funds from anyone as long as they don't expect anything unreasonable in exchange. And happy with the Tidelift conditions, sounds like a great deal to me. To me it makes sense to spend the money on grants, like the NumFOCUS grant.
Based on the GSoC experience, where nobody from the team was interested in mentoring junior people, I guess the grants would mainly be used by maintainers or senior developers. While we're still far from FTEs, I think the money we're discussing starts to be significant for people who could do freelance work,... Based on Wes's comments, I'd also be happy to give grants for work already done, if anyone would like to claim that. I think other formulas like giving the money to maintainers directly,... will be too complex/controversial. Regarding what is expected in return for the money, I think it's worth seeing what Quansight is doing. We can check with Travis the exact details, but my understanding is that companies can give funds to Quansight to get the things in the project roadmaps done. The roadmaps of the projects they support are here: https://www.quansight.com/projects I think having a "paid work" roadmap would be great, so as a team we define what the top priorities are, and if we find the right people we use the funds for the things that are not making progress through volunteer work. I think Wes, Travis and other people will have experience with this format, happy to follow their advice. Slightly unrelated, but we're getting some progress with the docs. We're building them in Azure (available now at https://dev.pandas.io) and getting close to building without warnings (and validating them in the CI), so we can speed up changes. After that I'll work on changing the theme, and will propose to have a "fancy" home page which IMO we can use to give credit to companies supporting the project. Like what we're doing at https://pandas.pydata.org/about.html, but more visible and hopefully with many more visits. You can see a proof of concept from some time ago here: http://dev.pandas.io/pandas-sphinx-theme/pr-datapythonista_base/.
If people are happy with the idea, we should probably define formally which companies are listed there (companies employing maintainers to work on pandas, big donors,..). On Wed, Jun 12, 2019 at 9:46 PM G Young wrote: > I second Joris' point here. Logistically, it seems easier to determine > the details of where the $$$ goes AFTER it comes through from Tidelift to > NumFOCUS > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jbrockmendel at gmail.com Thu Jun 13 13:19:46 2019 From: jbrockmendel at gmail.com (Brock Mendel) Date: Thu, 13 Jun 2019 10:19:46 -0700 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: > can you give your thoughts about the idea of having _only_ 1D Blocks? That could also solve a lot of the complexity of the internals (no 1D vs 2D) and have many of the advantages you mentioned in the first email, but I don't think you really answered that aspect. My read from the earlier email was that you were going to present a more detailed proposal for what this would look like. Going all-1D would solve the sometimes-1D-sometimes-2D problem, but I think it would cause real problems for transpose and reductions with axis=1 (I've seen it argued that these are not common use cases). It also wouldn't change the fact that we need BlockManager or something like it to do alignment. Getting array-like operations out of BlockManager/Block and into PandasArray could work with the all-1D idea. > 1. The current structure of [...] They're different enough that EA isn't a drop-in replacement for ndarray. > 2. Arrays being either 1D or 2D causes many issues. > Q1: Do those two issues accurately capture your concerns as well? Pretty much, yes. > Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be 2D internally (and Series / Index would squeeze before data gets back to the user)? Otherwise, I don't see how we get the internal simplification. My thought was that only EAs backing Blocks inside DataFrames would get reshaped. Everything else would retain their existing dimensions. It isn't clear to me why you'd want 2D backing Index/Series, though I'm open to being convinced. Demonstrating the internal simplification may require a Proof of Concept. > Our contract on EA is 1D and I agree that changing this is not a good idea (at least publicly).
Another slightly hacky option would be to secretly allow EAs to be temporarily 2D while backing DataFrame Blocks, but restrict that to just (N, 1) or (1, N). That wouldn't do anything to address the arithmetic performance, but might let us de-kludge the 1D/2D code in core.internals. On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Op wo 12 jun. 2019 om 18:18 schreef Jeff Reback : > >> ... >> > So here's another proposal (a bit half-baked but....): >> >> You *could* build a single dtyped container that actually holds the 1D >> arrays themselves). Then you could put EA arrays and numpy arrays on the >> same footing. Meaning each >> 'Block' would be exactly the same. >> >> - This would make operations the *same* across all 'Blocks', reducing >> complexity >> - We could simply take views on 2D numpy arrays to actually avoid a >> performance penaltly of copying (as we can construct from a 2D numpy array >> a lot); this causes some aggregation ops to be much slower that if we >> actually copy, but that has a cost too >> - Ops could be defined on EA & Pandas Arrays; these can then operate >> array-by-array (within a Block), or using numba we could implement the ops >> in a way that we can get a pretty big speedup for a particular kernel >> > > How would this proposal avoid the above-mentioned performance implication > of doing ops column-by-column? > > In general, I think we should try to do a few basic benchmarks on what the > performance impact would be for some typical use cases when all ops are > done column-by-column / all columns are stored as separate blocks (Jeff had > a branch at some point that made this optional). To have a better idea of > the (dis)advantages for the different proposals. > > Brock, can you given your thoughts about the idea of having _only_ 1D > Blocks? 
That could also solve a lot of the complexity of the internals (no > 1D vs 2D) and have many of the advantages you mentioned in the first email, > but I don't think you really answered that aspect. > > Joris > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Jun 13 13:28:59 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 13 Jun 2019 19:28:59 +0200 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: Op do 13 jun. 2019 om 19:19 schreef Brock Mendel : > > can you give your thoughts about the idea of having _only_ 1D Blocks? > That could also solve a lot of the complexity of the internals (no 1D vs > 2D) and have many of the advantages you mentioned in the first email, but I > don't think you really answered that aspect. > > My read from the earlier email was that you were going to present a more > detailed proposal for what this would look like. > A bit simplistic, but basically the same as your proposal but all 1D instead of all 2D: all blocks are 1D (solves inconsistency), use custom EA for the numpy based ones (the current PandasArray), ops are defined on the arrays and Series/Index/DataFrame dispatch to them (which also ensures that a column in a dataframe or a series will be handled consistently). The rest of your Block simplifications and Implementation strategy could (I think) more or less be translated to the above as well. > Going all-1D would solve the sometimes-1D-sometimes-2D problem, but I > think it would cause real problems for transpose and reductions with axis=1 > (I've seen it argued that these are not common use cases). > For a DataFrame with only a single dtype, those operations will certainly be a lot slower, but for heterogeneous maybe not that much. 
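[Editor's note: as a rough illustration of the kind of benchmark Joris is asking for, the sketch below (hypothetical code, not from pandas or this thread) applies the same arithmetic op block-wise to one consolidated 2D array and column-by-column to separate 1D arrays. The per-column layout pays a fixed Python/ufunc dispatch cost per column, which dominates for wide, single-dtype frames.]

```python
# Toy benchmark: block-wise vs column-by-column arithmetic on the same data.
# All names here are illustrative; this is not pandas internals.
import timeit
import numpy as np

n_rows, n_cols = 10_000, 100
block = np.random.randn(n_rows, n_cols)                # one consolidated 2D "block"
columns = [block[:, i].copy() for i in range(n_cols)]  # the all-1D layout

def op_blockwise():
    return block * 2 + 1

def op_columnwise():
    return [col * 2 + 1 for col in columns]

# The two layouts produce identical results; only call overhead differs.
expected = op_blockwise()
for i, col in enumerate(op_columnwise()):
    assert np.allclose(expected[:, i], col)

t_block = timeit.timeit(op_blockwise, number=50)
t_cols = timeit.timeit(op_columnwise, number=50)
print(f"block-wise: {t_block:.3f}s  column-wise: {t_cols:.3f}s")
```

The absolute gap depends on n_rows and n_cols; for long, narrow, mixed-dtype frames the per-column overhead is amortized, which is Joris's point about heterogeneous DataFrames.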
> It also wouldn't change the fact that we need BlockManager or something > like it to do alignment. Getting array-like operations out of > BlockManager/Block and into PandasArray could work with the all-1D idea. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Jun 13 14:29:51 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 13 Jun 2019 13:29:51 -0500 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel wrote: > > can you given your thoughts about the idea of having _only_ 1D Blocks? > That could also solve a lot of the complexity of the internals (no 1D vs > 2D) and have many of the advantages you mentioned in the first email, but I > don't think you really answered that aspect. > > My read from the earlier email was that you were going to present a more > detailed proposal for what this would look like. Going all-1D would solve > the sometimes-1D-sometimes-2D problem, but I think it would cause real > problems for transpose and reductions with axis=1 (I've seen it argued that > these are not common use cases). It also wouldn't change the fact that we > need BlockManager or something like it to do alignment. Getting array-like > operations out of BlockManager/Block and into PandasArray could work with > the all-1D idea. > > > 1. The current structure of [...] They're different enough that EA isn't > a drop-in replacement for ndarray. > > 2. Arrays being either 1D or 2D causes many issues. > > Q1: Do those two issues accurately capture your concerns as well? > > Pretty much, yes. > > > Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be > 2D internally (and Series / Index would squeeze before data gets back to > the user)? Otherwise, I don't see how we get the internal simplification. > > My thought was that only EAs backing Blocks inside DataFrames would get > reshaped. 
Everything else would retain their existing dimensions. It > isn't clear to me why you'd want 2D backing Index/Series, though I'm open > to being convinced. > I think I was missing a subtle point; We'll still have a mix of 1-D and 2-D blocks under your proposal. What we *won't* have is cases where the Block.shape doesn't match the Block.values.shape? ``` In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2], dtype='Int64')}) In [10]: df._data.blocks Out[10]: (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64, ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64) In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape Out[11]: True In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape Out[12]: False ``` So assuming we want the laudable goal of Block.{ndim,shape} == Block.values.{ndim,shape}, we have two options 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just (N, 1) or (1, N)). Is that fair? Doing number 1 sounds really bad / difficult, right? Demonstrating the internal simplification may require a Proof of Concept. > > > Our contract on EA is 1D and I agree that changing this is not a good > idea (at least publicly). > > Another slightly hacky option would be to secretly allow EAs to be > temporarily 2D while backing DataFrame Blocks, but restrict that to just > (N, 1) or (1, N). That wouldn't do anything to address the arithmetic > performance, but might let us de-kludge the 1D/2D code in core.internals. > I would be curious to see this. And I think it would help with the *regression* in arithmetic performance, right? Since ndarray-backed blocks would be allowed to be (N, P)? > On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Op wo 12 jun. 2019 om 18:18 schreef Jeff Reback : >> >>> ... 
>>> [...] 
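[Editor's note: Jeff's idea of "a single dtyped container that actually holds the 1D arrays themselves" might look roughly like the sketch below. The names (ColumnBlock, apply) and the structure are invented for illustration only and are not pandas internals.]

```python
# Hypothetical sketch: one block type holding per-column 1D arrays, so
# numpy-backed and EA-backed columns share the same code path, and ops
# run array-by-array within the block.
import numpy as np

class ColumnBlock:
    """One dtype, many columns, each column stored as a separate 1D array."""

    def __init__(self, arrays, locs):
        assert all(arr.ndim == 1 for arr in arrays)
        self.arrays = list(arrays)   # 1D arrays (can be views on a 2D block)
        self.locs = list(locs)       # column positions within the DataFrame

    def apply(self, func):
        # The same loop works whether arrays are ndarrays or EAs.
        return ColumnBlock([func(arr) for arr in self.arrays], self.locs)

base = np.arange(6.0).reshape(3, 2)
blk = ColumnBlock([base[:, 0], base[:, 1]], locs=[0, 1])  # views, no copy
doubled = blk.apply(lambda a: a * 2)
print(doubled.arrays[1])   # [ 2.  6. 10.]
```

Because the columns can be views on an existing 2D array, construction from a 2D ndarray stays cheap, at the cost Jeff notes for some aggregation ops.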
>> >> Joris >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From me at pietrobattiston.it Thu Jun 13 14:58:28 2019 From: me at pietrobattiston.it (Pietro Battiston) Date: Thu, 13 Jun 2019 20:58:28 +0200 Subject: [Pandas-dev] Tidelift In-Reply-To: References: <57676A0D-A1E8-441B-B60D-AA64ADD9184B@icloud.com> Message-ID: <5fafd563dee86aa1907ec90faab4b04be6a1d135.camel@pietrobattiston.it> I agree with Jeff that having the money gives us options that we don't have otherwise. So I'm generally +1 on the Tidelift offer. As to how to spend it, I think - we can certainly create topic-specific grants for freelancers when it makes sense, we just do not want the pressure to necessarily create grants just because we have the money - for what is not spent on grants, I suspect a simple online sheet where each core dev each month states a) number of hours devoted to pandas and b) use for the money ("in my pocket" vs. "in personal Python-related tickets/expenses" vs. "in a fund for pandas expenses" - although the latter options would probably be the same from the project's side) would I think solve the hassle-profit tradeoff in the simplest way. Pietro Il giorno mar, 11/06/2019 alle 21.34 -0400, Jeff Reback ha scritto: > I > > One risk to be aware of is that if a high profile > > project like pandas take's TL's money and none of the maintainers > > pay > > themselves with it, then the monthly number may not have as much of > > a > > chance of increasing (since current or prospective TL customers may > > observe that the subscription dollars aren't being used in the way > > that is being pitched). 
> > I actually see the exact opposite here. A project of pandas stature > that decides to better the project is a pretty respectable goal. > > I believe we would be in the letter and more importantly the spirit > of Tidelift for the pandas project itself to take this burden & > receive the income. Having the project itself with the combined > force of multiple maintainers actually would be much more comforting > (from the customer's perspective), than a single maintainer (who may > not always be there). > > Furthermore, we could use these funds for the combined benefit of the > project, mainly I think for gatherings like the upcoming sprints. I > am not sure many of you know, but pandas has not actively solicited > *any* monies, and only received 2 largish contributions over the > years, which are the majority of our current funds. The tidelift > agreement looks to provide a stream of income which we currently do > not have. With an income stream we have options; without we don't. > > We can always decide to remunerate maintainers who contribute to this > effort, though, this should be a separate discussion. > > Jeff > > On Tue, Jun 11, 2019 at 5:10 PM Wes McKinney > wrote: > > On Tue, Jun 11, 2019 at 2:26 PM Andy Ray Terrel < > > andy.terrel at gmail.com> wrote: > > > > > > > > > > > > On Tue, Jun 11, 2019 at 12:51 PM Wes McKinney < > > wesmckinn at gmail.com> wrote: > > >> > > >> hi, > > >> > > >> On Tue, Jun 11, 2019 at 11:16 AM Ralf Gommers < > > ralf.gommers at gmail.com> wrote: > > >> > > > >> > > > >> > > > >> > On Tue, Jun 11, 2019 at 4:56 PM Andy Ray Terrel < > > andy.terrel at gmail.com> wrote: > > >> >> > > >> >> While the original lifter agreement was an individual > > contract, in our negotiations with Tidelift, NumFOCUS has > > explicitly sought a model that allows the project to split the > > money how they prefer. This was always Tidelift's intention, it was > > just faster and easier to scale to focus on paying individuals. 
> > >> > > > >> > > > >> > +1 the project decides for themselves is the intent and a good > > principle. > > >> > > > >> >> > > >> >> I do like the idea of paying for maintence work, I would > > recommend we set up folks as contractors with NumFOCUS rather than > > just pocketing money. It will give a lot more legal protection. > > Then if some folks don't want to take the cash you they can donate > > their time and be recognized as in-kind donations, which might have > > some tax deductions. > > >> > > > >> > > > >> > Keep in mind that this has a lot of potential issues. > > Examples: > > >> > 1. Who decides who gets paid, and how? The pandas repo has > > 1500+ contributors. Lots of potential for friction over small > > amount of $. > > >> > > >> More or less the _entire_ point of Tidelift is to incentivize > > people > > >> to do more maintenance work. I think it's worth at least > > attempting to > > >> use this money for its intended economic purpose. > > >> > > >> The maintainers are, as a first approximation, the ~10-15 active > > core > > >> members listed on > > >> > > >> https://github.com/pandas-dev/pandas-governance > > >> > > >> IMHO those are the people that should get paid (going forward) > > -- if > > >> contributors are more motivated to become core team members / > > >> maintainers as a result of the Tidelift money, then it has had > > the > > >> desired outcome. > > > > > > > > > I would suggest leaving the decision to the project core team > > with the project Numfocus committee to be the overseer of the > > implementation. > > > > > > > Yes, of course, that's the governance that we have in place. I am > > just > > stressing that we should try to honor the intent of the asset that > > is > > being purchased by Tidelift customers. 
Tidelift is telling their > > customers that the money they are paying is going to end up in the > > pockets of the project maintainers > > > > https://tidelift.com/about/lifter > > > > If the pandas core team wishes to deny themselves the income > > (which, > > divided up, isn't going to be a life-changing amount of money) > > that's > > their prerogative -- I just wanted to be clear about where I stand > > on > > it, and there's nothing immoral about wanting to be compensated for > > one's time (given how much volunteered time has already gone > > uncompensated). One risk to be aware of is that if a high profile > > project like pandas take's TL's money and none of the maintainers > > pay > > themselves with it, then the monthly number may not have as much of > > a > > chance of increasing (since current or prospective TL customers may > > observe that the subscription dollars aren't being used in the way > > that is being pitched). > > > > >> > > >> > > >> > 2. Many people have employment contracts, those typically > > forbid contracting on the side. So inherently unfair to distribute > > only to those who are in a position to accept the money. > > >> > > >> This is true -- at least Jeff and maybe others fall into this > > >> category. In such cases their "cut" of the maintenance funds can > > go > > >> into the communal fund to pay for other stuff > > >> > > > > > > Yes such accommodation will need to be worked out. > > > > > >> > > >> > 3. You're now introducing lots of extra paperwork and admin, > > both directly and indirectly (who wants to deal with the extra > > complications when filing your taxes?). > > >> > > >> Hopefully we're talking just a 1099 from NumFOCUS with a single > > number > > >> to type in, but I'm the wrong person to judge since my taxes are > > more > > >> complicated than most people's =) > > > > > > > > > Generally it is done that way for US based folks and for folks > > out of the US we tend to let them handle their own taxes. 
We would > > need to work that out. > > > > > > Additionally, as in all dealings with businesses, we do the extra > > paperwork for the other benefits such as limiting the liability of > > a maintainer. > > > > > >> > > >> > > >> > 4. It may create other weird social dynamics. E.g. if money is > > now directly coupled to a commit bit, that makes the "who do we > > give commit rights and when" a potentially more loaded question. > > >> > > >> I think this is where the honest self-reporting of time spent > > comes > > >> in. The goal is to increase the average number of maintainer > > hours per > > >> month/year. It's sort of like a crypto-mining pool, but for open > > >> source software maintenance =) Obviously maintainers are > > accountable > > >> to the rest of the core team to behave with integrity > > >> (professionalism, honesty, etc.) or they can be voted to be > > removed if > > >> they are found to be dishonest. > > > > > > > > >> > > >> > > > >> > > >> > And, dividing it into N chunks, the funding becomes nice beer > > money and a thank you for volunteering. Could be exactly what you'd > > prefer as a team. But that's imho more in line with the current > > version of Patreon or GitHub Sponsors rather then with what > > Tidelift is aiming for. > > >> > > > >> > I'd like the idea of "paying for maintenance" if there were > > enough money to employ people. But realistically, that will take > > many years. The Tidelift slogan on this is unrealistic for a > > project like Pandas where maintenance effort is many FTEs; it's > > perhaps feasible for your typical Javascript library that's popular > > but small enough for one person maintaining it part-time. > > >> > > > >> >> > > >> >> It is something I would volunteer to help manage in order to > > learn how other projects might use the same techniques. 
> > >> >> > > >> >> -- Andy > > >> >> > > >> >> On Tue, Jun 11, 2019 at 9:13 AM Wes McKinney < > > wesmckinn at gmail.com> wrote: > > >> >>> > > >> >>> > How you allocate the money to each other is something you > > can debate privately > > >> >>> > > >> >>> On this, I'm sure that you could set up a lightweight > > virtual > > >> >>> "timesheet" so you can put yourselves "on the clock" when > > you're doing > > >> >>> project maintenance work (there are many of these online, I > > just read > > >> >>> about https://www.clockspot.com/ recently) to make time > > reporting a > > >> >>> bit more accurate > > >> >>> > > >> >>> On Tue, Jun 11, 2019 at 9:09 AM Wes McKinney < > > wesmckinn at gmail.com> wrote: > > >> >>> > > > >> >>> > Personally, I would recommend putting most of the money in > > your own > > >> >>> > pockets. The whole idea of Tidelift (as I understand it) > > is for the > > >> >>> > individuals doing work that is of importance to project > > users (to whom > > >> >>> > Tidelift is providing indemnification and "insurance" > > against defects) > > >> > > > >> > > > >> > Actually that's only partially true. Tidelift is paying for > > very specific things, that allow them to do aggregated reporting on > > licensing, dependencies, security vulnerabilities, release streams > > & release docs, etc. - basically the stuff that helps large > > corporations do due diligence and management of a large software > > stack. > > >> > > > >> > It is explicitly out of scope to work on bugs or enhancements > > in the NumFOCUS-Tidelift agreement (and working on particular > > technical items was never their intention). So "insurance against > > defects" isn't part of this, except in a very abstract sense of > > making the project healthier and therefore reducing the risk of it > > being abandoned or a lot more buggy on the many-year time scale. > > >> > > > >> > Cheers, > > >> > Ralf > > >> > > > >> > > > >> >>> > to get paid for their labor. 
So I think the most honest > > way to use the > > >> >>> > money is to put it in your respective bank accounts. If > > you've getting > > >> >>> > a little bit of money to spend on yourself, doesn't that > > make doing > > >> >>> > the maintenance work a bit less thankless? If you don't > > pay > > >> >>> > yourselves, I think it actually "breaks" Tidelift's pitch > > to customers > > >> >>> > which is that open source projects need to have a higher > > fraction of > > >> >>> > compensated maintenance and support work than they do now. > > >> >>> > > > >> >>> > How you allocate the money to each other is something you > > can debate privately > > >> >>> > > > >> >>> > On Tue, Jun 11, 2019 at 8:42 AM Joris Van den Bossche > > >> >>> > wrote: > > >> >>> > > > > >> >>> > > > > >> >>> > > > > >> >>> > > Op di 11 jun. 2019 om 15:31 schreef Ralf Gommers < > > ralf.gommers at gmail.com>: > > >> >>> > >> > > >> >>> > >> > > >> >>> > >> > > >> >>> > >> On Tue, Jun 11, 2019 at 3:03 PM Tom Augspurger < > > tom.augspurger88 at gmail.com> wrote: > > >> >>> > >>> > > >> >>> > >>> > > >> >>> > >>> > > >> >>> > >>> On Tue, Jun 11, 2019 at 7:58 AM William Ayd via > > Pandas-dev wrote: > > >> >>> > >>>> > > >> >>> > >>>> Just some counterpoints to consider: > > >> >>> > >>>> > > >> >>> > >>>> - $ 3,000 a month isn?t really that much, and if it?s > > just a number that a well-funded company chose for us chances are > > they are benefiting from it way more than we are > > >> >>> > >> > > >> >>> > >> > > >> >>> > >> "it's not really that much" is something I don't agree > > with. It doesn't employ someone, but it's enough to pay for things > > like developer meetups, hiring an extra GSoC student if a good one > > happens to come along, paying a web dev for a full redesign of the > > project website, etc. Each of those things is in the $5,000 - > > %15,000 range, and it's _very_ nice to be able to do them without > > having to look for funding first. 
> > >> >>> > >> > > >> >>> > >> Tidelift is a small (now ~25 employees) company by the > > way, and they have a real understanding of the open source > > sustainability issues and seem dedicated to helping fix it. > > >> >>> > >> > > >> >>> > >>>> - There is no such thing as free money; we have to > > consider how to account for and actually manage it (perhaps > > mitigated somewhat by NumFocus) > > >> >>> > >>> > > >> >>> > >>> > > >> >>> > >>> Perhaps Ralph can share how this has gone for NumPy. I > > imagine it's not too work on their end, thanks to NumFOCUS. > > >> >>> > >> > > >> >>> > >> > > >> >>> > >> NumFOCUS handles receiving the money and associated > > admin. As the project you'll be responsible for the setup and > > ongoing tasks. For NumPy and SciPy I have done those tasks. It's a > > fairly minimal amount of work: > > https://github.com/numpy/numpy/pulls?q=is%3Apr+tidelift+is%3Aclosed > > . The main one was dealing with GitHub not recognizing our license, > > and you don't have that issue for Pandas (it's reported correctly > > as BSD-3 in the UI at https://github.com/pandas-dev/pandas). > > >> >>> > >> > > >> >>> > >> So it's probably a day of work for one person, to get > > familiar with the interface, check dependencies, release streams, > > paste in release notes, etc. And then ongoing maybe one or a couple > > of hours a month. So far it's been a much more effective way of > > spending time than, for example, grant writing. > > >> >>> > >> > > >> >>> > >>> > > >> >>> > >>>> > > >> >>> > >>>> - Advertising and ties to a corporate sponsorship may > > weaken the brand of pandas; at that point we may lose some > > creditability as open source volunteers > > >> >>> > >>> > > >> >>> > >>> > > >> >>> > >>> Anecdotally, I don't think that's how the community > > views Tidelift. My perception (from Twitter, blogs / comments) is > > that it's been well received. 
> > >> >>> > >> > > >> >>> > >> > > >> >>> > >> Agree, the feedback I've seen is all quite positive. > > >> >>> > > > > >> >>> > > > > >> >>> > > Additionally, I don't think there is any "advertisement" > > involved, at least not in the classical sense of adding adds for > > third-party companies in a side bar to our website for which we get > > money. Of course we will need to mention Tidelift in some way, e.g. > > in our sponsors / institutional partners section, but we already do > > that for some other companies as well (that employ core devs). > > >> >>> > > > > >> >>> > >> > > >> >>> > >> > > >> >>> > >>> > > >> >>> > >>>> > > >> >>> > >>>> - We don?t (AFAIK) have a plan on how to spend or > > allocate it > > >> >>> > >>>> > > >> >>> > >>>> Not totally against it but perhaps the last point > > above is the main sticking one. Do we have any idea how much we?d > > actually pocket out of the $ 3k they offer us and subsequently what > > we would do with it? Cover travel expenses? Support PyData > > conferences? Scholarships? > > >> >>> > >>> > > >> >>> > >>> > > >> >>> > >>> Agreed that we should set a purpose for this money > > (though, I have no objection to collecting while we set that > > dedicated purpose). > > >> >>> > >> > > >> >>> > >> > > >> >>> > > Indeed we need to discuss this, but I don't think we > > already need to know *exactly* what we want to do with it before > > setting up a contract with Tidelift. It's good for me to alraedy > > start discussing it now, but maybe in a separate thread? > > >> >>> > > > > >> >>> > >> > > >> >>> > >> For NumPy and SciPy we haven't earmarked the funds yet. > > It's nice to build up a buffer first. One thing I'm thinking of is > > that we're participating in Google Season of Docs, and are getting > > more high quality applicants than Google will accept. So we could > > pay one or two tech writers from the funds. 
Our website and high > > level docs (tutorial, restructuring of all docs to guide users > > better) sure could use it:) > > >> >>> > >> > > >> >>> > >> My abstract advice would be: pay for things that > > require money (like a dev meeting) or don't get done for free. > > Don't pay for writing code unless the case is extremely compelling, > > because that'll be a drop in the bucket. > > >> >>> > >> > > >> >>> > >> Cheers, > > >> >>> > >> Ralf > > >> >>> > >> > > >> >>> > >> > > >> >>> > >>> > > >> >>> > >>>> > > >> >>> > >>>> - Will > > >> >>> > >>>> > > >> >>> > >>>> On Jun 11, 2019, at 4:44 AM, Ralf Gommers < > > ralf.gommers at gmail.com> wrote: > > >> >>> > >>>> > > >> >>> > >>>> > > >> >>> > >>>> > > >> >>> > >>>> On Tue, Jun 11, 2019 at 10:15 AM Joris Van den > > Bossche wrote: > > >> >>> > >>>>> > > >> >>> > >>>>> The current page about pandas ( > > https://tidelift.com/lifter/search/pypi/pandas) mentions $3,000 > > dollar a month (but I am not fully sure this is what is already > > available from their current subscribers, or if it is a prospect). > > >> >>> > >>>> > > >> >>> > >>>> > > >> >>> > >>>> It's not just a prospect, that's what you should/will > > get. NumPy and SciPy get the listed amounts too. > > >> >>> > >>>> > > >> >>> > >>>> Agreed that the NumPy amount is not that much. The > > amount gets determined automatically; it's some combination of > > customer interest, dependency analysis and size of the API surface. > > >> >>> > >>>> > > >> >>> > >>>> The current amounts are: > > >> >>> > >>>> NumPy: $1000 > > >> >>> > >>>> SciPy: $2500 > > >> >>> > >>>> Pandas: $3000 > > >> >>> > >>>> Matplotlib: n.a. > > >> >>> > >>>> Scikit-learn: $1500 > > >> >>> > >>>> Scikit-image: $50 > > >> >>> > >>>> Statsmodels: $50 > > >> >>> > >>>> > > >> >>> > >>>> So there's an element of randomness, but the results > > are not completely surprising I think. 
The four libraries that get > > order thousands of dollars are the ones that large corporations are > > going to have the highest interest in. > > >> >>> > >>>> > > >> >>> > >>>> Cheers, > > >> >>> > >>>> Ralf > > >> >>> > >>>> > > >> >>> > >>>>> > > >> >>> > >>>>> > > >> >>> > >>>>> Op za 8 jun. 2019 om 22:54 schreef William Ayd < > > william.ayd at icloud.com>: > > >> >>> > >>>>>> > > >> >>> > >>>>>> What is the minimum amount we are asking for? The > > $1,000 a month for NumPy seems rather low and I thought previous > > emails had something in the range of $3k a month. > > >> >>> > >>>>>> > > >> >>> > >>>>>> I don?t think we necessarily need or would be that > > much improved by $12k per year so would rather aim higher if we are > > going to do this > > >> >>> > >>>>>> > > >> >>> > >>>>>> On Jun 7, 2019, at 12:53 PM, Joris Van den Bossche > > wrote: > > >> >>> > >>>>>> > > >> >>> > >>>>>> Hi all, > > >> >>> > >>>>>> > > >> >>> > >>>>>> We discussed this on the last dev chat, but putting > > it on the mailing list for those who were not present: we are > > planning to contact Tidelift to enter into a sponsor agreement for > > Pandas. > > >> >>> > >>>>>> > > >> >>> > >>>>>> The idea is to follow what NumPy (and recently also > > Scipy) did to have an agreement between Tidelift and NumFOCUS > > instead of an individual maintainer (see their announcement mail: > > https://mail.python.org/pipermail/numpy-discussion/2019-April/079370.html > > ). > > >> >>> > >>>>>> Blog with overview about Tidelift: > > https://blog.tidelift.com/how-to-start-earning-money-for-your-open-source-project-with-tidelift > > . > > >> >>> > >>>>>> > > >> >>> > >>>>>> We didn't discuss yet what to do specifically with > > those funds, that should still be discussed in the future. 
> > >> >>> > >>>>>> > > >> >>> > >>>>>> Cheers, > > >> >>> > >>>>>> Joris > > >> >>> > >>>>>> _______________________________________________ > > >> >>> > >>>>>> Pandas-dev mailing list > > >> >>> > >>>>>> Pandas-dev at python.org > > >> >>> > >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev > > >> >>> > >>>>>> > > >> >>> > >>>>>> > > >> >>> > >>>>> _______________________________________________ > > >> >>> > >>>>> Pandas-dev mailing list > > >> >>> > >>>>> Pandas-dev at python.org > > >> >>> > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev > > >> >>> > >>>> > > >> >>> > >>>> > > >> >>> > >>>> _______________________________________________ > > >> >>> > >>>> Pandas-dev mailing list > > >> >>> > >>>> Pandas-dev at python.org > > >> >>> > >>>> https://mail.python.org/mailman/listinfo/pandas-dev > > >> >>> > >> > > >> >>> > >> _______________________________________________ > > >> >>> > >> Pandas-dev mailing list > > >> >>> > >> Pandas-dev at python.org > > >> >>> > >> https://mail.python.org/mailman/listinfo/pandas-dev > > >> >>> > > > > >> >>> > > _______________________________________________ > > >> >>> > > Pandas-dev mailing list > > >> >>> > > Pandas-dev at python.org > > >> >>> > > https://mail.python.org/mailman/listinfo/pandas-dev > > >> >>> _______________________________________________ > > >> >>> Pandas-dev mailing list > > >> >>> Pandas-dev at python.org > > >> >>> https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From jbrockmendel at gmail.com Thu Jun 13 15:28:29 2019 From: jbrockmendel at gmail.com (Brock Mendel) Date: Thu, 13 Jun 2019 12:28:29 -0700 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: 
References: Message-ID: > I think I was missing a subtle point; We'll still have a mix of 1-D and 2-D blocks under your proposal. What we *won't* have is cases where the Block.shape doesn't match the Block.values.shape? Correct. For curious readers, "mix" here only means that both 1D and 2D blocks will exist. Within a DataFrame you will only ever see 2D blocks. Within a Series you will only ever find a single 1D Block. (at least under the current proposal; Tom's option 1 above would change this) > So assuming we want the laudable goal of Block.{ndim,shape} == Block.values.{ndim,shape}, Following that train of thought (hopefully this helps explain the promised simplifications): - now blocks don't need an ndim attribute or kwarg for the constructor - so their only attributes are mgr_locs and values - and hey, when do they ever use self.mgr_locs? Only when calling their own constructors! These make more sense as a BlockManager attribute anyway. - so... if Block's only attribute is `values`, maybe we can get rid of Block altogether? > 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. > 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just (N, 1) or (1, N)). > > Is that fair? Doing number 1 sounds really bad / difficult, right? Option 1 I haven't really thought about, but my intuition is that it would cause new headaches. I expect this would show up in the reshape/concat code, which I'm not as familiar with. Option 2: allowing (N,1) and (1,N) would give us most of the discussed simplifications (and bugfixes) in Block/BlockManager. Even without the performance considerations, I would consider this a massive win. > And I think it would help with the *regression* in arithmetic performance, right? Since ndarray-backed blocks would be allowed to be (N, P)? Not directly, but I think it would make one of the previously-tried-and-failed approaches less problematic. 
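[Editor's note: a toy rendering of Brock's train of thought, with hypothetical names that deliberately oversimplify the real internals: if a block's shape always equals its values' shape, `ndim` can be dropped, and if `mgr_locs` moves onto the manager, a "block" collapses to just its values array.]

```python
# Toy manager where blocks are bare arrays and column locations live on
# the manager itself -- the endpoint of "maybe we can get rid of Block".
import numpy as np

class SimpleManager:
    def __init__(self, blocks, blk_locs):
        self.blocks = blocks       # plain arrays -- no Block wrapper needed
        self.blk_locs = blk_locs   # per-block column positions

    def column(self, i):
        for values, locs in zip(self.blocks, self.blk_locs):
            if i in locs:
                # 2D blocks are laid out (n_columns, n_rows), as in pandas
                return values[locs.index(i)] if values.ndim == 2 else values
        raise KeyError(i)

ints = np.array([[1, 2], [3, 4]])     # one 2D block holding columns 0 and 1
mask = np.array([True, False])        # a 1D "EA-like" block for column 2
mgr = SimpleManager([ints, mask], [[0, 1], [2]])
print(mgr.column(1))   # [3 4]
print(mgr.column(2))   # [ True False]
```

Note the mix of 1D and 2D blocks here corresponds to the status quo being discussed; under Brock's option 2 the 1D entry would instead be an EA reshaped to (1, N).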
On Thu, Jun 13, 2019 at 11:30 AM Tom Augspurger wrote:
>
> On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel wrote:
>
>> > can you give your thoughts about the idea of having _only_ 1D Blocks?
>> That could also solve a lot of the complexity of the internals (no 1D vs
>> 2D) and have many of the advantages you mentioned in the first email, but I
>> don't think you really answered that aspect.
>>
>> My read from the earlier email was that you were going to present a more
>> detailed proposal for what this would look like. Going all-1D would solve
>> the sometimes-1D-sometimes-2D problem, but I think it would cause real
>> problems for transpose and reductions with axis=1 (I've seen it argued that
>> these are not common use cases). It also wouldn't change the fact that we
>> need BlockManager or something like it to do alignment. Getting array-like
>> operations out of BlockManager/Block and into PandasArray could work with
>> the all-1D idea.
>>
>> > 1. The current structure of [...] They're different enough that EA
>> isn't a drop-in replacement for ndarray.
>> > 2. Arrays being either 1D or 2D causes many issues.
>> > Q1: Do those two issues accurately capture your concerns as well?
>>
>> Pretty much, yes.
>>
>> > Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas
>> be 2D internally (and Series / Index would squeeze before data gets back to
>> the user)? Otherwise, I don't see how we get the internal simplification.
>>
>> My thought was that only EAs backing Blocks inside DataFrames would get
>> reshaped. Everything else would retain their existing dimensions. It
>> isn't clear to me why you'd want 2D backing Index/Series, though I'm open
>> to being convinced.
>
> I think I was missing a subtle point; we'll still have a mix of 1-D and
> 2-D blocks under your proposal. What we *won't* have is cases where the
> Block.shape doesn't match the Block.values.shape?
> ```
> In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2], dtype='Int64')})
>
> In [10]: df._data.blocks
> Out[10]:
> (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64,
>  ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64)
>
> In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape
> Out[11]: True
>
> In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape
> Out[12]: False
> ```
>
> So assuming we want the laudable goal of Block.{ndim,shape} ==
> Block.values.{ndim,shape}, we have two options:
>
> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks.
> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just
> (N, 1) or (1, N)).
>
> Is that fair? Doing number 1 sounds really bad / difficult, right?
>
>> Demonstrating the internal simplification may require a Proof of Concept.
>>
>> > Our contract on EA is 1D and I agree that changing this is not a good
>> idea (at least publicly).
>>
>> Another slightly hacky option would be to secretly allow EAs to be
>> temporarily 2D while backing DataFrame Blocks, but restrict that to just
>> (N, 1) or (1, N). That wouldn't do anything to address the arithmetic
>> performance, but might let us de-kludge the 1D/2D code in core.internals.
>
> I would be curious to see this. And I think it would help with the
> *regression* in arithmetic performance, right? Since ndarray-backed blocks
> would be allowed to be (N, P)?
>
>> On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> On Wed, Jun 12, 2019 at 18:18, Jeff Reback wrote:
>>>
>>>> ...
>>>>
>>>> So here's another proposal (a bit half-baked but....):
>>>>
>>>> You *could* build a single dtyped container that actually holds the 1D
>>>> arrays themselves. Then you could put EA arrays and numpy arrays on the
>>>> same footing. Meaning each 'Block' would be exactly the same.
>>>> - This would make operations the *same* across all 'Blocks', reducing
>>>> complexity
>>>> - We could simply take views on 2D numpy arrays to actually avoid a
>>>> performance penalty of copying (as we can construct from a 2D numpy array
>>>> a lot); this causes some aggregation ops to be much slower than if we
>>>> actually copy, but that has a cost too
>>>> - Ops could be defined on EA & Pandas Arrays; these can then operate
>>>> array-by-array (within a Block), or using numba we could implement the ops
>>>> in a way that we can get a pretty big speedup for a particular kernel
>>>
>>> How would this proposal avoid the above-mentioned performance
>>> implication of doing ops column-by-column?
>>>
>>> In general, I think we should try to do a few basic benchmarks on what
>>> the performance impact would be for some typical use cases when all ops are
>>> done column-by-column / all columns are stored as separate blocks (Jeff had
>>> a branch at some point that made this optional). To have a better idea of
>>> the (dis)advantages for the different proposals.
>>>
>>> Brock, can you give your thoughts about the idea of having _only_ 1D
>>> Blocks? That could also solve a lot of the complexity of the internals (no
>>> 1D vs 2D) and have many of the advantages you mentioned in the first email,
>>> but I don't think you really answered that aspect.
>>>
>>> Joris
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tom.augspurger88 at gmail.com Thu Jun 13 15:48:31 2019
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 13 Jun 2019 14:48:31 -0500
Subject: [Pandas-dev] Arithmetic Proposal
In-Reply-To: References: Message-ID:

OK thanks, I think I understand things better now. IMO, the most promising
line of development is internally reshaping EAs to be (N, 1). No concrete
thoughts on how to do this yet though.

And there's another reason I'd prefer not to have public 2-D EAs yet.
Aside from the potential future block manager simplification, 2+-dimensional
arrays open up a bunch of complexity for EA authors, especially around
indexing, take, and concat. I'd prefer to delay that while we still have
other options on the table that look promising.

From jbrockmendel at gmail.com Wed Jun 19 15:26:09 2019
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 19 Jun 2019 12:26:09 -0700
Subject: [Pandas-dev] Arithmetic Proposal
In-Reply-To: References: Message-ID:

A Proof of Concept is up at
https://github.com/pandas-dev/pandas/pull/26914. A brief overview:

- implement ReshapeMixin for EAs that wrap an ndarray (e.g. DatetimeArray,
PeriodArray, Categorical, CyberPandas, ...). For this type of EA, the
implementation is pretty trivial.
- Patch DatetimeArray.__getitem__ and _box_values
- See how much core.internals simplification becomes feasible
- DatetimeTZBlock can now use the base class implementations for shape,
_slice, copy, iget, and interpolate.
- DatetimeTZBlock becomes a thin wrapper around Block.where for when
Block.where incorrectly casts to object dtype.
- With minor additional edits on DatetimeArray, we could also remove the need for DatetimeTZBlock to override diff, shift, take_nd - If we allow DatetimeTZBlock to hold multiple columns, _unstack could also use the base class implementation. This also turned up existing bugs ( https://github.com/pandas-dev/pandas/issues/26864) that I speculate would be easier to address with less Block/BlockManager complexity. Bottom Line: The relevant array operations are going to be defined for EAs regardless. The question is whether they are going to be defined directly on the EAs (and tested in isolation), or defined by the Blocks (and tested indirectly). I advocate the former. On Thu, Jun 13, 2019 at 12:48 PM Tom Augspurger wrote: > OK thanks, I think I understand things better now. IMO, the most promising > line of development is internally reshaping EAs to be (N, 1). No concrete > thoughts on how to do this yet though. > > And there's another reason I'd prefer not to have public 2-D EAs yet. > Aside from the potential future block manager simplification, 2+-dimensional > arrays open up a bunch of complexity for EA authors, especially around > indexing, take, and concat. I'd prefer to delay that while we still have > other > options on the table that look promising. > > > On Thu, Jun 13, 2019 at 2:28 PM Brock Mendel > wrote: > >> > I think I was missing a subtle point; We'll still have a mix of 1-D and >> 2-D blocks under your proposal. What we *won't* have is cases where the >> Block.shape doesn't match the Block.values.shape? >> >> Correct. For curious readers, "mix" here only means that both 1D Block >> and 2D blocks will exist. Within a DataFrame you will only ever see 2D >> blocks. Within a Series you will only ever find a single 1D Block. 
(at >> least under the current proposal; Tom's option 1 above would change this) >> >> > So assuming we want the laudable goal of Block.{ndim,shape} == >> Block.values.{ndim,shape}, >> >> Following that train of thought (hopefully this helps explain the >> promised simplifications): >> - now blocks don't need an ndim attribute or kwarg for the constructor >> - so their only attributes are mgr_locs and values >> - and hey, when do they ever use self.mgr_locs? Only when calling their >> own constructors! These make more sense as a BlockManager attribute anyway. >> - so... if Block's only attribute is `values`, maybe we can get rid of >> Block altogether? >> >> > 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. >> > 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just >> (N, 1) or (1, N)). >> > >> > Is that fair? Doing number 1 sounds really bad / difficult, right? >> >> Option 1: I haven't really thought about, but my intuition is that it >> would cause new headaches. I expect this would show up in the >> reshape/concat code, which I'm not as familiar with. >> >> Option 2: allowing (N,1) and (1,N) would give us most of the discussed >> simplifications (and bugfixes) in Block/BlockManager. Even without the >> performance considerations, I would consider this a massive win. >> >> > And I think it would help with the *regression* in arithmetic >> performance, right? Since ndarray-backed blocks would be allowed to be (N, >> P)? >> >> Not directly, but I think it would make one of the >> previously-tried-and-failed approaches less problematic. >> >> >> On Thu, Jun 13, 2019 at 11:30 AM Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> >>> >>> >>> On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel >>> wrote: >>> >>>> > can you given your thoughts about the idea of having _only_ 1D >>>> Blocks? 
That could also solve a lot of the complexity of the internals (no >>>> 1D vs 2D) and have many of the advantages you mentioned in the first email, >>>> but I don't think you really answered that aspect. >>>> >>>> My read from the earlier email was that you were going to present a >>>> more detailed proposal for what this would look like. Going all-1D would >>>> solve the sometimes-1D-sometimes-2D problem, but I think it would cause >>>> real problems for transpose and reductions with axis=1 (I've seen it argued >>>> that these are not common use cases). It also wouldn't change the fact >>>> that we need BlockManager or something like it to do alignment. Getting >>>> array-like operations out of BlockManager/Block and into PandasArray could >>>> work with the all-1D idea. >>>> >>>> > 1. The current structure of [...] They're different enough that EA >>>> isn't a drop-in replacement for ndarray. >>>> > 2. Arrays being either 1D or 2D causes many issues. >>>> > Q1: Do those two issues accurately capture your concerns as well? >>>> >>>> Pretty much, yes. >>>> >>>> > Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas >>>> be 2D internally (and Series / Index would squeeze before data gets back to >>>> the user)? Otherwise, I don't see how we get the internal simplification. >>>> >>>> My thought was that only EAs backing Blocks inside DataFrames would get >>>> reshaped. Everything else would retain their existing dimensions. It >>>> isn't clear to me why you'd want 2D backing Index/Series, though I'm open >>>> to being convinced. >>>> >>> >>> I think I was missing a subtle point; We'll still have a mix of 1-D and >>> 2-D blocks under your proposal. What we *won't* have is cases where the >>> Block.shape doesn't match the Block.values.shape? 
>>> >>> ``` >>> In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2], >>> dtype='Int64')}) >>> >>> In [10]: df._data.blocks >>> Out[10]: >>> (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64, >>> ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64) >>> >>> In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape >>> Out[11]: True >>> >>> In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape >>> Out[12]: False >>> >>> ``` >>> >>> So assuming we want the laudable goal of Block.{ndim,shape} == >>> Block.values.{ndim,shape}, we have two options >>> >>> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. >>> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just >>> (N, 1) or (1, N)). >>> >>> Is that fair? Doing number 1 sounds really bad / difficult, right? >>> >>> Demonstrating the internal simplification may require a Proof of Concept. >>>> >>>> > Our contract on EA is 1D and I agree that changing this is not a good >>>> idea (at least publicly). >>>> >>>> Another slightly hacky option would be to secretly allow EAs to be >>>> temporarily 2D while backing DataFrame Blocks, but restrict that to just >>>> (N, 1) or (1, N). That wouldn't do anything to address the arithmetic >>>> performance, but might let us de-kludge the 1D/2D code in core.internals. >>>> >>> >>> I would be curious to see this. And I think it would help with the >>> *regression* in arithmetic performance, right? Since ndarray-backed blocks >>> would be allowed to be (N, P)? >>> >>> >>>> On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche < >>>> jorisvandenbossche at gmail.com> wrote: >>>> >>>>> Op wo 12 jun. 2019 om 18:18 schreef Jeff Reback >>>> >: >>>>> >>>>>> ... >>>>>> >>>>> So here's another proposal (a bit half-baked but....): >>>>>> >>>>>> You *could* build a single dtyped container that actually holds the >>>>>> 1D arrays themselves). Then you could put EA arrays and numpy arrays on the >>>>>> same footing. 
Meaning each >>>>>> 'Block' would be exactly the same. >>>>>> >>>>>> - This would make operations the *same* across all 'Blocks', reducing >>>>>> complexity >>>>>> - We could simply take views on 2D numpy arrays to actually avoid a >>>>>> performance penaltly of copying (as we can construct from a 2D numpy array >>>>>> a lot); this causes some aggregation ops to be much slower that if we >>>>>> actually copy, but that has a cost too >>>>>> - Ops could be defined on EA & Pandas Arrays; these can then operate >>>>>> array-by-array (within a Block), or using numba we could implement the ops >>>>>> in a way that we can get a pretty big speedup for a particular kernel >>>>>> >>>>> >>>>> How would this proposal avoid the above-mentioned performance >>>>> implication of doing ops column-by-column? >>>>> >>>>> In general, I think we should try to do a few basic benchmarks on what >>>>> the performance impact would be for some typical use cases when all ops are >>>>> done column-by-column / all columns are stored as separate blocks (Jeff had >>>>> a branch at some point that made this optional). To have a better idea of >>>>> the (dis)advantages for the different proposals. >>>>> >>>>> Brock, can you given your thoughts about the idea of having _only_ 1D >>>>> Blocks? That could also solve a lot of the complexity of the internals (no >>>>> 1D vs 2D) and have many of the advantages you mentioned in the first email, >>>>> but I don't think you really answered that aspect. >>>>> >>>>> Joris >>>>> >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom.augspurger88 at gmail.com Wed Jun 19 21:41:40 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Wed, 19 Jun 2019 20:41:40 -0500 Subject: [Pandas-dev] Pandas dev meeting June 20th Message-ID: Hi all, The next pandas dev meeting is June 20th (tomorrow) at 17:00 UTC / 12:00 central. All are welcome to join. Minutes: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing Hangout: meet.google.com/xbh-thar-tzw Tom -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Jun 20 11:42:49 2019 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 20 Jun 2019 17:42:49 +0200 Subject: [Pandas-dev] Arithmetic Proposal In-Reply-To: References: Message-ID: Thanks Brock for the update and POC. I certainly don't fully understand yet all details and consequences for the EA of the proposal (given the discussion on github ..), but I have the feeling that this is moving complexity (to deal with both 1D/2D) from our internals / BlockManager to the ExtensionArray. Or, if we still want to allow EAs that are strictly 1D: moving the complexity to the ExtensionBlock (which is of course a more focused part of the internals (possibly a plus), but that also means that internal and external 1D EAs start to deviate more). I am not sure that I find that a net improvement. I rather keep some complexity concentrated in our internals, than exposing that complexity on the ExtensionArrays. But maybe I don't work enough in the internals code to really understand the problem this is trying to solve. Repeating myself from earlier in this thread: if we want to put considerable effort in refactoring the internals, I think we should seriously consider other options. Joris Op wo 19 jun. 2019 om 21:26 schreef Brock Mendel : > A Proof of Concept is up at > https://github.com/panhttps://issues.apache.org/jira/browse/ARROW-5665das-dev/pandas/pull/26914 > . 
A brief overview: > > - implement ReshapeMixin for EAs that wrap an ndarray (e.g. DatetimeArray, > PeriodArray, Categorical, CyberPandas, ...). For this type of EAs, the > implementation is pretty trivial. > - Patch DatetimeArray.__getitem__ and _box_values > - See how much core.internals simplification becomes feasible > - DatetimeTZBlock can now use the base class implementations for shape, > _slice, copy, iget, and interpolate. > - DatetimeTZBlock becomes a thin wrapper around Block.where for when > Block.where incorrectly casts to object dtype. > - With minor additional edits on DatetimeArray, we could also remove > the need for DatetimeTZBlock to override diff, shift, take_nd > - If we allow DatetimeTZBlock to hold multiple columns, _unstack could > also use the base class implementation. > > This also turned up existing bugs ( > https://github.com/pandas-dev/pandas/issues/26864) that I speculate would > be easier to address with less Block/BlockManager complexity. > > Bottom Line: The relevant array operations are going to be defined for EAs > regardless. The question is whether they are going to be defined directly > on the EAs (and tested in isolation), or defined by the Blocks (and tested > indirectly). I advocate the former. > > On Thu, Jun 13, 2019 at 12:48 PM Tom Augspurger < > tom.augspurger88 at gmail.com> wrote: > >> OK thanks, I think I understand things better now. IMO, the most >> promising line of development is internally reshaping EAs to be (N, 1). No >> concrete >> thoughts on how to do this yet though. >> >> And there's another reason I'd prefer not to have public 2-D EAs yet. >> Aside from the potential future block manager simplification, 2+-dimensional >> arrays open up a bunch of complexity for EA authors, especially around >> indexing, take, and concat. I'd prefer to delay that while we still have >> other >> options on the table that look promising. 
>> >> >> On Thu, Jun 13, 2019 at 2:28 PM Brock Mendel >> wrote: >> >>> > I think I was missing a subtle point; We'll still have a mix of 1-D >>> and 2-D blocks under your proposal. What we *won't* have is cases where the >>> Block.shape doesn't match the Block.values.shape? >>> >>> Correct. For curious readers, "mix" here only means that both 1D Block >>> and 2D blocks will exist. Within a DataFrame you will only ever see 2D >>> blocks. Within a Series you will only ever find a single 1D Block. (at >>> least under the current proposal; Tom's option 1 above would change this) >>> >>> > So assuming we want the laudable goal of Block.{ndim,shape} == >>> Block.values.{ndim,shape}, >>> >>> Following that train of thought (hopefully this helps explain the >>> promised simplifications): >>> - now blocks don't need an ndim attribute or kwarg for the constructor >>> - so their only attributes are mgr_locs and values >>> - and hey, when do they ever use self.mgr_locs? Only when calling >>> their own constructors! These make more sense as a BlockManager attribute >>> anyway. >>> - so... if Block's only attribute is `values`, maybe we can get rid of >>> Block altogether? >>> >>> > 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. >>> > 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just >>> (N, 1) or (1, N)). >>> > >>> > Is that fair? Doing number 1 sounds really bad / difficult, right? >>> >>> Option 1: I haven't really thought about, but my intuition is that it >>> would cause new headaches. I expect this would show up in the >>> reshape/concat code, which I'm not as familiar with. >>> >>> Option 2: allowing (N,1) and (1,N) would give us most of the discussed >>> simplifications (and bugfixes) in Block/BlockManager. Even without the >>> performance considerations, I would consider this a massive win. >>> >>> > And I think it would help with the *regression* in arithmetic >>> performance, right? 
Since ndarray-backed blocks would be allowed to be (N, >>> P)? >>> >>> Not directly, but I think it would make one of the >>> previously-tried-and-failed approaches less problematic. >>> >>> >>> On Thu, Jun 13, 2019 at 11:30 AM Tom Augspurger < >>> tom.augspurger88 at gmail.com> wrote: >>> >>>> >>>> >>>> On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel >>>> wrote: >>>> >>>>> > can you given your thoughts about the idea of having _only_ 1D >>>>> Blocks? That could also solve a lot of the complexity of the internals (no >>>>> 1D vs 2D) and have many of the advantages you mentioned in the first email, >>>>> but I don't think you really answered that aspect. >>>>> >>>>> My read from the earlier email was that you were going to present a >>>>> more detailed proposal for what this would look like. Going all-1D would >>>>> solve the sometimes-1D-sometimes-2D problem, but I think it would cause >>>>> real problems for transpose and reductions with axis=1 (I've seen it argued >>>>> that these are not common use cases). It also wouldn't change the fact >>>>> that we need BlockManager or something like it to do alignment. Getting >>>>> array-like operations out of BlockManager/Block and into PandasArray could >>>>> work with the all-1D idea. >>>>> >>>>> > 1. The current structure of [...] They're different enough that EA >>>>> isn't a drop-in replacement for ndarray. >>>>> > 2. Arrays being either 1D or 2D causes many issues. >>>>> > Q1: Do those two issues accurately capture your concerns as well? >>>>> >>>>> Pretty much, yes. >>>>> >>>>> > Q2: Can you clarify: with 2D EAs would *all* EAs stored within >>>>> pandas be 2D internally (and Series / Index would squeeze before data gets >>>>> back to the user)? Otherwise, I don't see how we get the internal >>>>> simplification. >>>>> >>>>> My thought was that only EAs backing Blocks inside DataFrames would >>>>> get reshaped. Everything else would retain their existing dimensions. 
It >>>>> isn't clear to me why you'd want 2D backing Index/Series, though I'm open >>>>> to being convinced. >>>>> >>>> >>>> I think I was missing a subtle point; We'll still have a mix of 1-D and >>>> 2-D blocks under your proposal. What we *won't* have is cases where the >>>> Block.shape doesn't match the Block.values.shape? >>>> >>>> ``` >>>> In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2], >>>> dtype='Int64')}) >>>> >>>> In [10]: df._data.blocks >>>> Out[10]: >>>> (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64, >>>> ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64) >>>> >>>> In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape >>>> Out[11]: True >>>> >>>> In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape >>>> Out[12]: False >>>> >>>> ``` >>>> >>>> So assuming we want the laudable goal of Block.{ndim,shape} == >>>> Block.values.{ndim,shape}, we have two options >>>> >>>> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks. >>>> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just >>>> (N, 1) or (1, N)). >>>> >>>> Is that fair? Doing number 1 sounds really bad / difficult, right? >>>> >>>> Demonstrating the internal simplification may require a Proof of >>>>> Concept. >>>>> >>>>> > Our contract on EA is 1D and I agree that changing this is not a >>>>> good idea (at least publicly). >>>>> >>>>> Another slightly hacky option would be to secretly allow EAs to be >>>>> temporarily 2D while backing DataFrame Blocks, but restrict that to just >>>>> (N, 1) or (1, N). That wouldn't do anything to address the arithmetic >>>>> performance, but might let us de-kludge the 1D/2D code in core.internals. >>>>> >>>> >>>> I would be curious to see this. And I think it would help with the >>>> *regression* in arithmetic performance, right? Since ndarray-backed blocks >>>> would be allowed to be (N, P)? 
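The two options above can be made concrete with plain NumPy and the public `pd.array` API (a sketch for orientation only — it does not touch the real Block internals being discussed):

```python
import numpy as np
import pandas as pd

# Today's public contract: extension arrays are one-dimensional.
ea = pd.array([1, 2, 3], dtype="Int64")
print(ea.ndim)  # 1

# NumPy-backed block values already live as 2D arrays, so a single
# column is carried as shape (1, N) inside a DataFrame.
np_vals = np.array([1, 2, 3])
block_like = np_vals.reshape(1, -1)  # a (1, 3) view, no copy
print(block_like.shape)  # (1, 3)

# Option 2 amounts to letting EA-backed values take the same
# (1, N) / (N, 1) shape internally, so Block.shape == values.shape.
```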
>>>> >>>> >>>>> On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche < >>>>> jorisvandenbossche at gmail.com> wrote: >>>>> >>>>>> Op wo 12 jun. 2019 om 18:18 schreef Jeff Reback >>>>> >: >>>>>> >>>>>>> ... >>>>>>> >>>>>> So here's another proposal (a bit half-baked but....): >>>>>>> >>>>>>> You *could* build a single dtyped container (that actually holds the >>>>>>> 1D arrays themselves). Then you could put EA arrays and numpy arrays on the >>>>>>> same footing. Meaning each >>>>>>> 'Block' would be exactly the same. >>>>>>> >>>>>>> - This would make operations the *same* across all 'Blocks', >>>>>>> reducing complexity >>>>>>> - We could simply take views on 2D numpy arrays to actually avoid a >>>>>>> performance penalty of copying (as we can construct from a 2D numpy array >>>>>>> a lot); this causes some aggregation ops to be much slower than if we >>>>>>> actually copy, but that has a cost too >>>>>>> - Ops could be defined on EA & Pandas Arrays; these can then operate >>>>>>> array-by-array (within a Block), or using numba we could implement the ops >>>>>>> in a way that we can get a pretty big speedup for a particular kernel >>>>>>> >>>>>> >>>>>> How would this proposal avoid the above-mentioned performance >>>>>> implication of doing ops column-by-column? >>>>>> >>>>>> In general, I think we should try to do a few basic benchmarks on >>>>>> what the performance impact would be for some typical use cases when all >>>>>> ops are done column-by-column / all columns are stored as separate blocks >>>>>> (Jeff had a branch at some point that made this optional). To have a better >>>>>> idea of the (dis)advantages for the different proposals. >>>>>> >>>>>> Brock, can you give your thoughts about the idea of having _only_ 1D >>>>>> Blocks? That could also solve a lot of the complexity of the internals (no >>>>>> 1D vs 2D) and have many of the advantages you mentioned in the first email, >>>>>> but I don't think you really answered that aspect. 
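The kind of basic benchmark suggested here can be sketched with plain NumPy (illustrative only — it isolates the column-by-column iteration cost and ignores the alignment and dispatch overhead real blocks add on top):

```python
import timeit

import numpy as np

# The same data held as one consolidated 2D block vs. twenty 1D columns.
data = np.random.rand(100_000, 20)
columns = [data[:, i].copy() for i in range(data.shape[1])]

blocked = timeit.timeit(lambda: data.sum(axis=0), number=50)
columnwise = timeit.timeit(
    lambda: np.array([col.sum() for col in columns]), number=50
)
print(f"2D block:         {blocked:.4f}s")
print(f"column-by-column: {columnwise:.4f}s")
```

Both paths produce identical results; the interesting number is the ratio, which grows as columns get shorter and more numerous.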
>>>>>> Joris
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pandas-dev mailing list
>>>>>> Pandas-dev at python.org
>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev

From vlb at cfcl.com Sun Jun 23 22:54:23 2019 From: vlb at cfcl.com (Vicki Brown) Date: Sun, 23 Jun 2019 19:54:23 -0700 Subject: [Pandas-dev] Unexpected error using pd.read_json with chunksize Message-ID: <804E8B36-7808-4F2E-BE66-F8CE7F922C00@cfcl.com>

TL;DR

I am getting an error I don't understand from my pd.read_json reader:

ValueError: Unexpected character found when decoding 'false'

Details

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line (and 6 million lines! ;-).

Because the complete dataset file is enormous, the tutorial recommends using chunksize and building a json reader. When I try this, I get an error, even with my test input.

I am befuddled.

TESTS:

I have smaller versions of the JSON file, with 100, 3, and 1 records, for testing. I can read all of the test files into a pandas dataframe and examine them, e.g.:

path = 'file://localhost/Users/vlb/Learn/DSC_Intro/'
filename = path + 'yelp_dataset/review_test.json'

# read the entire file
reviews = pd.read_json(filename, lines=True)

I cannot read any of the test files in chunks. I get the same error.

CODE:

My code currently looks like this:

path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_test.json'

# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=1)

`type(review_reader)` returns pandas.io.json.json.JsonReader as expected. 
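One likely culprit — a guess, not something confirmed at this point in the thread: `StringIO(filename)` wraps the path *string* rather than the file's contents, so the JSON parser is handed the literal text `file://...` and trips over the leading `f`, which it reads as the start of `false`. A self-contained sketch shows the chunked reader working when it is given actual JSON lines:

```python
import pandas as pd
from io import StringIO

# Two line-delimited JSON records standing in for review_test.json.
json_lines = (
    '{"stars": 5.0, "useful": 1}\n'
    '{"stars": 4.0, "useful": 0}\n'
)

# Passing real content (or a bare file path) works; wrapping the *path*
# in StringIO feeds the path text itself to the parser.
review_reader = pd.read_json(StringIO(json_lines), lines=True, chunksize=1)
chunks = list(review_reader)
print(len(chunks))  # 2
print(chunks[0])
```

If that is indeed the problem, the minimal fix would be `pd.read_json(filename, lines=True, chunksize=1)` — the path passed directly, exactly as in the working non-chunked call above.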
However

for chunk in review_reader:
    print(chunk)

throws an error on all test files.

/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pandas/io/json/json.py in _parse_no_numpy(self)
    869         if orient == "columns":
    870             self.obj = DataFrame(
--> 871                 loads(json, precise_float=self.precise_float), dtype=None)
    872         elif orient == "split":
    873             decoded = {str(k): v for k, v in compat.iteritems(

ValueError: Unexpected character found when decoding 'false'

INPUT: Sample JSON

{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}

references:
https://www.yelp.com/dataset/documentation/main
https://courses.springboard.com/courses/448797

From william.ayd at icloud.com Mon Jun 24 09:11:01 2019 From: william.ayd at icloud.com (William Ayd) Date: Mon, 24 Jun 2019 09:11:01 -0400 Subject: [Pandas-dev] Unexpected error using pd.read_json with chunksize In-Reply-To: <804E8B36-7808-4F2E-BE66-F8CE7F922C00@cfcl.com> References: <804E8B36-7808-4F2E-BE66-F8CE7F922C00@cfcl.com> Message-ID:

Can you determine which line in the file is throwing the error?

> On Jun 23, 2019, at 10:54 PM, Vicki Brown wrote:
>
> TL;DR
>
> I am getting an error I don't understand from my pd.read_json reader:
>
> ValueError: Unexpected character found when decoding 'false'

From vlb at cfcl.com Mon Jun 24 12:35:43 2019 From: vlb at cfcl.com (Vicki Brown) Date: Mon, 24 Jun 2019 09:35:43 -0700 Subject: [Pandas-dev] Unexpected error using pd.read_json with chunksize In-Reply-To: References: Message-ID: <35E1F112-393B-4642-9F53-607AD87CEE72@cfcl.com>

On Mon, 24 Jun 2019 09:11:01 -0400 William Ayd wrote
>> Can you determine which line in the file is throwing the error?

It doesn't seem to matter. I've pulled all three of the records in my 3-record test file out into individual files. They each fail the same way.

I also tried sending one of those records directly through json.loads

import json
json_string = """
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
"""
data = json.loads(json_string)
type(data)

where it works as expected. (I don't have the precise_float argument set. Perhaps that's failing?)

- Vicki

From garcia.marc at gmail.com Mon Jun 24 13:30:58 2019 From: garcia.marc at gmail.com (Marc Garcia) Date: Mon, 24 Jun 2019 18:30:58 +0100 Subject: [Pandas-dev] Unexpected error using pd.read_json with chunksize In-Reply-To: <35E1F112-393B-4642-9F53-607AD87CEE72@cfcl.com> References: <35E1F112-393B-4642-9F53-607AD87CEE72@cfcl.com> Message-ID:

Probably better if you open an issue in GitHub with an exact example that fails, and we'll investigate.

On Mon, 24 Jun 2019, 17:36 Vicki Brown, wrote:
> It doesn't seem to matter. I've pulled all three of the records in my
> 3-record test file out into individual files. They each fail the same way.

From vlb at cfcl.com Mon Jun 24 14:08:50 2019 From: vlb at cfcl.com (Vicki Brown) Date: Mon, 24 Jun 2019 11:08:50 -0700 Subject: [Pandas-dev] Unexpected error using pd.read_json with chunksize In-Reply-To: References: <35E1F112-393B-4642-9F53-607AD87CEE72@cfcl.com> Message-ID:

https://github.com/pandas-dev/pandas/issues/27022

> On Jun 24, 2019, at 10:30 , Marc Garcia wrote:
>
> Probably better if you open an issue in GitHub with an exact example that fails, and we'll investigate.

From pmlandwehr at gmail.com Wed Jun 26 12:23:47 2019 From: pmlandwehr at gmail.com (Pete[r] Landwehr) Date: Wed, 26 Jun 2019 09:23:47 -0700 Subject: [Pandas-dev] Old Version Number On The Website? 
Message-ID:

Hey all,

My apologies if this was covered in the last dev meeting or is simply the way the website has always been & I've just failed to notice, but the central release notes column on pandas.pydata.org appears to be one major version behind the current release.

The "Versions" column shows the current release as 0.24.2 and the development release as 0.25.0, but the central column with change log notes shows the latest release as v0.23.4 Final (August 3, 2018).

Is there a way to get the website updated? It's a bit confusing.

Best,

pml

From tom.augspurger88 at gmail.com Wed Jun 26 12:41:41 2019 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Wed, 26 Jun 2019 11:41:41 -0500 Subject: [Pandas-dev] Old Version Number On The Website? In-Reply-To: References: Message-ID:

Fixed in https://github.com/pandas-dev/pandas-website/pull/73. Thanks.

On Wed, Jun 26, 2019 at 11:24 AM Pete[r] Landwehr wrote:
> Is there a way to get the website updated? It's a bit confusing.

From mahak at fossa.com Thu Jun 27 17:39:45 2019 From: mahak at fossa.com (Mahak Bandi) Date: Thu, 27 Jun 2019 14:39:45 -0700 Subject: [Pandas-dev] Pandas - FOSSA Badge Message-ID:

Hey Pandas Team,

One of our engineers stumbled upon your project, Pandas, on Github and posted it to our Slack! I think it's an awesome project. As you finish up development, I thought it would be great to offer up FOSSA for free to your project.

We ship a tool that tracks your dependencies at each commit and scans them for licenses and copyright violations. We can block PR's that bring in bad dependencies, auto-generate attribution/NOTICE files, and more. We work with a lot of open source projects that you may have heard of (ESLint, Mocha, Webpack/Grunt, Moment, Prometheus, etc.). We also just released a new badge that a couple of other popular projects like Kubernetes and Docker are getting on (screenshots below):

[image: FOSSA_badge_example.png]
[image: FOSSA_badge_example_2.png]

If you're interested, I'd love to help you get something in place. Feel free to import Pandas at http://fossa.com and I can upgrade your account for free and follow up with some PRs!

Best,

Mahak Bandi