From jbrockmendel at gmail.com  Wed Jul  8 15:15:55 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 8 Jul 2020 12:15:55 -0700
Subject: [Pandas-dev] July Call Follow-Up
Message-ID: 

Would keeping pyarrow optional allow us to bump the minimum version more
aggressively than if it becomes required?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jeffreback at gmail.com  Wed Jul  8 20:22:14 2020
From: jeffreback at gmail.com (Jeff Reback)
Date: Wed, 8 Jul 2020 20:22:14 -0400
Subject: [Pandas-dev] July Call Follow-Up
In-Reply-To: 
References: 
Message-ID: 

yep

optional actually allows us to have different version mins as well

e.g. parquet vs csv reader vs string could all be different (though
possibly confusing)

> On Jul 8, 2020, at 3:16 PM, Brock Mendel wrote:
>
> Would keeping pyarrow optional allow us to bump the minimum version more aggressively than if it becomes required?
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From mail at uwekorn.com  Thu Jul  9 03:09:21 2020
From: mail at uwekorn.com (Uwe L. Korn)
Date: Thu, 09 Jul 2020 09:09:21 +0200
Subject: [Pandas-dev] July Call Follow-Up
In-Reply-To: 
References: 
Message-ID: 

Definitely: we will require the latest Arrow release for quite some time
for strings. I guess we will keep bumping the min version for that
continuously, at least until the end of the year.

On Thu, Jul 9, 2020, at 2:22 AM, Jeff Reback wrote:
> yep
>
> optional actually allows us to have different version mins as well
>
> eg parquet vs csv reader can string could all be different (though
> possibly confusing)
>
> > On Jul 8, 2020, at 3:16 PM, Brock Mendel wrote:
> >
> > Would keeping pyarrow optional allow us to bump the minimum version more aggressively than if it becomes required?
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com  Thu Jul  9 07:29:24 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 9 Jul 2020 13:29:24 +0200
Subject: [Pandas-dev] Taking pyarrow as a dependency for pandas [Was: July Call Follow-Up]
In-Reply-To: 
References: 
Message-ID: 

Hi all,

Let me try to clarify my hesitation about deciding now to require Arrow
for the "string" dtype (instead of using it optionally, with a fallback
on the current Python object-based implementation of StringDtype).

One of the goals of the string dtype, apart from potential speed-ups
using arrow in the future, is also better usability: getting rid of the
confusing "object" dtype for something as simple as strings (a use case
many newcomers will see in the first example). And I think that the idea
is to make this new string dtype the new default in case you have only
string values, somewhere in the relatively (but undecided) near future.
For example, let's say we do a 2.0 release next year using string dtype
as default for string columns.

Requiring Arrow for the string dtype would thus basically mean requiring
Arrow for pandas in general, *if* we keep the plan to make this the
default. And starting to use Arrow much more in pandas and adding it as
a required dependency is certainly a discussion we should have, but it's
also a much bigger discussion than just the string dtype (how easy is it
nowadays to install (including source installations), platform support,
install size increase, minimum required dependency and possible
conflicts with other packages/systems, ...).
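As a minimal illustration of the dtype distinction discussed above (a sketch assuming pandas >= 1.0, which introduced the opt-in StringDtype):

```python
import pandas as pd

# Without opting in, string data lands in the catch-all "object" dtype,
# which is the newcomer-confusing behavior described above.
s_obj = pd.Series(["a", "b", "c"])
assert s_obj.dtype == object

# The opt-in string dtype gives strings a dedicated dtype; the current
# implementation stores Python objects under the hood, and an
# Arrow-backed version is what this thread is discussing.
s_str = pd.Series(["a", "b", "c"], dtype="string")
assert s_str.dtype.name == "string"
```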
At the same time: I think it is rather easy to, at least in the short
term, start experimenting with an Arrow-backed string dtype without
requiring Arrow for the "string" dtype in general (we already have the
Python code for it; we can keep that side by side for now). But we can
discuss the details about this on the issue
(https://github.com/pandas-dev/pandas/issues/35169).

So to summarize: we should discuss this, but I think we should frame the
question not as "require Arrow for an experimental, opt-in dtype", but
as "Arrow as a required dependency for pandas". And given that this is a
larger discussion: let's treat it for now as a separate discussion from
advancing the arrow-backed string dtype.

Joris

On Thu, 9 Jul 2020 at 09:12, Uwe L. Korn wrote:
>
> Definitely, we will require the latest Arrow release quite some time for strings. I guess we will keep bumping the min version for that for at least until the end of the year continuously.
>
> On Thu, Jul 9, 2020, at 2:22 AM, Jeff Reback wrote:
> > yep
> >
> > optional actually allows us to have different version mins as well
> >
> > eg parquet vs csv reader can string could all be different (though
> > possibly confusing)
> >
> > > On Jul 8, 2020, at 3:16 PM, Brock Mendel wrote:
> > >
> > > Would keeping pyarrow optional allow us to bump the minimum version more aggressively than if it becomes required?
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev
> >
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com  Fri Jul 10 05:49:13 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 10 Jul 2020 11:49:13 +0200
Subject: [Pandas-dev] ROADMAP: add consistent missing values for all dtypes to the roadmap
Message-ID: 

Hi all,

This is a heads up to the mailing list that I opened a PR with a roadmap
addition: https://github.com/pandas-dev/pandas/pull/35208

The PR proposes to add the long term goal of consistent missing value
handling for all dtypes (started with the discussion and implementations
around pd.NA / nullable dtypes) to the roadmap.

Feedback welcome!

Best,
Joris

From tom.augspurger88 at gmail.com  Sun Jul 19 14:50:17 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Sun, 19 Jul 2020 13:50:17 -0500
Subject: [Pandas-dev] ANN: Pandas 1.1.0rc0 released
Message-ID: 

Hi all,

I'm pleased to announce that pandas 1.1.0rc0 is now available for testing.
This is the first release candidate for pandas 1.1.0.

The release can be installed from conda-forge

conda create -n pandas-1.1.0rc0 -c conda-forge/label/pandas_rc -c conda-forge pandas=1.1.0rc0

Or from PyPI

python -m pip install --pre pandas==1.1.0rc0

The release notes are available at
https://pandas.pydata.org/pandas-docs/version/1.1.0/whatsnew/v1.1.0.html.

Please report any issues with the release on the pandas issue tracker
https://github.com/pandas-dev/pandas/issues! We plan to release 1.1.0 in
the next few weeks.
Thanks,

Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom.augspurger88 at gmail.com  Sun Jul 19 14:53:34 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Sun, 19 Jul 2020 13:53:34 -0500
Subject: [Pandas-dev] ANN: Pandas 1.1.0rc0 released
In-Reply-To: 
References: 
Message-ID: 

FYI, at least on the archive at
https://mail.python.org/pipermail/pandas-dev/2020-July/001267.html, the
link to the whatsnew looks broken since the period is included in the
link target. The release notes are at
https://pandas.pydata.org/pandas-docs/version/1.1.0/whatsnew/v1.1.0.html

On Sun, Jul 19, 2020 at 1:50 PM Tom Augspurger wrote:
> Hi all,
>
> I'm pleased to announce that pandas 1.1.0rc0 is now available for testing.
> This is the first release candidate for pandas 1.1.0.
>
> The release can be installed from conda-forge
>
> conda create -n pandas-1.1.0rc0 -c conda-forge/label/pandas_rc -c
> conda-forge pandas=1.1.0rc0
>
> Or from PyPI
>
> python -m pip install --pre pandas==1.1.0rc0
>
> The release notes are available at
> https://pandas.pydata.org/pandas-docs/version/1.1.0/whatsnew/v1.1.0.html.
>
> Please report any issues with the release on the pandas issue tracker
> https://github.com/pandas-dev/pandas/issues! We plan to release 1.1.0 in
> the next few weeks.
>
> Thanks,
>
> Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom.augspurger88 at gmail.com  Tue Jul 28 14:09:51 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 28 Jul 2020 13:09:51 -0500
Subject: [Pandas-dev] ANN: Pandas 1.1.0 released
Message-ID: 

Hi all,

I'm pleased to announce the release of pandas 1.1.0. This is a minor
release which includes some new features, bug fixes, and performance
improvements. We recommend that all users upgrade to this version. See
the whatsnew for a list of all the changes.
The release can be installed from PyPI

python -m pip install --upgrade pandas==1.1.0

Or from conda-forge

conda install -c conda-forge pandas==1.1.0

Thanks to all of the 368 contributors who made this release possible.

Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fuller.evan at gmail.com  Tue Jul 28 16:27:29 2020
From: fuller.evan at gmail.com (fuller.evan at gmail.com)
Date: Tue, 28 Jul 2020 16:27:29 -0400
Subject: [Pandas-dev] Help speeding up altered groupby.value_counts
Message-ID: <00f101d6651d$84738af0$8d5aa0d0$@gmail.com>

All,

I'm a new contributor to pandas and have been working to fix a couple of
bugs with the value_counts methods (pull request
https://github.com/pandas-dev/pandas/pull/33652). I'm looking for a bit
of help in maintaining speedy performance for
https://github.com/DataInformer/pandas-1/blob/value_counts_normalize/pandas/core/groupby/generic.py

The SeriesGroupBy.value_counts method required a significant rewrite in
order to achieve correct behavior with dropna and normalize. After
fixing that, I was asked to run performance tests, which unfortunately
do show a significant performance hit for that method. I have been
looking at how to close that gap as much as possible, but I've found
only a few minor tweaks.

When I run cProfile, I don't notice any clear offenders: numpy array
functions are taking a lot of time in total
(numpy.core._multiarray_umath.implement_array_function), but I don't see
any particular functions that are slow. Similarly, timeit experiments
suggest that array concatenation is relatively slow, but not much
different from alternatives like using np.diff's append argument (e.g.
whether I write np.diff(np.nonzero(np.r_[changes, True])) or
np.diff(np.nonzero(changes), append=len(changes)), there's not much of a
timing difference). I have tried to do as little as possible with the
MultiIndex, rebuilding it at the end.
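For anyone following along, the two expressions above compute run lengths from a boolean mask of change points. A minimal self-contained sketch of that idiom (with a hypothetical `codes` array; this is not the actual pandas implementation):

```python
import numpy as np

# Hypothetical sorted group/value codes, standing in for the real data.
codes = np.array([0, 0, 0, 1, 1, 2])

# True wherever a new run of codes starts (the role `changes` plays above).
changes = codes[1:] != codes[:-1]

# Pad with sentinels so every run has both a start and an end boundary;
# the gaps between consecutive boundary indices are the run lengths.
run_lengths = np.diff(np.nonzero(np.r_[True, changes, True])[0])
print(run_lengths)  # [3 2 1]
```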
I would welcome any help or suggestions for how to make
SeriesGroupBy.value_counts faster.

Thanks,

Evan Fuller
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 