From th020394 at gmail.com  Sat Oct 7 01:55:37 2017
From: th020394 at gmail.com (Tyler Hardin)
Date: Sat, 7 Oct 2017 01:55:37 -0400
Subject: [Pandas-dev] Mean, stdev, and var for array elements
Message-ID:

Hi,

I'd really like to be able to calculate the mean, stdev, and var element-wise across the Series stored within cells of a dataframe. It already works as I expect it to with sum.

Example:

import pandas as pd

a = pd.Series([1, 2, 3, 4]) * 1.
b = pd.Series([1, 2, 3, 4]) * 2.
c = pd.Series([1, 2, 3, 4]) * 3.
d = pd.Series([1, 2, 3, 4]) * 4.

df = pd.DataFrame({'a' : [a, b, c, d]}, index=[0, 1, 2, 3])

print(df.a.sum())

Output:

0    10.0
1    20.0
2    30.0
3    40.0
dtype: float64

This is very useful for embedding a third dimension within a single column (because it's only needed there) instead of going to a full MultiIndex.

For example, say you have a dataframe indexed on (date, stock), and in the dataframe you have columns for close pnl, close gmv, etc. Further, say you have a pnl_curve column holding a minute-indexed (intraday) timeseries (again, one per (date, stock)). That is, each (date, stock) has an associated intraday pnl curve (a pd.Series object) in that column.

From that setup, I want to reduce away the stock dimension. I might want to sum the pnl curves (to get an overall intraday pnl curve for each date). This actually works already (it's as simple as df.pnl_curve.sum()). But I'd also like to plot the mean pnl and std bands around that. Neither mean nor std works for this.

Can someone implement these functions for series, or help me do it right? Or is there a better way?

It seems the implementation for mean is as simple as removing _ensure_numeric in core/nanops.py. As for nanvar, I'm really not sure how to 1) use numpy functions to calculate what I need and 2) extend the function to accept dtype object without making it more likely to give cryptic errors when someone accidentally uses it with objects. (Pandas seems to be careful to throw meaningful ValueErrors and TypeErrors when it can; amateurishly loosening restrictions would defeat that.)

Regards,
Tyler

From th020394 at gmail.com  Sat Oct 7 02:04:12 2017
From: th020394 at gmail.com (Tyler Hardin)
Date: Sat, 7 Oct 2017 02:04:12 -0400
Subject: [Pandas-dev] Mean, stdev, and var for array elements
In-Reply-To:
References:
Message-ID:

That was a bad example. Better:

import pandas as pd

a = pd.Series([1, 2, 3, 4]) * 1.
b = pd.Series([1, 2, 3, 4]) * 2.
c = pd.Series([1, 2, 3, 4]) * 3.
d = pd.Series([1, 2, 3, 4]) * 4.

df = pd.DataFrame({
    'date'      : ['20170103'] * 4,
    'stock'     : ['AAPL', 'GOOG', 'MSFT', 'TSLA'],
    'pnl_curve' : [a, b, c, d]
})

def proc_grp(grp):
    return pd.DataFrame({'pnl_curve' : grp.pnl_curve.sum()})

print(df.groupby('date').apply(proc_grp))

Output:

            pnl_curve
date
20170103 0       10.0
         1       20.0
         2       30.0
         3       40.0

The goal is meaningful dimensionality reduction with curves.
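For anyone following along, below is a minimal sketch of one possible workaround using existing pandas functionality rather than a change to nanops: concatenate the per-stock curves column-wise and reduce across columns. It assumes every cell holds a pd.Series aligned on the same intraday index, and the mean_std_bands helper name is made up for illustration.

import pandas as pd

a = pd.Series([1, 2, 3, 4]) * 1.
b = pd.Series([1, 2, 3, 4]) * 2.
c = pd.Series([1, 2, 3, 4]) * 3.
d = pd.Series([1, 2, 3, 4]) * 4.

df = pd.DataFrame({
    'date'      : ['20170103'] * 4,
    'stock'     : ['AAPL', 'GOOG', 'MSFT', 'TSLA'],
    'pnl_curve' : [a, b, c, d]
})

def mean_std_bands(grp):
    # Hypothetical helper: stack the per-stock curves side by side
    # (one column per stock), then reduce across columns to get the
    # element-wise mean and one-std bands.
    curves = pd.concat(list(grp.pnl_curve), axis=1)
    mu, sigma = curves.mean(axis=1), curves.std(axis=1)
    return pd.DataFrame({'mean': mu, 'lower': mu - sigma, 'upper': mu + sigma})

print(df.groupby('date').apply(mean_std_bands))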
From tom.w.augspurger at gmail.com  Mon Oct 16 12:29:41 2017
From: tom.w.augspurger at gmail.com (Tom Augspurger)
Date: Mon, 16 Oct 2017 11:29:41 -0500
Subject: [Pandas-dev] ANN: pandas v0.21.0rc1 - RELEASE CANDIDATE
Message-ID:

Hi,

I'm pleased to announce the availability of the first release candidate of pandas 0.21.0. Please try this RC and report any issues on the pandas issue tracker. We will be releasing 0.21.0 final in 1-2 weeks.

This is a major release from 0.20.3 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- Integration with Apache Parquet, including a new top-level read_parquet function and DataFrame.to_parquet method, see here
- New user-facing pandas.api.types.CategoricalDtype for specifying categoricals independent of the data, see here
- The behavior of sum and prod on all-NaN Series/DataFrames is now consistent and no longer depends on whether bottleneck is installed, see here
- Compatibility fixes for pypy, see here

Check the whatsnew for detailed changes, including backwards incompatible changes and deprecations.

Please report any issues you find on the pandas issue tracker.

The release candidate can be installed with conda from our development channel (builds for osx-64 and linux-64, for Python 2.7 and Python 3.6, are all available):

    conda install -c pandas pandas=0.21.0rc1

Or via PyPI:

    pip install --upgrade pip setuptools
    pip install --pre --upgrade --upgrade-strategy=only-if-needed pandas

Tom
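To illustrate the Parquet integration mentioned above, here is a minimal sketch of the round trip, assuming a Parquet engine such as pyarrow or fastparquet is installed (the file name is arbitrary):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# New in 0.21.0: write to and read back from the Parquet format.
# Requires a Parquet engine (pyarrow or fastparquet) to be installed.
df.to_parquet('example.parquet')
roundtrip = pd.read_parquet('example.parquet')

assert roundtrip.equals(df)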
From tom.w.augspurger at gmail.com  Sat Oct 28 14:53:32 2017
From: tom.w.augspurger at gmail.com (Tom Augspurger)
Date: Sat, 28 Oct 2017 13:53:32 -0500
Subject: [Pandas-dev] ANN: Pandas 0.21.0 Released
Message-ID:

Hi,

I'm pleased to announce the availability of pandas 0.21.0. This is a major release from 0.20.3 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

- Integration with Apache Parquet, including a new top-level read_parquet function and a DataFrame.to_parquet method, see here.
- New user-facing dtype pandas.api.types.CategoricalDtype for specifying categoricals independent of the data, see here.
- The behavior of sum and prod on all-NaN Series/DataFrames is now consistent and no longer depends on whether bottleneck is installed, see here.
- Compatibility fixes for pypy, see here.
- Additions to the drop, reindex, and rename API to make them more consistent, see here.
- Addition of the new methods DataFrame.infer_objects (see here) and GroupBy.pipe (see here).
- Indexing with a list of labels, where one or more of the labels is missing, is deprecated and will raise a KeyError in a future version, see here.

Check the whatsnew for detailed changes, including backwards incompatible changes and deprecations.

Please report any issues you find on the pandas issue tracker.

Binary packages will be available in the defaults and conda-forge channels shortly.

    conda install pandas

Wheels and a source distribution are available on PyPI.

    pip install --upgrade pip setuptools
    pip install --upgrade --upgrade-strategy=only-if-needed pandas

Tom
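A short sketch of two of the smaller additions listed above (the new drop keyword arguments and DataFrame.infer_objects), assuming pandas 0.21.0; the column names are arbitrary:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4., 5., 6.], 'c': ['x', 'y', 'z']})

# New in 0.21.0: drop accepts index=/columns= keywords as an
# alternative to labels plus an axis argument.
trimmed = df.drop(columns=['c'])

# New in 0.21.0: infer_objects() performs soft dtype inference on
# object columns, converting them back to numeric dtypes where possible.
converted = df.astype(object).infer_objects()

print(trimmed.columns.tolist())   # ['a', 'b']
print(converted.dtypes)           # a: int64, b: float64, c: object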