From stefan.pankoke at googlemail.com Fri May 15 16:17:57 2020
From: stefan.pankoke at googlemail.com (Dr. Leo)
Date: Fri, 15 May 2020 22:17:57 +0200
Subject: [Pandas-dev] [ANN] pandaSDMX 1.0.0 released
In-Reply-To: 
References: 
Message-ID: 

Hi,

Two years after the 0.9 release I am pleased to announce the availability of pandaSDMX 1.0.0. This is a major feature release including rewrites in virtually all areas. Certain backwards-incompatible API changes appeared inevitable, but they are largely outweighed by a host of new and enhanced features.

Highlights include:

* a more complete and accurate implementation of the SDMX information model, including hierarchical code lists and facets. This long-overdue feature will considerably facilitate the interpretation and representation of data and metadata.
* better handling of data-source idiosyncrasies, which should ease data acquisition in corner cases
* a streamlined API and more informative string representations for interactive data acquisition and exploration
* the information model has been decoupled from the XML and JSON readers, so that arbitrary data sources outside the SDMX ecosystem can be embedded programmatically. This shift in architecture could eventually seed the transformation of pandaSDMX from a pure client library into an end-to-end SDMX platform for the generation of SDMX files served over HTTP.
* easier and more flexible configuration of HTTP connections through user-provided requests Sessions
* a vastly extended test suite and streamlined documentation
* a modern code base leveraging typing and pydantic

Quick start and links
-------------------------

* Installation (requires Python 3.7): $ pip install pandasdmx
* Documentation: https://pandasdmx.readthedocs.io/
* GitHub: https://github.com/dr-leo/pandaSDMX

Roadmap
---------

* add an intake driver/plugin exposing SDMX datasets and metadata
* provide a conda package
* InternationalStrings: re-implement support for locale selection
* support SDMX-JSON structure messages, which have recently been added to the SDMX standard
* fix a few known issues

Help wanted!

Credits
-----------

Many great people have generously contributed to this release, even though the lion's share of the development was temporarily shouldered by a single collaborator. A big thanks to all of them!

What is pandaSDMX?
----------------------

pandaSDMX is an Apache 2.0-licensed Python library that implements SDMX 2.1 (ISO 17369:2013), a format for the exchange of statistical data and metadata used by national statistical agencies, central banks, and international organisations.

pandaSDMX can be used to:

* explore the data and metadata available from many data providers such as the World Bank, International Monetary Fund, Eurostat, the ECB, OECD, and United Nations;
* parse data and metadata in SDMX-ML (XML) or SDMX-JSON formats, either:
  o from local files, or
  o retrieved from SDMX web services, with query validation and caching;
* convert data and metadata into pandas objects, for use with the analysis, plotting, and other tools in the Python data ecosystem.
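A minimal sketch of the workflow described above, assuming the Request/to_pandas API documented for the 1.0 release; the dataflow id, key, and parameters below are illustrative assumptions, not taken from the announcement:

```python
# Hedged sketch of a pandaSDMX 1.0 session: query a data provider and convert
# the response to pandas objects. The dataflow id ("EXR"), key, and params are
# assumptions chosen for illustration; adjust them to the provider you query.
import pandasdmx as sdmx

ecb = sdmx.Request("ECB")                       # client for the ECB SDMX web service
msg = ecb.data(
    "EXR",                                      # exchange-rate dataflow (assumed id)
    key={"CURRENCY": "USD"},                    # filter on one dimension (assumed key)
    params={"startPeriod": "2019"},
)
data = sdmx.to_pandas(msg)                      # pandas object(s) with a MultiIndex
print(data.head())
```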
From jorisvandenbossche at gmail.com Mon May 25 17:39:13 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 25 May 2020 23:39:13 +0200
Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks
Message-ID: 

Hi list,

Rewriting the BlockManager based on a simpler collection of 1D arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here).

But since the topic came up again recently at the last online dev meeting (and Uwe Korn also wrote a nice blog post about it yesterday), I thought I would do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.
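For context, a small sketch of what "consolidation" means in the current BlockManager. This pokes at the private ._data attribute of pandas ~1.0, so it is an illustration against internal API: attribute names and exact block counts may differ between versions.

```python
# Sketch only: inspect how the current (consolidating) BlockManager groups
# columns into 2D blocks. Uses private internals (._data), which may change.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3, dtype="float64"),
    "b": np.arange(3, dtype="float64"),
    "c": np.arange(3, dtype="int64"),
})

# The two float64 columns are consolidated into a single 2D block, while the
# int64 column gets its own block -> typically 2 blocks in total.
for blk in df._data.blocks:
    print(blk.dtype, blk.shape)

# Inserting a column adds a new, unconsolidated block; a later operation may
# trigger re-consolidation, which copies the float data into one big block.
df["d"] = np.arange(3, dtype="float64")
print(df._data.nblocks)
```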
*Simplification of the internals*

It's regularly brought up as a reason to have 2D ExtensionArrays (EAs) that right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks.
Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks.
Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.

Moreover, when going with all 1D blocks, we can not only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will *also* be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than one with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.

I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.

*Performance*

Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood.
However, based on recent experimentation with this (e.g. triggered by the block-wise frame ops PR, and see also some benchmarks I just posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.

Further, there are also operations that will *benefit* from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown doing frame/frame operations column-wise is currently due to the consolidation at the end). And operations like adding a column, concatenating (with axis=1) or merging dataframes will be much faster when no consolidation is needed.

Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about:

- With limited effort optimizing the column-wise code paths in the internals, we can get a long way.
- After that, if needed, we can still consider whether parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager).

*Possibility to get better copy/view semantics*

Pandas is notorious for how much it copies ("you need 10x your dataframe's size in available memory"), and having 1D blocks will allow us to address part of those concerns.

*No consolidation = less copying.* Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would make it possible to actually add a single column to a dataframe without having to copy the full dataframe.

*Copy / view semantics* Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy and unpredictable (meaning: it depends on the actual block layout, which in practice is unpredictable for a user unaware of the block layout).
Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this. ------------------------------ *So what are the reasons to have 2D blocks?* I personally don't directly see reasons to have 2D blocks *for pandas itself* (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up. But I think there are certainly use cases where 2D blocks can be useful, but typically "external" (but nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. However, I personally think there are possible ways that we can still accommodate for those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. (but will stop here, as this mail is getting already long ..). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Mon May 25 18:45:57 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Mon, 25 May 2020 15:45:57 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for writing this up, Joris. Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this wont just be one massive PR On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi list, > > Rewriting the BlockManager based on a simpler collection of 1D-arrays is > actually on our roadmap (see here > ), > and I also touched on it in a mailing list discussion about pandas 2.0 > earlier this year (see here > ). > > But since the topic came up again recently at the last online dev meeting > (and also Uwe Korn who wrote a nice blog post > about this > yesterday), I thought to do a write-up of my thoughts on why I think we > should actually move towards a simpler, non-consolidating BlockManager with > 1D blocks. > > > *Simplication of the internals* > > It's regularly brought up as a reason to have 2D EextensionArrays (EAs) > because right now we have a lot of special cases for 1D EAs in the > internals. But to be clear: the additional complexity does not come from 1D > EAs in itself, it comes from the fact that we have a mixture of 2D and 1D > blocks. > Solving this would require a consistent block dimension, and thus removing > this added complexity can be done in two ways: have all 1D blocks, or have > all 2D blocks. > Just to say: IMO, this is not an argument in favor of 2D blocks / > consolidation. > > Moreover, when going with all 1D blocks, we cannot only remove the added > complexity from dealing with the mixture of 1D/2D blocks, we will *also* be > able to reduce the complexity of dealing with 2D blocks. A BlockManager > with 2D blocks is inherently more complex than with 1D blocks, as one needs > to deal with proper alignment of the blocks, a more complex "placement" > logic of the blocks, etc. > > I think we would be able to simplify the internals a lot by going with a > BlockManager as a store of 1D arrays. > > > *Performance* > > Performance is typically given as a reason to have consolidated, 2D > blocks. 
And of course, certain operations (especially row-wise operations, > or on dataframes with more columns as rows) will always be faster when done > on a 2D numpy array under the hood. > However, based on recent experimentation with this (eg triggered by the block-wise > frame ops PR , and see > also some benchmarks I justed posted in #10556 > > / this gist > ), > I also think that for many operations and with decent-sized dataframes, > this performance penalty is actually quite OK. > > Further, there are also operations that will *benefit* from 1D blocks. > First, operations that now involve aligning/splitting blocks, > re-consolidation, .. will benefit (e.g. a large part of the slowdown doing > frame/frame operations column-wise is currently due to the consolidation in > the end). And operations like adding a column, concatting (with axis=1) or > merging dataframes will be much faster when no consolidation is needed. > > Personally, I am convinced that with some effort, we can get on-par or > sometimes even better performance with 1D blocks compared to the > performance we have now for those cases that 90+% of our users care about: > > - With limited effort optimizing the column-wise code paths in the > internals, we can get a long way. > - After that, if needed, we can still consider if parts of the > internals could be cythonized to further improve certain bottlenecks (and > actually cythonizing this will also be simpler for a simpler > non-consolidating block manager). > > > *Possibility to get better copy/view semantics* > > Pandas is badly known for how much it copies ("you need 10x the memory > available as the size of your dataframe"), and having 1D blocks will allow > us to address part of those concerns. > > *No consolidation = less copying.* Regularly consolidating introduces > copies, and thus removing consolidation will mean less copies. For example, > this would enable that you can actually add a single column to a dataframe > without having to copy to the full dataframe. > > *Copy / view semantics* Recently there has been discussion again around > whether selecting columns should be a copy or a view, and some other issues > were opened with questions about views/copies when slicing columns. In the > consolidated 2D block layout this will always be inherently messy, and > unpredictable (meaning: depending on the actual block layout, which means > in practice unpredictable for the user unaware of the block layout). > Going with a non-consolidated BlockManager should at least allow us to get > better / more understandable semantics around this. > > > ------------------------------ > > *So what are the reasons to have 2D blocks?* > > I personally don't directly see reasons to have 2D blocks *for pandas > itself* (apart from performance in certain row-wise use cases, and except > for the fact that we have "always done it like this"). But quite likely I > am missing reasons, so please bring them up. > > But I think there are certainly use cases where 2D blocks can be useful, > but typically "external" (but nonetheless important) use cases: conversion > to/from numpy, xarray, etc. A typical example that has recently come up is > scikit-learn, where they want to have a cheap dataframe <-> numpy array > roundtrip for use in their pipelines. > However, I personally think there are possible ways that we can still > accommodate for those use cases, with some effort, while still having 1D > Blocks in pandas itself. 
So IMO this is not sufficient to warrant the > complexity of 2D blocks in pandas. > (but will stop here, as this mail is getting already long ..). > > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Tue May 26 03:50:37 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 26 May 2020 09:50:37 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Hi Joris, Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 What do you think of these usecases? Thanks, Adrin On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi list, > > Rewriting the BlockManager based on a simpler collection of 1D-arrays is > actually on our roadmap (see here > ), > and I also touched on it in a mailing list discussion about pandas 2.0 > earlier this year (see here > ). > > But since the topic came up again recently at the last online dev meeting > (and also Uwe Korn who wrote a nice blog post > about this > yesterday), I thought to do a write-up of my thoughts on why I think we > should actually move towards a simpler, non-consolidating BlockManager with > 1D blocks. > > > *Simplication of the internals* > > It's regularly brought up as a reason to have 2D EextensionArrays (EAs) > because right now we have a lot of special cases for 1D EAs in the > internals. But to be clear: the additional complexity does not come from 1D > EAs in itself, it comes from the fact that we have a mixture of 2D and 1D > blocks. > Solving this would require a consistent block dimension, and thus removing > this added complexity can be done in two ways: have all 1D blocks, or have > all 2D blocks. > Just to say: IMO, this is not an argument in favor of 2D blocks / > consolidation. > > Moreover, when going with all 1D blocks, we cannot only remove the added > complexity from dealing with the mixture of 1D/2D blocks, we will *also* be > able to reduce the complexity of dealing with 2D blocks. A BlockManager > with 2D blocks is inherently more complex than with 1D blocks, as one needs > to deal with proper alignment of the blocks, a more complex "placement" > logic of the blocks, etc. > > I think we would be able to simplify the internals a lot by going with a > BlockManager as a store of 1D arrays. > > > *Performance* > > Performance is typically given as a reason to have consolidated, 2D > blocks. And of course, certain operations (especially row-wise operations, > or on dataframes with more columns as rows) will always be faster when done > on a 2D numpy array under the hood. 
> However, based on recent experimentation with this (eg triggered by the block-wise > frame ops PR , and see > also some benchmarks I justed posted in #10556 > > / this gist > ), > I also think that for many operations and with decent-sized dataframes, > this performance penalty is actually quite OK. > > Further, there are also operations that will *benefit* from 1D blocks. > First, operations that now involve aligning/splitting blocks, > re-consolidation, .. will benefit (e.g. a large part of the slowdown doing > frame/frame operations column-wise is currently due to the consolidation in > the end). And operations like adding a column, concatting (with axis=1) or > merging dataframes will be much faster when no consolidation is needed. > > Personally, I am convinced that with some effort, we can get on-par or > sometimes even better performance with 1D blocks compared to the > performance we have now for those cases that 90+% of our users care about: > > - With limited effort optimizing the column-wise code paths in the > internals, we can get a long way. > - After that, if needed, we can still consider if parts of the > internals could be cythonized to further improve certain bottlenecks (and > actually cythonizing this will also be simpler for a simpler > non-consolidating block manager). > > > *Possibility to get better copy/view semantics* > > Pandas is badly known for how much it copies ("you need 10x the memory > available as the size of your dataframe"), and having 1D blocks will allow > us to address part of those concerns. > > *No consolidation = less copying.* Regularly consolidating introduces > copies, and thus removing consolidation will mean less copies. For example, > this would enable that you can actually add a single column to a dataframe > without having to copy to the full dataframe. > > *Copy / view semantics* Recently there has been discussion again around > whether selecting columns should be a copy or a view, and some other issues > were opened with questions about views/copies when slicing columns. In the > consolidated 2D block layout this will always be inherently messy, and > unpredictable (meaning: depending on the actual block layout, which means > in practice unpredictable for the user unaware of the block layout). > Going with a non-consolidated BlockManager should at least allow us to get > better / more understandable semantics around this. > > > ------------------------------ > > *So what are the reasons to have 2D blocks?* > > I personally don't directly see reasons to have 2D blocks *for pandas > itself* (apart from performance in certain row-wise use cases, and except > for the fact that we have "always done it like this"). But quite likely I > am missing reasons, so please bring them up. > > But I think there are certainly use cases where 2D blocks can be useful, > but typically "external" (but nonetheless important) use cases: conversion > to/from numpy, xarray, etc. A typical example that has recently come up is > scikit-learn, where they want to have a cheap dataframe <-> numpy array > roundtrip for use in their pipelines. > However, I personally think there are possible ways that we can still > accommodate for those use cases, with some effort, while still having 1D > Blocks in pandas itself. So IMO this is not sufficient to warrant the > complexity of 2D blocks in pandas. > (but will stop here, as this mail is getting already long ..). 
> > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 04:35:17 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 10:35:17 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for those links! Personally, I see the "roundtrip conversion to/from sparse matrices" a bit as in the same bucket as conversion to/from a 2D numpy array. Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas' internals and block several other improvements? It's a trade-off that we need to make. Moreover, I think that we could accommodate the important part of those use cases also with a column-store DataFrame, with some effort (but with less complexity as a consolidated BlockManager). Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block). Thinking out loud here, didn't try anything in code: - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. - We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so this would need some more exploration to see if how this would be possible (without duplicating a lot), and some buy-in from people interested in this. I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case. Joris On Tue, 26 May 2020 at 09:50, Adrin wrote: > Hi Joris, > > Thanks for the summary. I think another missing point is the roundtrip > conversion to/from sparse matrices. 
> There are some benchmarks and discussion here; > https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 > and here's some discussion on the pandas issue tracker: > https://github.com/pandas-dev/pandas/issues/33182 > and some benchmark by Tom, assuming pandas would accept a 2D sparse array: > https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 > > What do you think of these usecases? > > Thanks, > Adrin > > On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi list, >> >> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >> actually on our roadmap (see here >> ), >> and I also touched on it in a mailing list discussion about pandas 2.0 >> earlier this year (see here >> >> ). >> >> But since the topic came up again recently at the last online dev meeting >> (and also Uwe Korn who wrote a nice blog post >> about this >> yesterday), I thought to do a write-up of my thoughts on why I think we >> should actually move towards a simpler, non-consolidating BlockManager with >> 1D blocks. >> >> >> *Simplication of the internals* >> >> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >> because right now we have a lot of special cases for 1D EAs in the >> internals. But to be clear: the additional complexity does not come from 1D >> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >> blocks. >> Solving this would require a consistent block dimension, and thus >> removing this added complexity can be done in two ways: have all 1D blocks, >> or have all 2D blocks. >> Just to say: IMO, this is not an argument in favor of 2D blocks / >> consolidation. >> >> Moreover, when going with all 1D blocks, we cannot only remove the added >> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be >> able to reduce the complexity of dealing with 2D blocks. A BlockManager >> with 2D blocks is inherently more complex than with 1D blocks, as one needs >> to deal with proper alignment of the blocks, a more complex "placement" >> logic of the blocks, etc. >> >> I think we would be able to simplify the internals a lot by going with a >> BlockManager as a store of 1D arrays. >> >> >> *Performance* >> >> Performance is typically given as a reason to have consolidated, 2D >> blocks. And of course, certain operations (especially row-wise operations, >> or on dataframes with more columns as rows) will always be faster when done >> on a 2D numpy array under the hood. >> However, based on recent experimentation with this (eg triggered by the block-wise >> frame ops PR , and see >> also some benchmarks I justed posted in #10556 >> >> / this gist >> ), >> I also think that for many operations and with decent-sized dataframes, >> this performance penalty is actually quite OK. >> >> Further, there are also operations that will *benefit* from 1D blocks. >> First, operations that now involve aligning/splitting blocks, >> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >> frame/frame operations column-wise is currently due to the consolidation in >> the end). And operations like adding a column, concatting (with axis=1) or >> merging dataframes will be much faster when no consolidation is needed. 
>> >> Personally, I am convinced that with some effort, we can get on-par or >> sometimes even better performance with 1D blocks compared to the >> performance we have now for those cases that 90+% of our users care about: >> >> - With limited effort optimizing the column-wise code paths in the >> internals, we can get a long way. >> - After that, if needed, we can still consider if parts of the >> internals could be cythonized to further improve certain bottlenecks (and >> actually cythonizing this will also be simpler for a simpler >> non-consolidating block manager). >> >> >> *Possibility to get better copy/view semantics* >> >> Pandas is badly known for how much it copies ("you need 10x the memory >> available as the size of your dataframe"), and having 1D blocks will allow >> us to address part of those concerns. >> >> *No consolidation = less copying.* Regularly consolidating introduces >> copies, and thus removing consolidation will mean less copies. For example, >> this would enable that you can actually add a single column to a dataframe >> without having to copy to the full dataframe. >> >> *Copy / view semantics* Recently there has been discussion again around >> whether selecting columns should be a copy or a view, and some other issues >> were opened with questions about views/copies when slicing columns. In the >> consolidated 2D block layout this will always be inherently messy, and >> unpredictable (meaning: depending on the actual block layout, which means >> in practice unpredictable for the user unaware of the block layout). >> Going with a non-consolidated BlockManager should at least allow us to >> get better / more understandable semantics around this. >> >> >> ------------------------------ >> >> *So what are the reasons to have 2D blocks?* >> >> I personally don't directly see reasons to have 2D blocks *for pandas >> itself* (apart from performance in certain row-wise use cases, and >> except for the fact that we have "always done it like this"). But quite >> likely I am missing reasons, so please bring them up. >> >> But I think there are certainly use cases where 2D blocks can be useful, >> but typically "external" (but nonetheless important) use cases: conversion >> to/from numpy, xarray, etc. A typical example that has recently come up is >> scikit-learn, where they want to have a cheap dataframe <-> numpy array >> roundtrip for use in their pipelines. >> However, I personally think there are possible ways that we can still >> accommodate for those use cases, with some effort, while still having 1D >> Blocks in pandas itself. So IMO this is not sufficient to warrant the >> complexity of 2D blocks in pandas. >> (but will stop here, as this mail is getting already long ..). >> >> Joris >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 04:55:08 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 10:55:08 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 00:46, Brock Mendel wrote: > Thanks for writing this up, Joris. Assuming we go down this path, do you > have an idea of how we get from here to there incrementally? i.e. 
> presumably this wont just be one massive PR > Yes, this is certainly not a one-PR change. I think there are multiple options for working towards this, that are worth discussing. But personally, I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions in an agreed-upon roadmap. (and of course, it's not because something is on our roadmap that it can't be questioned and discussed again in the future, as we are also doing now). --- Some thoughts on possible options: - We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing some code paths (eg constructing a new block) specifically for the case of 1D ExtensionBlocks. - A "consolidation policy" option similarly as in the branch discussed in https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So we could add 1D versions of our numeric blocks as well. But that would probably add a lot of complexity, although temporary, to the Blocks, so maybe not an ideal path forward. - Add a version of the ExtensionBlock but that can work with numpy arrays instead of extension arrays, or actually use the "PandasArrays" to store it them in the existing ExtensionBlock (so to already start using the existing 1D blocks without requiring all dtypes to be extension dtypes). Those are all about BlockManager with 1D blocks. Once we only have 1D Blocks, I suppose there are many things we could simplify in the current BlockManager. The intermediate step of the current BlockManager with 1D blocks might not be an optimal situation, but seems the easiest as intermediate goal in practice. It probably also depends on how much "backwards compatibility" or "transition period" we want to provide. > On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi list, >> >> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >> actually on our roadmap (see here >> ), >> and I also touched on it in a mailing list discussion about pandas 2.0 >> earlier this year (see here >> >> ). >> >> But since the topic came up again recently at the last online dev meeting >> (and also Uwe Korn who wrote a nice blog post >> about this >> yesterday), I thought to do a write-up of my thoughts on why I think we >> should actually move towards a simpler, non-consolidating BlockManager with >> 1D blocks. >> >> >> *Simplication of the internals* >> >> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >> because right now we have a lot of special cases for 1D EAs in the >> internals. But to be clear: the additional complexity does not come from 1D >> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >> blocks. >> Solving this would require a consistent block dimension, and thus >> removing this added complexity can be done in two ways: have all 1D blocks, >> or have all 2D blocks. >> Just to say: IMO, this is not an argument in favor of 2D blocks / >> consolidation. 
>> >> Moreover, when going with all 1D blocks, we cannot only remove the added >> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be >> able to reduce the complexity of dealing with 2D blocks. A BlockManager >> with 2D blocks is inherently more complex than with 1D blocks, as one needs >> to deal with proper alignment of the blocks, a more complex "placement" >> logic of the blocks, etc. >> >> I think we would be able to simplify the internals a lot by going with a >> BlockManager as a store of 1D arrays. >> >> >> *Performance* >> >> Performance is typically given as a reason to have consolidated, 2D >> blocks. And of course, certain operations (especially row-wise operations, >> or on dataframes with more columns as rows) will always be faster when done >> on a 2D numpy array under the hood. >> However, based on recent experimentation with this (eg triggered by the block-wise >> frame ops PR , and see >> also some benchmarks I justed posted in #10556 >> >> / this gist >> ), >> I also think that for many operations and with decent-sized dataframes, >> this performance penalty is actually quite OK. >> >> Further, there are also operations that will *benefit* from 1D blocks. >> First, operations that now involve aligning/splitting blocks, >> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >> frame/frame operations column-wise is currently due to the consolidation in >> the end). And operations like adding a column, concatting (with axis=1) or >> merging dataframes will be much faster when no consolidation is needed. >> >> Personally, I am convinced that with some effort, we can get on-par or >> sometimes even better performance with 1D blocks compared to the >> performance we have now for those cases that 90+% of our users care about: >> >> - With limited effort optimizing the column-wise code paths in the >> internals, we can get a long way. >> - After that, if needed, we can still consider if parts of the >> internals could be cythonized to further improve certain bottlenecks (and >> actually cythonizing this will also be simpler for a simpler >> non-consolidating block manager). >> >> >> *Possibility to get better copy/view semantics* >> >> Pandas is badly known for how much it copies ("you need 10x the memory >> available as the size of your dataframe"), and having 1D blocks will allow >> us to address part of those concerns. >> >> *No consolidation = less copying.* Regularly consolidating introduces >> copies, and thus removing consolidation will mean less copies. For example, >> this would enable that you can actually add a single column to a dataframe >> without having to copy to the full dataframe. >> >> *Copy / view semantics* Recently there has been discussion again around >> whether selecting columns should be a copy or a view, and some other issues >> were opened with questions about views/copies when slicing columns. In the >> consolidated 2D block layout this will always be inherently messy, and >> unpredictable (meaning: depending on the actual block layout, which means >> in practice unpredictable for the user unaware of the block layout). >> Going with a non-consolidated BlockManager should at least allow us to >> get better / more understandable semantics around this. 
>> >> >> ------------------------------ >> >> *So what are the reasons to have 2D blocks?* >> >> I personally don't directly see reasons to have 2D blocks *for pandas >> itself* (apart from performance in certain row-wise use cases, and >> except for the fact that we have "always done it like this"). But quite >> likely I am missing reasons, so please bring them up. >> >> But I think there are certainly use cases where 2D blocks can be useful, >> but typically "external" (but nonetheless important) use cases: conversion >> to/from numpy, xarray, etc. A typical example that has recently come up is >> scikit-learn, where they want to have a cheap dataframe <-> numpy array >> roundtrip for use in their pipelines. >> However, I personally think there are possible ways that we can still >> accommodate for those use cases, with some effort, while still having 1D >> Blocks in pandas itself. So IMO this is not sufficient to warrant the >> complexity of 2D blocks in pandas. >> (but will stop here, as this mail is getting already long ..). >> >> Joris >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhochy at gmail.com Tue May 26 06:28:16 2020 From: xhochy at gmail.com (Uwe L. Korn) Date: Tue, 26 May 2020 12:28:16 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Hello all, thanks Joris for starting this thread. For myself, I struggle a bit to understand the cases that are made for the BlockManager benefits. The examples are mostly operations that act on two full DataFrames like "df1 + df2" or come from the fact that one wants to keep a single-type 2D matrix together with column labels but not acutally make use of pandas functionality afterwards. In the code I write on a day-to-day basis, we don't have these use cases thus I'm struggling to understand the real-world benefit of having these operations supported as efficiently as possible in pandas. Even when using scikit-learn pipelines, we have for as long as possible heterogeneously typed DataFrames and only convert to a single-type matrix as late as possible. Thus can anyone enlighten me in which real-world use cases this needs to supported in pandas? Best Uwe Am Di., 26. Mai 2020 um 10:55 Uhr schrieb Joris Van den Bossche < jorisvandenbossche at gmail.com>: > On Tue, 26 May 2020 at 00:46, Brock Mendel wrote: > >> Thanks for writing this up, Joris. Assuming we go down this path, do you >> have an idea of how we get from here to there incrementally? i.e. >> presumably this wont just be one massive PR >> > > Yes, this is certainly not a one-PR change. I think there are multiple > options for working towards this, that are worth discussing. > > But personally, I would first like to focus on the "assuming we go down > this path" part. Let's discuss the pros and cons and trade-offs, and try to > turn assumptions in an agreed-upon roadmap. > (and of course, it's not because something is on our roadmap that it can't > be questioned and discussed again in the future, as we are also doing now). > > --- > > Some thoughts on possible options: > > - We briefly discussed before the idea of using (nullable) extension > dtypes for all dtypes by default in pandas 2.0. 
If we strive towards that, > and assuming we keep the current 1D-restriction on ExtensionBlock, then we > would "automatically" get a BlockManager with 1D blocks. And we could then > focus on optimizing some code paths (eg constructing a new block) > specifically for the case of 1D ExtensionBlocks. > - A "consolidation policy" option similarly as in the branch discussed in > https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch > still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) > and not actually 1D blocks. So we could add 1D versions of our numeric > blocks as well. But that would probably add a lot of complexity, although > temporary, to the Blocks, so maybe not an ideal path forward. > - Add a version of the ExtensionBlock but that can work with numpy arrays > instead of extension arrays, or actually use the "PandasArrays" to store it > them in the existing ExtensionBlock (so to already start using the existing > 1D blocks without requiring all dtypes to be extension dtypes). > > Those are all about BlockManager with 1D blocks. Once we only have 1D > Blocks, I suppose there are many things we could simplify in the current > BlockManager. The intermediate step of the current BlockManager with 1D > blocks might not be an optimal situation, but seems the easiest as > intermediate goal in practice. > > It probably also depends on how much "backwards compatibility" or > "transition period" we want to provide. > > >> On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi list, >>> >>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >>> actually on our roadmap (see here >>> ), >>> and I also touched on it in a mailing list discussion about pandas 2.0 >>> earlier this year (see here >>> >>> ). >>> >>> But since the topic came up again recently at the last online dev >>> meeting (and also Uwe Korn who wrote a nice blog post >>> about >>> this yesterday), I thought to do a write-up of my thoughts on why I think >>> we should actually move towards a simpler, non-consolidating BlockManager >>> with 1D blocks. >>> >>> >>> *Simplication of the internals* >>> >>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>> because right now we have a lot of special cases for 1D EAs in the >>> internals. But to be clear: the additional complexity does not come from 1D >>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>> blocks. >>> Solving this would require a consistent block dimension, and thus >>> removing this added complexity can be done in two ways: have all 1D blocks, >>> or have all 2D blocks. >>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>> consolidation. >>> >>> Moreover, when going with all 1D blocks, we cannot only remove the added >>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* >>> be able to reduce the complexity of dealing with 2D blocks. A >>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>> as one needs to deal with proper alignment of the blocks, a more complex >>> "placement" logic of the blocks, etc. >>> >>> I think we would be able to simplify the internals a lot by going with a >>> BlockManager as a store of 1D arrays. >>> >>> >>> *Performance* >>> >>> Performance is typically given as a reason to have consolidated, 2D >>> blocks. 
And of course, certain operations (especially row-wise operations, >>> or on dataframes with more columns as rows) will always be faster when done >>> on a 2D numpy array under the hood. >>> However, based on recent experimentation with this (eg triggered by the block-wise >>> frame ops PR , and see >>> also some benchmarks I justed posted in #10556 >>> >>> / this gist >>> ), >>> I also think that for many operations and with decent-sized dataframes, >>> this performance penalty is actually quite OK. >>> >>> Further, there are also operations that will *benefit* from 1D blocks. >>> First, operations that now involve aligning/splitting blocks, >>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>> frame/frame operations column-wise is currently due to the consolidation in >>> the end). And operations like adding a column, concatting (with axis=1) or >>> merging dataframes will be much faster when no consolidation is needed. >>> >>> Personally, I am convinced that with some effort, we can get on-par or >>> sometimes even better performance with 1D blocks compared to the >>> performance we have now for those cases that 90+% of our users care about: >>> >>> - With limited effort optimizing the column-wise code paths in the >>> internals, we can get a long way. >>> - After that, if needed, we can still consider if parts of the >>> internals could be cythonized to further improve certain bottlenecks (and >>> actually cythonizing this will also be simpler for a simpler >>> non-consolidating block manager). >>> >>> >>> *Possibility to get better copy/view semantics* >>> >>> Pandas is badly known for how much it copies ("you need 10x the memory >>> available as the size of your dataframe"), and having 1D blocks will allow >>> us to address part of those concerns. >>> >>> *No consolidation = less copying.* Regularly consolidating introduces >>> copies, and thus removing consolidation will mean less copies. For example, >>> this would enable that you can actually add a single column to a dataframe >>> without having to copy to the full dataframe. >>> >>> *Copy / view semantics* Recently there has been discussion again around >>> whether selecting columns should be a copy or a view, and some other issues >>> were opened with questions about views/copies when slicing columns. In the >>> consolidated 2D block layout this will always be inherently messy, and >>> unpredictable (meaning: depending on the actual block layout, which means >>> in practice unpredictable for the user unaware of the block layout). >>> Going with a non-consolidated BlockManager should at least allow us to >>> get better / more understandable semantics around this. >>> >>> >>> ------------------------------ >>> >>> *So what are the reasons to have 2D blocks?* >>> >>> I personally don't directly see reasons to have 2D blocks *for pandas >>> itself* (apart from performance in certain row-wise use cases, and >>> except for the fact that we have "always done it like this"). But quite >>> likely I am missing reasons, so please bring them up. >>> >>> But I think there are certainly use cases where 2D blocks can be useful, >>> but typically "external" (but nonetheless important) use cases: conversion >>> to/from numpy, xarray, etc. A typical example that has recently come up is >>> scikit-learn, where they want to have a cheap dataframe <-> numpy array >>> roundtrip for use in their pipelines. 
>>> However, I personally think there are possible ways that we can still >>> accommodate for those use cases, with some effort, while still having 1D >>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>> complexity of 2D blocks in pandas. >>> (but will stop here, as this mail is getting already long ..). >>> >>> Joris >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Tue May 26 07:21:33 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 06:21:33 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Thanks for those links! > > Personally, I see the "roundtrip conversion to/from sparse matrices" a bit > as in the same bucket as conversion to/from a 2D numpy array. > Yes, both are important use cases. But the question we need to ask > ourselves is still: is this important enough to hugely complicate the > pandas' internals and block several other improvements? It's a trade-off > that we need to make. > > Moreover, I think that we could accommodate the important part of those > use cases also with a column-store DataFrame, with some effort (but with > less complexity as a consolidated BlockManager). > > Focusing on scikit-learn: in the end, you mostly care about cheap > roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame > to carry feature labels in between steps of a pipeline, correct? > Such cheap roundtripping is only possible anyway if you have a single > dtype for all columns (which is typically the case after some > transformation step). So you don't necessarily need consolidated blocks > specifically, but rather the ability to store a *single* 2D array/matrix in > a DataFrame (so kind of a single 2D block). > > Thinking out loud here, didn't try anything in code: > > - We could make the DataFrame construction from a 2D array/matrix kind of > "lazy" (or have an option to do it like this): upon construction just store > the 2D array as is, and only once you perform an actual operation on it, > convert to a columnar store. And that would make it possible to still get > the 2D array back with zero-copy, if all you did was passing this DataFrame > to the next step of the pipeline. > - We could take the above a step further and try to preserve the 2D array > under the hood in some "easy" operations (but again, limited to a single 2D > block/array, not multiple consolidated blocks). This is actually similar to > the DataMatrix that pandas had a very long time ago. Of course this adds > back complexity, so this would need some more exploration to see if how > this would be possible (without duplicating a lot), and some buy-in from > people interested in this. > > I think the first option should be fairly easy to do, and should solve a > large part of the concerns for scikit-learn (I think?). > I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. 
> I think the second idea is also interesting: IMO such a data structure > would be useful to have somewhere in the PyData ecosystem, and a worthwhile > discussion to think about where this could fit. Maybe the answer is simply: > use xarray for this use case (although there are still differences) ? That > are interesting discussions, but personally I would not complicate the core > pandas data model for heterogeneous dataframes to accommodate the > single-dtype + fixed number of columns use case. > The current prototype[1] accepts preserves both xarray and pandas data structures. [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > Joris > > On Tue, 26 May 2020 at 09:50, Adrin wrote: > >> Hi Joris, >> >> Thanks for the summary. I think another missing point is the roundtrip >> conversion to/from sparse matrices. >> There are some benchmarks and discussion here; >> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >> and here's some discussion on the pandas issue tracker: >> https://github.com/pandas-dev/pandas/issues/33182 >> and some benchmark by Tom, assuming pandas would accept a 2D sparse >> array: >> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >> >> What do you think of these usecases? >> >> Thanks, >> Adrin >> >> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi list, >>> >>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >>> actually on our roadmap (see here >>> ), >>> and I also touched on it in a mailing list discussion about pandas 2.0 >>> earlier this year (see here >>> >>> ). >>> >>> But since the topic came up again recently at the last online dev >>> meeting (and also Uwe Korn who wrote a nice blog post >>> about >>> this yesterday), I thought to do a write-up of my thoughts on why I think >>> we should actually move towards a simpler, non-consolidating BlockManager >>> with 1D blocks. >>> >>> >>> *Simplication of the internals* >>> >>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>> because right now we have a lot of special cases for 1D EAs in the >>> internals. But to be clear: the additional complexity does not come from 1D >>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>> blocks. >>> Solving this would require a consistent block dimension, and thus >>> removing this added complexity can be done in two ways: have all 1D blocks, >>> or have all 2D blocks. >>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>> consolidation. >>> >>> Moreover, when going with all 1D blocks, we cannot only remove the added >>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* >>> be able to reduce the complexity of dealing with 2D blocks. A >>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>> as one needs to deal with proper alignment of the blocks, a more complex >>> "placement" logic of the blocks, etc. >>> >>> I think we would be able to simplify the internals a lot by going with a >>> BlockManager as a store of 1D arrays. >>> >>> >>> *Performance* >>> >>> Performance is typically given as a reason to have consolidated, 2D >>> blocks. And of course, certain operations (especially row-wise operations, >>> or on dataframes with more columns as rows) will always be faster when done >>> on a 2D numpy array under the hood. 
>>> However, based on recent experimentation with this (eg triggered by the block-wise >>> frame ops PR , and see >>> also some benchmarks I justed posted in #10556 >>> >>> / this gist >>> ), >>> I also think that for many operations and with decent-sized dataframes, >>> this performance penalty is actually quite OK. >>> >>> Further, there are also operations that will *benefit* from 1D blocks. >>> First, operations that now involve aligning/splitting blocks, >>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>> frame/frame operations column-wise is currently due to the consolidation in >>> the end). And operations like adding a column, concatting (with axis=1) or >>> merging dataframes will be much faster when no consolidation is needed. >>> >>> Personally, I am convinced that with some effort, we can get on-par or >>> sometimes even better performance with 1D blocks compared to the >>> performance we have now for those cases that 90+% of our users care about: >>> >>> - With limited effort optimizing the column-wise code paths in the >>> internals, we can get a long way. >>> - After that, if needed, we can still consider if parts of the >>> internals could be cythonized to further improve certain bottlenecks (and >>> actually cythonizing this will also be simpler for a simpler >>> non-consolidating block manager). >>> >>> >>> *Possibility to get better copy/view semantics* >>> >>> Pandas is badly known for how much it copies ("you need 10x the memory >>> available as the size of your dataframe"), and having 1D blocks will allow >>> us to address part of those concerns. >>> >>> *No consolidation = less copying.* Regularly consolidating introduces >>> copies, and thus removing consolidation will mean less copies. For example, >>> this would enable that you can actually add a single column to a dataframe >>> without having to copy to the full dataframe. >>> >>> *Copy / view semantics* Recently there has been discussion again around >>> whether selecting columns should be a copy or a view, and some other issues >>> were opened with questions about views/copies when slicing columns. In the >>> consolidated 2D block layout this will always be inherently messy, and >>> unpredictable (meaning: depending on the actual block layout, which means >>> in practice unpredictable for the user unaware of the block layout). >>> Going with a non-consolidated BlockManager should at least allow us to >>> get better / more understandable semantics around this. >>> >>> >>> ------------------------------ >>> >>> *So what are the reasons to have 2D blocks?* >>> >>> I personally don't directly see reasons to have 2D blocks *for pandas >>> itself* (apart from performance in certain row-wise use cases, and >>> except for the fact that we have "always done it like this"). But quite >>> likely I am missing reasons, so please bring them up. >>> >>> But I think there are certainly use cases where 2D blocks can be useful, >>> but typically "external" (but nonetheless important) use cases: conversion >>> to/from numpy, xarray, etc. A typical example that has recently come up is >>> scikit-learn, where they want to have a cheap dataframe <-> numpy array >>> roundtrip for use in their pipelines. >>> However, I personally think there are possible ways that we can still >>> accommodate for those use cases, with some effort, while still having 1D >>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>> complexity of 2D blocks in pandas. 
>>> (but will stop here, as this mail is getting already long ..). >>> >>> Joris >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue May 26 08:16:53 2020 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 26 May 2020 08:16:53 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> A little historical perspective 10 years ago the standard input to a Dataframe was a single dtype 2D numpy array. This provides the following nice properties: - 0 cost construction, you can simply wrap Dataframe around the input with very little overhead. This provides a labeled array interface, gaining pandas users - very fast reductions; the block is passed to numpy directly for the reductions; numpy can then reduce with aligned memory access - almost all operations in pandas coerced to float64 on operations The block manager is optimized for this case as this was the original DataMatrix. It serves its purpose pretty well. In the last few years things have changed in the following ways: - dict of 1D numpy arrays is by far the most common construction - heterogenous dtypes have grown quite a bit, eg it?s now very common to use int8, float32; these are also preserved pretty well by pandas operations - non numpy backed dtypes are increasingly common To me removing the block manager is not about performance, rather about simplifying the code and mental model, though we should be mindful of construction from 2D inputs will require splitting and thus be not cheap (note that you can view the 1D slices but these are not memory aligned); this is a typical trap that folks get into; 1D looks all rosy but it all depends on usecase. I think it would be ok for pandas to move to dict of columns and simply document the non performing cases (eg very wide single dtypes or 2D construction); I suppose it?s also possible to reinvent the DataMatrix in a limited form but that of course adds complexity and would like to see that after a refactor. my 3c Jeff On May 26, 2020, at 7:22 AM, Tom Augspurger wrote: > > ? > > >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche wrote: >> Thanks for those links! >> >> Personally, I see the "roundtrip conversion to/from sparse matrices" a bit as in the same bucket as conversion to/from a 2D numpy array. >> Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas' internals and block several other improvements? It's a trade-off that we need to make. >> >> Moreover, I think that we could accommodate the important part of those use cases also with a column-store DataFrame, with some effort (but with less complexity as a consolidated BlockManager). >> >> Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? 
>> Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block). >> >> Thinking out loud here, didn't try anything in code: >> >> - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. >> - We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so this would need some more exploration to see if how this would be possible (without duplicating a lot), and some buy-in from people interested in this. >> >> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). > > I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. > >> I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case. > > The current prototype[1] accepts preserves both xarray and pandas data structures. > > [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > >> Joris >> >>> On Tue, 26 May 2020 at 09:50, Adrin wrote: >>> Hi Joris, >>> >>> Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. >>> There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >>> and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 >>> and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >>> >>> What do you think of these usecases? >>> >>> Thanks, >>> Adrin >>> >>>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche wrote: >>>> Hi list, >>>> >>>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here). >>>> >>>> But since the topic came up again recently at the last online dev meeting (and also Uwe Korn who wrote a nice blog post about this yesterday), I thought to do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks. 
>>>> >>>> >>>> >>>> Simplication of the internals >>>> >>>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) because right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks. >>>> Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks. >>>> Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation. >>>> >>>> Moreover, when going with all 1D blocks, we cannot only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will also be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc. >>>> >>>> I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays. >>>> >>>> >>>> >>>> Performance >>>> >>>> Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or on dataframes with more columns as rows) will always be faster when done on a 2D numpy array under the hood. >>>> However, based on recent experimentation with this (eg triggered by the block-wise frame ops PR, and see also some benchmarks I justed posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK. >>>> >>>> Further, there are also operations that will benefit from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, .. will benefit (e.g. a large part of the slowdown doing frame/frame operations column-wise is currently due to the consolidation in the end). And operations like adding a column, concatting (with axis=1) or merging dataframes will be much faster when no consolidation is needed. >>>> >>>> Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about: >>>> >>>> With limited effort optimizing the column-wise code paths in the internals, we can get a long way. >>>> After that, if needed, we can still consider if parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager). >>>> >>>> Possibility to get better copy/view semantics >>>> >>>> Pandas is badly known for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns. >>>> >>>> No consolidation = less copying. Regularly consolidating introduces copies, and thus removing consolidation will mean less copies. For example, this would enable that you can actually add a single column to a dataframe without having to copy to the full dataframe. >>>> >>>> Copy / view semantics Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. 
In the consolidated 2D block layout this will always be inherently messy, and unpredictable (meaning: depending on the actual block layout, which means in practice unpredictable for the user unaware of the block layout). >>>> Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this. >>>> >>>> >>>> >>>> So what are the reasons to have 2D blocks? >>>> >>>> I personally don't directly see reasons to have 2D blocks for pandas itself (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up. >>>> >>>> But I think there are certainly use cases where 2D blocks can be useful, but typically "external" (but nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. >>>> However, I personally think there are possible ways that we can still accommodate for those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. >>>> (but will stop here, as this mail is getting already long ..). >>>> >>>> >>>> Joris >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Tue May 26 10:13:44 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Tue, 26 May 2020 07:13:44 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> References: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> Message-ID: >> Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this wont just be one massive PR > [...] I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions in an agreed-upon roadmap. [...] I think understanding the difficulty/feasibility of the implementation is a pretty important part of the pros/cons. Looking back at #10556, I'm wondering if we could disable _most_ consolidation, e.g. only consolidate when making copies anyway, which might be a never-break-views policy. From a user standpoint would that achieve much/most of th benefits here? On Tue, May 26, 2020 at 5:17 AM Jeff Reback wrote: > A little historical perspective > > 10 years ago the standard input to a Dataframe was a single dtype 2D numpy > array. This provides the following nice properties: > > - 0 cost construction, you can simply wrap Dataframe around the input with > very little overhead. 
This provides a labeled array interface, gaining > pandas users > - very fast reductions; the block is passed to numpy directly for the > reductions; numpy can then reduce with aligned memory access > - almost all operations in pandas coerced to float64 on operations > > The block manager is optimized for this case as this was the original > DataMatrix. It serves its purpose pretty well. > > In the last few years things have changed in the following ways: > > - dict of 1D numpy arrays is by far the most common construction > - heterogenous dtypes have grown quite a bit, eg it?s now very common to > use int8, float32; these are also preserved pretty well by pandas > operations > - non numpy backed dtypes are increasingly common > > To me removing the block manager is not about performance, rather about > simplifying the code and mental model, though we should be mindful of > construction from 2D inputs will require splitting and thus be not cheap > (note that you can view the 1D slices but these are not memory aligned); > this is a typical trap that folks get into; 1D looks all rosy but it all > depends on usecase. > > I think it would be ok for pandas to move to dict of columns and simply > document the non performing cases (eg very wide single dtypes or 2D > construction); > > I suppose it?s also possible to reinvent the DataMatrix in a limited form > but that of course adds complexity and would like to see that after a > refactor. > > my 3c > > Jeff > > On May 26, 2020, at 7:22 AM, Tom Augspurger > wrote: > > > ? > > > On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Thanks for those links! >> >> Personally, I see the "roundtrip conversion to/from sparse matrices" a >> bit as in the same bucket as conversion to/from a 2D numpy array. >> Yes, both are important use cases. But the question we need to ask >> ourselves is still: is this important enough to hugely complicate the >> pandas' internals and block several other improvements? It's a trade-off >> that we need to make. >> >> Moreover, I think that we could accommodate the important part of those >> use cases also with a column-store DataFrame, with some effort (but with >> less complexity as a consolidated BlockManager). >> >> Focusing on scikit-learn: in the end, you mostly care about cheap >> roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame >> to carry feature labels in between steps of a pipeline, correct? >> Such cheap roundtripping is only possible anyway if you have a single >> dtype for all columns (which is typically the case after some >> transformation step). So you don't necessarily need consolidated blocks >> specifically, but rather the ability to store a *single* 2D array/matrix in >> a DataFrame (so kind of a single 2D block). >> >> Thinking out loud here, didn't try anything in code: >> >> - We could make the DataFrame construction from a 2D array/matrix kind of >> "lazy" (or have an option to do it like this): upon construction just store >> the 2D array as is, and only once you perform an actual operation on it, >> convert to a columnar store. And that would make it possible to still get >> the 2D array back with zero-copy, if all you did was passing this DataFrame >> to the next step of the pipeline. >> - We could take the above a step further and try to preserve the 2D array >> under the hood in some "easy" operations (but again, limited to a single 2D >> block/array, not multiple consolidated blocks). 
This is actually similar to >> the DataMatrix that pandas had a very long time ago. Of course this adds >> back complexity, so this would need some more exploration to see if how >> this would be possible (without duplicating a lot), and some buy-in from >> people interested in this. >> >> I think the first option should be fairly easy to do, and should solve a >> large part of the concerns for scikit-learn (I think?). >> > > I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > > >> I think the second idea is also interesting: IMO such a data structure >> would be useful to have somewhere in the PyData ecosystem, and a worthwhile >> discussion to think about where this could fit. Maybe the answer is simply: >> use xarray for this use case (although there are still differences) ? That >> are interesting discussions, but personally I would not complicate the core >> pandas data model for heterogeneous dataframes to accommodate the >> single-dtype + fixed number of columns use case. >> > > The current prototype[1] accepts preserves both xarray and pandas data > structures. > > [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > > >> Joris >> >> On Tue, 26 May 2020 at 09:50, Adrin wrote: >> >>> Hi Joris, >>> >>> Thanks for the summary. I think another missing point is the roundtrip >>> conversion to/from sparse matrices. >>> There are some benchmarks and discussion here; >>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >>> and here's some discussion on the pandas issue tracker: >>> https://github.com/pandas-dev/pandas/issues/33182 >>> and some benchmark by Tom, assuming pandas would accept a 2D sparse >>> array: >>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >>> >>> What do you think of these usecases? >>> >>> Thanks, >>> Adrin >>> >>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> Hi list, >>>> >>>> Rewriting the BlockManager based on a simpler collection of 1D-arrays >>>> is actually on our roadmap (see here >>>> ), >>>> and I also touched on it in a mailing list discussion about pandas 2.0 >>>> earlier this year (see here >>>> >>>> ). >>>> >>>> But since the topic came up again recently at the last online dev >>>> meeting (and also Uwe Korn who wrote a nice blog post >>>> about >>>> this yesterday), I thought to do a write-up of my thoughts on why I think >>>> we should actually move towards a simpler, non-consolidating BlockManager >>>> with 1D blocks. >>>> >>>> >>>> *Simplication of the internals* >>>> >>>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>>> because right now we have a lot of special cases for 1D EAs in the >>>> internals. But to be clear: the additional complexity does not come from 1D >>>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>>> blocks. >>>> Solving this would require a consistent block dimension, and thus >>>> removing this added complexity can be done in two ways: have all 1D blocks, >>>> or have all 2D blocks. >>>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>>> consolidation. >>>> >>>> Moreover, when going with all 1D blocks, we cannot only remove the >>>> added complexity from dealing with the mixture of 1D/2D blocks, we will >>>> *also* be able to reduce the complexity of dealing with 2D blocks. 
A >>>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>>> as one needs to deal with proper alignment of the blocks, a more complex >>>> "placement" logic of the blocks, etc. >>>> >>>> I think we would be able to simplify the internals a lot by going with >>>> a BlockManager as a store of 1D arrays. >>>> >>>> >>>> *Performance* >>>> >>>> Performance is typically given as a reason to have consolidated, 2D >>>> blocks. And of course, certain operations (especially row-wise operations, >>>> or on dataframes with more columns as rows) will always be faster when done >>>> on a 2D numpy array under the hood. >>>> However, based on recent experimentation with this (eg triggered by the >>>> block-wise frame ops PR >>>> , and see also some >>>> benchmarks I justed posted in #10556 >>>> >>>> / this gist >>>> ), >>>> I also think that for many operations and with decent-sized dataframes, >>>> this performance penalty is actually quite OK. >>>> >>>> Further, there are also operations that will *benefit* from 1D blocks. >>>> First, operations that now involve aligning/splitting blocks, >>>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>>> frame/frame operations column-wise is currently due to the consolidation in >>>> the end). And operations like adding a column, concatting (with axis=1) or >>>> merging dataframes will be much faster when no consolidation is needed. >>>> >>>> Personally, I am convinced that with some effort, we can get on-par or >>>> sometimes even better performance with 1D blocks compared to the >>>> performance we have now for those cases that 90+% of our users care about: >>>> >>>> - With limited effort optimizing the column-wise code paths in the >>>> internals, we can get a long way. >>>> - After that, if needed, we can still consider if parts of the >>>> internals could be cythonized to further improve certain bottlenecks (and >>>> actually cythonizing this will also be simpler for a simpler >>>> non-consolidating block manager). >>>> >>>> >>>> *Possibility to get better copy/view semantics* >>>> >>>> Pandas is badly known for how much it copies ("you need 10x the memory >>>> available as the size of your dataframe"), and having 1D blocks will allow >>>> us to address part of those concerns. >>>> >>>> *No consolidation = less copying.* Regularly consolidating introduces >>>> copies, and thus removing consolidation will mean less copies. For example, >>>> this would enable that you can actually add a single column to a dataframe >>>> without having to copy to the full dataframe. >>>> >>>> *Copy / view semantics* Recently there has been discussion again >>>> around whether selecting columns should be a copy or a view, and some other >>>> issues were opened with questions about views/copies when slicing columns. >>>> In the consolidated 2D block layout this will always be inherently messy, >>>> and unpredictable (meaning: depending on the actual block layout, which >>>> means in practice unpredictable for the user unaware of the block layout). >>>> Going with a non-consolidated BlockManager should at least allow us to >>>> get better / more understandable semantics around this. >>>> >>>> >>>> ------------------------------ >>>> >>>> *So what are the reasons to have 2D blocks?* >>>> >>>> I personally don't directly see reasons to have 2D blocks *for pandas >>>> itself* (apart from performance in certain row-wise use cases, and >>>> except for the fact that we have "always done it like this"). 
But quite >>>> likely I am missing reasons, so please bring them up. >>>> >>>> But I think there are certainly use cases where 2D blocks can be >>>> useful, but typically "external" (but nonetheless important) use cases: >>>> conversion to/from numpy, xarray, etc. A typical example that has recently >>>> come up is scikit-learn, where they want to have a cheap dataframe <-> >>>> numpy array roundtrip for use in their pipelines. >>>> However, I personally think there are possible ways that we can still >>>> accommodate for those use cases, with some effort, while still having 1D >>>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>>> complexity of 2D blocks in pandas. >>>> (but will stop here, as this mail is getting already long ..). >>>> >>>> Joris >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 15:34:52 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 21:34:52 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 13:21, Tom Augspurger wrote: > > On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> - We could make the DataFrame construction from a 2D array/matrix kind of >> "lazy" (or have an option to do it like this): upon construction just store >> the 2D array as is, and only once you perform an actual operation on it, >> convert to a columnar store. And that would make it possible to still get >> the 2D array back with zero-copy, if all you did was passing this DataFrame >> to the next step of the pipeline. >> >> I think the first option should be fairly easy to do, and should solve a >> large part of the concerns for scikit-learn (I think?). >> > > I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > > A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) In [2]: df._mgr_data Out[2]: (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=3, step=1)) And once you do something with the dataframe, such as printing or calculating something, the BlockManager gets only created at this step: In [3]: df Out[3]: Initializing !!! 
0 1 2 0 0.152972 -0.569205 0.554430 1 -1.099161 -1.163154 -1.510711 2 0.705185 -0.001530 1.542603 3 -0.460590 -0.385364 1.807601 In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) In [5]: df.mean() Initializing !!! Out[5]: 0 0.397243 1 0.269996 2 -0.454929 dtype: float64 There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what with pickling?, ...) But just to illustrate the idea. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Tue May 26 15:42:44 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 14:42:44 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for verifying the feasibility. Validation is a bit tricky, but I'd hope that we can delay everything except the splitting / forming of blocks. That may result in some non-obvious performance quirks, but at least of the simple case of `data` being an ndarray and index / columns not forcing any reindexing, I'm hopeful that it's not too bad. On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Tue, 26 May 2020 at 13:21, Tom Augspurger > wrote: > >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> - We could make the DataFrame construction from a 2D array/matrix kind >>> of "lazy" (or have an option to do it like this): upon construction just >>> store the 2D array as is, and only once you perform an actual operation on >>> it, convert to a columnar store. And that would make it possible to still >>> get the 2D array back with zero-copy, if all you did was passing this >>> DataFrame to the next step of the pipeline. >>> >>> I think the first option should be fairly easy to do, and should solve a >>> large part of the concerns for scikit-learn (I think?). >>> >> >> I think the first option would solve that use case for scikit-learn. It >> sounds feasible, but I'm not sure how easy it would be. >> >> > A quick, ugly proof-of-concept: > https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [2]: df._mgr_data > Out[2]: > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > RangeIndex(start=0, stop=4, step=1), > RangeIndex(start=0, stop=3, step=1)) > > And once you do something with the dataframe, such as printing or > calculating something, the BlockManager gets only created at this step: > > In [3]: df > Out[3]: Initializing !!! > > 0 1 2 > 0 0.152972 -0.569205 0.554430 > 1 -1.099161 -1.163154 -1.510711 > 2 0.705185 -0.001530 1.542603 > 3 -0.460590 -0.385364 1.807601 > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [5]: df.mean() > Initializing !!! 
> Out[5]: > 0 0.397243 > 1 0.269996 > 2 -0.454929 > dtype: float64 > > There are of course many things missing (validation of the input to > init_lazy, potentially being able to access df.index/df.columns without > initializing the block manager, hooking this up in __array__, what with > pickling?, ...) > But just to illustrate the idea. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 15:44:43 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 21:44:43 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> Message-ID: On Tue, 26 May 2020 at 16:14, Brock Mendel wrote: > >> Assuming we go down this path, do you have an idea of how we get from > here to there incrementally? i.e. presumably this wont just be one massive > PR > > [...] I would first like to focus on the "assuming we go down this > path" part. Let's discuss the pros and cons and trade-offs, and try to turn > assumptions in an agreed-upon roadmap. [...] > > I think understanding the difficulty/feasibility of the implementation is > a pretty important part of the pros/cons. > That's true. Personally I think there are enough options to do it to not have to worry about the "how" too much, but for sure it will be a lot of work to do it properly (so rather the "who is going to do this"). > Looking back at #10556, I'm wondering if we could disable _most_ > consolidation, e.g. only consolidate when making copies anyway, which might > be a never-break-views policy. From a user standpoint would that achieve > much/most of th benefits here? > That could certainly alleviate some of the drawbacks of the consolidated BlockManager regarding its copying behaviour (but not necessarily regarding the transparency / understandability of it, I would say). But for example for the "complexity of the internals" argument, I think this would rather make it worse. Now, you at least know (after ensuring consolidation) that you have only a single block for a certain dtype. Still having many, potentially-but-not-always consolidated 2D blocks will make it more difficult to optimize the situation of non-consolidated / 1D blocks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue May 26 15:48:39 2020 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 26 May 2020 14:48:39 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Something to add here (in favor of removing the BM) -- and apologies if it's already mentioned in a different form: It is very, very difficult for third party code to construct heterogeneously-typed DataFrames without triggering a memory doubling. To give you an example what I mean, in Apache Arrow, we painstakingly implemented block consolidation in C++ [1] so that we can construct a DataFrame that won't suddenly double memory the first time that a user interacts with it. So the possibility of users having an OOM on their first interaction with an object they created is not great. If avoiding it for library developers were easy then perhaps it would be less of an issue, but avoiding the doubling requires advanced knowledge of pandas's internals. 
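As a concrete sketch of that doubling (behavior as observed with the consolidating BlockManager in the pandas 1.x line; the exact copying rules can vary between versions):

import numpy as np
import pandas as pd

a = np.arange(1_000_000, dtype="float64")
b = np.arange(1_000_000, dtype="float64")

# With a consolidating BlockManager the two float64 columns end up in a single
# 2D block, so the input arrays are copied at construction time: for a moment
# both the originals and the consolidated block are alive.
df = pd.DataFrame({"a": a, "b": b})

print(np.shares_memory(a, df["a"].to_numpy()))   # False: the data was copied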
Looking back 9-10 years, the primary motivations I had for creating the BlockManager in the first place don't persuade me anymore: * pandas's success was still very much coupled to vectorized operations on wide row-major data (e.g. as present in certain sectors of the financial industry). I don't think this represents the majority of pandas users now * In 2011 I was uncomfortable writing significant compiled code. Many of the performance issues that the BM tried to ameliorate are non-issues if you're OK writing non-trivial C/C++ code to deal with row-level interactions. Even if there were a 50% performance regression on some of these operations that are faster with 2D blocks because of row-major vs. column-major memory layout, that still seems worth it for the vast code simplification and the memory-use-predictability benefits that others have articulated already. - Wes [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche wrote: > > On Tue, 26 May 2020 at 13:21, Tom Augspurger wrote: >> >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche wrote: >>> >>> - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. >>> >>> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). >> >> >> I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. >> > > A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [2]: df._mgr_data > Out[2]: > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > RangeIndex(start=0, stop=4, step=1), > RangeIndex(start=0, stop=3, step=1)) > > And once you do something with the dataframe, such as printing or calculating something, the BlockManager gets only created at this step: > > In [3]: df > Out[3]: Initializing !!! > > 0 1 2 > 0 0.152972 -0.569205 0.554430 > 1 -1.099161 -1.163154 -1.510711 > 2 0.705185 -0.001530 1.542603 > 3 -0.460590 -0.385364 1.807601 > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [5]: df.mean() > Initializing !!! > Out[5]: > 0 0.397243 > 1 0.269996 > 2 -0.454929 > dtype: float64 > > There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what with pickling?, ...) > But just to illustrate the idea. 
> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From jbrockmendel at gmail.com Tue May 26 16:49:41 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Tue, 26 May 2020 13:49:41 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: > It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: This is a neat proof of concept, but it cuts against the "decreases complexity" argument. Is there a viable way to quantify (even very roughly) the complexity effect of going all-1D? A couple ideas for ways to simplify this decision-making problem: 1) ATM there are a handful of places outside of core.internals where we call consolidate/consolidate_inplace. If we can refactor those away, we can focus on the BlockManager in (closer-to-)isolation. 2) IIUC going all-1D will cause column indexing to always return views. Elsewhere you have noted that this is a breaking API change which merited discussion in its own right. xref #33780 . My takeaway from this part of the last dev call was that people were generally positive on the all-views idea, but were wary of how to handle the potential deprecation. On Tue, May 26, 2020 at 12:49 PM Wes McKinney wrote: > Something to add here (in favor of removing the BM) -- and apologies > if it's already mentioned in a different form: > > It is very, very difficult for third party code to construct > heterogeneously-typed DataFrames without triggering a memory doubling. > To give you an example what I mean, in Apache Arrow, we painstakingly > implemented block consolidation in C++ [1] so that we can construct a > DataFrame that won't suddenly double memory the first time that a user > interacts with it. So the possibility of users having an OOM on their > first interaction with an object they created is not great. If > avoiding it for library developers were easy then perhaps it would be > less of an issue, but avoiding the doubling requires advanced > knowledge of pandas's internals. > > Looking back 9-10 years, the primary motivations I had for creating > the BlockManager in the first place don't persuade me anymore: > > * pandas's success was still very much coupled to vectorized > operations on wide row-major data (e.g. as present in certain sectors > of the financial industry). I don't think this represents the majority > of pandas users now > * In 2011 I was uncomfortable writing significant compiled code. Many > of the performance issues that the BM tried to ameliorate are > non-issues if you're OK writing non-trivial C/C++ code to deal with > row-level interactions. Even if there were a 50% performance > regression on some of these operations that are faster with 2D blocks > because of row-major vs. column-major memory layout, that still seems > worth it for the vast code simplification and the > memory-use-predictability benefits that others have articulated > already. 
> > - Wes > > [1]: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc > > On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche > wrote: > > > > On Tue, 26 May 2020 at 13:21, Tom Augspurger > wrote: > >> > >> > >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>> > >>> - We could make the DataFrame construction from a 2D array/matrix kind > of "lazy" (or have an option to do it like this): upon construction just > store the 2D array as is, and only once you perform an actual operation on > it, convert to a columnar store. And that would make it possible to still > get the 2D array back with zero-copy, if all you did was passing this > DataFrame to the next step of the pipeline. > >>> > >>> I think the first option should be fairly easy to do, and should solve > a large part of the concerns for scikit-learn (I think?). > >> > >> > >> I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > >> > > > > A quick, ugly proof-of-concept: > https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > > > In [2]: df._mgr_data > > Out[2]: > > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > > RangeIndex(start=0, stop=4, step=1), > > RangeIndex(start=0, stop=3, step=1)) > > > > And once you do something with the dataframe, such as printing or > calculating something, the BlockManager gets only created at this step: > > > > In [3]: df > > Out[3]: Initializing !!! > > > > 0 1 2 > > 0 0.152972 -0.569205 0.554430 > > 1 -1.099161 -1.163154 -1.510711 > > 2 0.705185 -0.001530 1.542603 > > 3 -0.460590 -0.385364 1.807601 > > > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > > > In [5]: df.mean() > > Initializing !!! > > Out[5]: > > 0 0.397243 > > 1 0.269996 > > 2 -0.454929 > > dtype: float64 > > > > There are of course many things missing (validation of the input to > init_lazy, potentially being able to access df.index/df.columns without > initializing the block manager, hooking this up in __array__, what with > pickling?, ...) > > But just to illustrate the idea. > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom.augspurger88 at gmail.com Tue May 26 16:58:17 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 15:58:17 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, May 26, 2020 at 3:50 PM Brock Mendel wrote: > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > This is a neat proof of concept, but it cuts against the "decreases > complexity" argument. Is there a viable way to quantify (even very > roughly) the complexity effect of going all-1D? > That complexity is at least localized to a single attribute. That's quite different from the 1D & 2D blocks situation, where many methods (though fewer than a year ago) need to be concerned with whether the array in a block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... > A couple ideas for ways to simplify this decision-making problem: > > 1) ATM there are a handful of places outside of core.internals where we > call consolidate/consolidate_inplace. If we can refactor those away, we > can focus on the BlockManager in (closer-to-)isolation. > If possible, isolating consolidation to `core.internals` sounds like a generally useful cleanup, regardless of whether we pursue the larger changes. > 2) IIUC going all-1D will cause column indexing to always return views. > Elsewhere you have noted that this is a breaking API change which merited > discussion in its own right. xref #33780 > . My takeaway from > this part of the last dev call was that people were generally positive on > the all-views idea, but were wary of how to handle the potential > deprecation. > This type of change would merit a major version bump. If possible, we'd ideally have some kind of option to disable consolidation / enable splitting, which would allow for users to test their code on older versions. > On Tue, May 26, 2020 at 12:49 PM Wes McKinney wrote: > >> Something to add here (in favor of removing the BM) -- and apologies >> if it's already mentioned in a different form: >> >> It is very, very difficult for third party code to construct >> heterogeneously-typed DataFrames without triggering a memory doubling. >> To give you an example what I mean, in Apache Arrow, we painstakingly >> implemented block consolidation in C++ [1] so that we can construct a >> DataFrame that won't suddenly double memory the first time that a user >> interacts with it. So the possibility of users having an OOM on their >> first interaction with an object they created is not great. If >> avoiding it for library developers were easy then perhaps it would be >> less of an issue, but avoiding the doubling requires advanced >> knowledge of pandas's internals. >> >> Looking back 9-10 years, the primary motivations I had for creating >> the BlockManager in the first place don't persuade me anymore: >> >> * pandas's success was still very much coupled to vectorized >> operations on wide row-major data (e.g. as present in certain sectors >> of the financial industry). I don't think this represents the majority >> of pandas users now >> * In 2011 I was uncomfortable writing significant compiled code. Many >> of the performance issues that the BM tried to ameliorate are >> non-issues if you're OK writing non-trivial C/C++ code to deal with >> row-level interactions. 
Even if there were a 50% performance >> regression on some of these operations that are faster with 2D blocks >> because of row-major vs. column-major memory layout, that still seems >> worth it for the vast code simplification and the >> memory-use-predictability benefits that others have articulated >> already. >> >> - Wes >> >> [1]: >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc >> >> On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche >> wrote: >> > >> > On Tue, 26 May 2020 at 13:21, Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> >> >> >> >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> >> >>> - We could make the DataFrame construction from a 2D array/matrix >> kind of "lazy" (or have an option to do it like this): upon construction >> just store the 2D array as is, and only once you perform an actual >> operation on it, convert to a columnar store. And that would make it >> possible to still get the 2D array back with zero-copy, if all you did was >> passing this DataFrame to the next step of the pipeline. >> >>> >> >>> I think the first option should be fairly easy to do, and should >> solve a large part of the concerns for scikit-learn (I think?). >> >> >> >> >> >> I think the first option would solve that use case for scikit-learn. >> It sounds feasible, but I'm not sure how easy it would be. >> >> >> > >> > A quick, ugly proof-of-concept: >> https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 >> > >> > It allows to create a "DataFrame" from an ndarray without creating a >> BlockManager, and it allows accessing this original ndarray: >> > >> > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), >> (pd.RangeIndex(4), pd.RangeIndex(3))) >> > >> > In [2]: df._mgr_data >> > Out[2]: >> > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], >> > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], >> > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], >> > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), >> > RangeIndex(start=0, stop=4, step=1), >> > RangeIndex(start=0, stop=3, step=1)) >> > >> > And once you do something with the dataframe, such as printing or >> calculating something, the BlockManager gets only created at this step: >> > >> > In [3]: df >> > Out[3]: Initializing !!! >> > >> > 0 1 2 >> > 0 0.152972 -0.569205 0.554430 >> > 1 -1.099161 -1.163154 -1.510711 >> > 2 0.705185 -0.001530 1.542603 >> > 3 -0.460590 -0.385364 1.807601 >> > >> > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), >> (pd.RangeIndex(4), pd.RangeIndex(3))) >> > >> > In [5]: df.mean() >> > Initializing !!! >> > Out[5]: >> > 0 0.397243 >> > 1 0.269996 >> > 2 -0.454929 >> > dtype: float64 >> > >> > There are of course many things missing (validation of the input to >> init_lazy, potentially being able to access df.index/df.columns without >> initializing the block manager, hooking this up in __array__, what with >> pickling?, ...) >> > But just to illustrate the idea. 
>> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 27 15:57:28 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 27 May 2020 21:57:28 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 23:00, Tom Augspurger wrote: > > On Tue, May 26, 2020 at 3:50 PM Brock Mendel > wrote: > >> > It allows to create a "DataFrame" from an ndarray without creating a >> BlockManager, and it allows accessing this original ndarray: >> >> This is a neat proof of concept, but it cuts against the "decreases >> complexity" argument. Is there a viable way to quantify (even very >> roughly) the complexity effect of going all-1D? >> > > That complexity is at least localized to a single attribute. That's quite > different from the 1D & 2D blocks situation, where many methods (though > fewer than a year ago) need to be concerned with whether the array in a > block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... > > I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager. Furthermore: it's targeted to a very specific and limited use case (and eg also doesn't need to be the default, I think). Now, exactly quantifying the effect of going all-1D, that's of course hard. But just one example: all code that deals with blknos/blklocs (the mapping between the position in the consolidated blocks and the position in the dataframe), which is a significant part of managers.py, could be simplified considerably. But anyway: I think it clear that a BlockManager with only 1D arrays/blocks *can* be simpler as one with interleaved/consolidated blocks. But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it. > A couple ideas for ways to simplify this decision-making problem: >> > > >> 2) IIUC going all-1D will cause column indexing to always return views. >> Elsewhere you have noted that this is a breaking API change which merited >> discussion in its own right. xref #33780 >> . My takeaway from >> this part of the last dev call was that people were generally positive on >> the all-views idea, but were wary of how to handle the potential >> deprecation. >> > > This type of change would merit a major version bump. If possible, we'd > ideally have some kind of option to disable consolidation / enable > splitting, which would allow for users to test their code on older versions. > Yes, going to an all-1D-BlockManager would be something for a major version bump, eg pandas 2.0. So I think that is the perfect opportunity to do such a change of making column selections always views. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jbrockmendel at gmail.com Wed May 27 17:07:41 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Wed, 27 May 2020 14:07:41 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: > I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager Not on its own, no. But my prior is that this isn't the last thing that will merit its own special case. > I think it clear that a BlockManager with only 1D arrays/blocks *can* be simpler as one with interleaved/consolidated blocks. Absolutely agree. I've spent a big chunk of the last year dealing with BlockManager code and have no great love for it. > But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it. The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model. The main downside is potential performance degradation (at the extreme end e.g. 3000x for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction. I am looking for ways to quantify these tradeoffs so we can make an informed decision. On Wed, May 27, 2020 at 12:57 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Tue, 26 May 2020 at 23:00, Tom Augspurger > wrote: > >> >> On Tue, May 26, 2020 at 3:50 PM Brock Mendel >> wrote: >> >>> > It allows to create a "DataFrame" from an ndarray without creating a >>> BlockManager, and it allows accessing this original ndarray: >>> >>> This is a neat proof of concept, but it cuts against the "decreases >>> complexity" argument. Is there a viable way to quantify (even very >>> roughly) the complexity effect of going all-1D? >>> >> >> That complexity is at least localized to a single attribute. That's quite >> different from the 1D & 2D blocks situation, where many methods (though >> fewer than a year ago) need to be concerned with whether the array in a >> block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... >> >> > I don't think this "lazy _mgr attribute" is comparable in complexity with > the consolidated BlockManager. Furthermore: it's targeted to a very > specific and limited use case (and eg also doesn't need to be the default, > I think). > Now, exactly quantifying the effect of going all-1D, that's of course > hard. But just one example: all code that deals with blknos/blklocs (the > mapping between the position in the consolidated blocks and the position in > the dataframe), which is a significant part of managers.py, could be > simplified considerably. > > But anyway: I think it clear that a BlockManager with only 1D > arrays/blocks *can* be simpler as one with interleaved/consolidated > blocks. But this is also only one of the arguments. Complexity alone is not > a reason to not do something; it's the general trade-off with what you gain > or lose with it. > > >> A couple ideas for ways to simplify this decision-making problem: >>> >> >> >>> 2) IIUC going all-1D will cause column indexing to always return views. >>> Elsewhere you have noted that this is a breaking API change which merited >>> discussion in its own right. xref #33780 >>> . 
My takeaway from >>> this part of the last dev call was that people were generally positive on >>> the all-views idea, but were wary of how to handle the potential >>> deprecation. >>> >> >> This type of change would merit a major version bump. If possible, we'd >> ideally have some kind of option to disable consolidation / enable >> splitting, which would allow for users to test their code on older versions. >> > > Yes, going to an all-1D-BlockManager would be something for a major > version bump, eg pandas 2.0. So I think that is the perfect opportunity to > do such a change of making column selections always views. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 27 17:15:32 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 27 May 2020 23:15:32 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Wed, 27 May 2020 at 23:07, Brock Mendel wrote: > > I don't think this "lazy _mgr attribute" is comparable in complexity > with the consolidated BlockManager > > Not on its own, no. But my prior is that this isn't the last thing that > will merit its own special case. > > > I think it clear that a BlockManager with only 1D arrays/blocks *can* be > simpler as one with interleaved/consolidated blocks. > > Absolutely agree. I've spent a big chunk of the last year dealing with > BlockManager code and have no great love for it. > > > But this is also only one of the arguments. Complexity alone is not a > reason to not do something; it's the general trade-off with what you gain > or lose with it. > > The main upsides I see are a) internal complexity reduction, b) downstream > library upsides, c) clearer view vs copy semantics, d) perf improvements > from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end > e.g. 3000x for > arithmetic). As Wes commented some of that can be ameliorated with > compiled code but that cuts against the complexity reduction. > That number is not correct. That was comparing the block-wise operation to a very inefficient convert-each-column-to-a-series operation. We can optimize this column-wise operation a lot (as I already did on master for some cases), and then a slowdown will still be present in such extreme cases, but *much* less. > > I am looking for ways to quantify these tradeoffs so we can make an > informed decision. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Fri May 29 12:37:19 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Fri, 29 May 2020 09:37:19 -0700 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 Message-ID: This is a discussion of what it would take to support non-nanosecond datetime64/timedelta64 dtypes and what decisions would need to be made along the way. The implementation would probably consist of: - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype - for timezone-related methods: - short-term: cast to nanosecond, use existing code, cast back to other unit - longer-term: update existing code to support non-nano units directly - comb through the code for all the places where we implicitly assume nano units and update - tests, so, so many tests We could then consider de-duplication. 
Tick is already redundant with Timedelta, and Timestamp[H] would render Period[H] redundant. With appropriate deprecation cycle, we could rip out a bunch of code. Another possibility is to try to upstream some code to numpy, which they have recently been receptive to (#16266 , #16363 , #16364 , #16352 , #16195 ). @rgommers tells me that trying to implement a tz-aware datetime64 dtype in numpy would be "folly, that way madness lies", but that it might be more feasible once @seberg's dtype refactor lands. More realistically short-term, if we convinced numpy to update NPY_DATETIMEUNIT to include the anchored quarter/year/week units we use for Period, we could condense a lot of confusing enum-like code. Tangentially related: with zoneinfo (PEP 615) we should consider making those our canonical tzinfos and converting any dateutil/pytz tzinfos we encounter to those. They are implemented in C, so I'm _hopeful_ we can make some of our vectorized tzconversion code unnecessary. @pganssle has suggested we implement our own tzinfos, but I'm holding out hope we can keep that upstream. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 29 13:34:01 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 29 May 2020 19:34:01 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Wed, 27 May 2020 at 23:07, Brock Mendel wrote: > > The main upsides I see are a) internal complexity reduction, b) downstream > library upsides, c) clearer view vs copy semantics, d) perf improvements > from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end > e.g. 3000x for > arithmetic). As Wes commented some of that can be ameliorated with > compiled code but that cuts against the complexity reduction. > > I am looking for ways to quantify these tradeoffs so we can make an > informed decision. > > Can you try to explain a bit more what kind of quantification you are looking for? - Complexity: I think we agree a non-consolidating block manager *can* be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number. - Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this probably also something that might only show effect after a refactor / implementation of new semantics. - Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 a few days ago). However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations. I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. 
My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations. But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Fri May 29 15:03:27 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 29 May 2020 14:03:27 -0500 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: Thanks for the update. On Fri, May 29, 2020 at 11:37 AM Brock Mendel wrote: > This is a discussion of what it would take to support non-nanosecond > datetime64/timedelta64 dtypes and what decisions would need to be made > along the way. > > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back to > other unit > Will this cause issues if the original datetime isn't in the bounds of a ns-precision timestamp? > - longer-term: update existing code to support non-nano units directly > - comb through the code for all the places where we implicitly assume nano > units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant with > Timedelta, and Timestamp[H] would render Period[H] redundant. With > appropriate deprecation cycle, we could rip out a bunch of code. > What would the user facing changes that warrant deprecation? For me, `Period` represents a span of time. It would make sense to implement something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", freq="H")`. But something checking whether that timestamp is in a `Timestamp[H]` doesn't seem natural, since it represents a point in time rather than a span. > Another possibility is to try to upstream some code to numpy, which they > have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells me that > trying to implement a tz-aware datetime64 dtype in numpy would be "folly, > that way madness lies", but that it might be more feasible once @seberg's > dtype refactor lands. More realistically short-term, if we convinced numpy > to update NPY_DATETIMEUNIT to include the anchored quarter/year/week units > we use for Period, we could condense a lot of confusing enum-like code. > Great to see this being pushed upstream! > Tangentially related: with zoneinfo (PEP 615) we should consider making > those our canonical tzinfos and converting any dateutil/pytz tzinfos we > encounter to those. They are implemented in C, so I'm _hopeful_ we can > make some of our vectorized tzconversion code unnecessary. @pganssle has > suggested we implement our own tzinfos, but I'm holding out hope we can > keep that upstream. > I'd be happy to see this as well, though implementing it in a way that's compatible with older Pythons seems a bit tricky. Perhaps we get the building blocks in place and then require it once we require Python 3.10+? 
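For reference, zoneinfo is in the standard library from Python 3.9, and the backport mentioned below covers older versions, so the building blocks can already be used today behind a conditional import (a sketch, not actual pandas code):

try:
    from zoneinfo import ZoneInfo  # standard library on Python 3.9+
except ImportError:
    from backports.zoneinfo import ZoneInfo  # PEP 615 backport for older Pythons

from datetime import datetime

dt = datetime(2020, 1, 1, tzinfo=ZoneInfo("America/New_York"))
print(dt.utcoffset())  # -1 day, 19:00:00, i.e. UTC-05:00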
> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul at ganssle.io Fri May 29 15:31:54 2020 From: paul at ganssle.io (Paul Ganssle) Date: Fri, 29 May 2020 15:31:54 -0400 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: <28577046-9ff1-1cd9-d64a-401962139200@ganssle.io> > Tangentially related: with zoneinfo (PEP 615) we should consider > making those our canonical tzinfos and converting any > dateutil/pytz tzinfos we encounter to those.? They are implemented > in C, so I'm _hopeful_ we can make some of our vectorized > tzconversion?code unnecessary.??@pganssle has suggested we > implement our own tzinfos, but I'm holding out hope we can keep > that upstream. > > > I'd be happy to see this as well, though implementing it in a way > that's compatible with older Pythons seems a bit tricky. Perhaps we > get the building blocks in place and then require it once we require > Python 3.10+? The reference implementation for PEP 615 has been converted to a backport for Python 3.6+ , so as long as you're willing to take on the dependency on the backport (which depends only on things that are in the standard library in Python 3.8+, but has transitive dependencies on backports on Python 3.6 and 3.7), you can just use that. The trickier thing, to me, is that there is a somewhat contrived workflow but definitely not an /implausible/ one, that would be broken by switching away from pytz. If someone constructs an aware Timestamp/Series/etc, then uses the tz attribute to get a time zone they can use for other stuff, they should currently be using the `localize`/`normalize` functions, like so: >>> from datetime import datetime >>> import pandas as pd >>> ts = pd.Timestamp("2020-01-01", tz="America/New_York") >>> ts.tz >>> ts.tz.localize(datetime.now()) datetime.datetime(2020, 5, 29, 15, 16, 56, 376299, tzinfo=) This is a pytz-specific idiom and won't work for zoneinfo or dateutil zones, but it may inadvertently be part of your public API, so it's up to you whether to consider it part of the public interface. In that case, I think the decision should be between a hard break and having `.tz` return a wrapper class that tries to more or less do what `pytz` does if you call `localize`/`normalize` with it. Best, Paul P.S. To clarify my position on "you should implement your own tzinfos": I think you should /start/ with adding support for generic time zones (not digging around into the internals to try to get speed improvements as is currently done) and see if zoneinfo is dramatically slower. If it is and you care about that, then a custom vectorized time zone should be the way to go. On the plus side, it's easy to test these things that are "rewritten for performance reasons only" using property testing; with the right set of property tests you can still get some of the "with enough eyes all bugs are shallow" benefits of using a standard library module. On 5/29/20 3:03 PM, Tom Augspurger wrote: > Thanks for the update. > > On Fri, May 29, 2020 at 11:37 AM Brock Mendel > wrote: > > This is a discussion of what it would take to support > non-nanosecond datetime64/timedelta64 dtypes and what decisions > would need to be made along the way. 
> > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back > to other unit > > > Will this cause issues if the original datetime isn't in the bounds of > a ns-precision timestamp? > > > - longer-term: update existing code to support non-nano units > directly > - comb through the code for all the places where we implicitly > assume nano units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant > with Timedelta, and Timestamp[H] would render Period[H] > redundant. With appropriate deprecation cycle, we could rip out a > bunch of code. > > > What would the user facing changes that warrant deprecation? For me, > `Period` represents a span of time. It would make sense to implement > something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", > freq="H")`. But something checking whether that timestamp is in a > `Timestamp[H]` doesn't seem natural, since it represents a point in > time rather than a span. > > > Another possibility is to try to upstream some code to numpy, > which they have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells > me that trying to implement a tz-aware datetime64 dtype in numpy > would be "folly, that way madness lies", but that it might be more > feasible once @seberg's dtype refactor lands. More realistically > short-term, if we convinced numpy to update NPY_DATETIMEUNIT to > include the anchored quarter/year/week units > we use for Period, we could condense a lot of confusing enum-like code. > > > Great to see this being pushed upstream! > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: From maartenb at xs4all.nl Fri May 29 14:31:44 2020 From: maartenb at xs4all.nl (Maarten Ballintijn) Date: Fri, 29 May 2020 14:31:44 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Hi Joris, You said: > But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other areas where storing data for 1000's of elements (sensors, items, people) on grid of time scales of minutes or more. (n*1000 x m*1000 data with n, m ~ 10 .. 100) Why do you think this use case is no longer important? We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to improve in this area not slide back.
Have a great weekend, Maarten > On May 29, 2020, at 1:34 PM, Joris Van den Bossche wrote: > > On Wed, 27 May 2020 at 23:07, Brock Mendel > wrote: > > The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end e.g. 3000x for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction. > > I am looking for ways to quantify these tradeoffs so we can make an informed decision. > > Can you try to explain a bit more what kind of quantification you are looking for? > > - Complexity: I think we agree a non-consolidating block manager can be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number. > - Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this probably also something that might only show effect after a refactor / implementation of new semantics. > - Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 a few days ago). > > However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations. > > I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. > My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations. > > But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). > > Joris > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Sat May 30 15:03:41 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 30 May 2020 21:03:41 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: Hi Maarten, Thanks a lot for the feedback! 
On Fri, 29 May 2020 at 20:31, Maarten Ballintijn wrote: > > Hi Joris, > > You said: > > But I also deliberately choose a dataframe where n_rows >> n_columns, > because I personally would be fine if operations on wide dataframes (n_rows > < n_columns) show a slowdown. But that is of course something to discuss / > agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we > care about a performance degradation?). > > > This is an (the) important use case for us and probably for a lot of use > in finance in general. I can easily imagine many other > areas where storing data for 1000?s of elements (sensors, items, people) > on grid of time scales of minutes or more. > (n*1000 x m*1000 data with n, m ~ 10 .. 100) > > Why do you think this use case is no longer important? > To be clear up front: I think wide dataframes are still an important use case. But to put my comment from above in more context: we had a performance regression reported (#24990 , which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. And yes, for *such* a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, *if* it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc. But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook for some quick experiments). And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having a simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case. Now, it is always difficult to make such claims in the abstract. So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with (or using real data, if openly available, and a few typical operations you do on those). Best, Joris > > We already have to drop into numpy on occasion to make the performance > sufficient. I would really prefer for Pandas to > improve in this area not slide back. > > Have a great weekend, > Maarten > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Sat May 30 15:17:56 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 30 May 2020 21:17:56 +0200 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: Thanks for starting this discussion, Brock! On Fri, 29 May 2020 at 21:03, Tom Augspurger wrote: > On Fri, May 29, 2020 at 11:37 AM Brock Mendel > wrote: > >> >> We could then consider de-duplication. Tick is already redundant with >> Timedelta, and Timestamp[H] would render Period[H] redundant. With >> appropriate deprecation cycle, we could rip out a bunch of code. >> > > What would the user facing changes that warrant deprecation? For me, > `Period` represents a span of time. 
It would make sense to implement > something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", > freq="H")`. But something checking whether that timestamp is in a > `Timestamp[H]` doesn't seem natural, since it represents a point in time > rather than a span. > > Personally, I don't think we necessarily need to add all units that are supported by numpy's datetime64/timedelta64 dtypes. First, because I don't think it is an important use case (people mostly want to be able to have dates outside of the range limits that nanosecond resolution gives us), and also because it makes it conceptually a lot more difficult. For example, what is a "Timestamp[H]" value? Does it represent the beginning or the end of the hour? Those are questions that are already handled by the Period dtype, and I think it is a good thing to keep those concepts separated (you can of course ask the same question with a millisecond resolution, but I think generally people don't do that). Further, all the resolutions from nanosecond up to second are "just" multiplications x1000, keeping the implementation simpler (compared to resolutions of hours, months, ..). So for a timestamp dtype, we could maybe only support ns / µs / ms / s resolutions? -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Sat May 30 17:54:59 2020 From: adrin.jalali at gmail.com (Adrin) Date: Sat, 30 May 2020 23:54:59 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: Although 1 x 5000 may sound like an edge case, my whole 4 years of research was on 500 x 450000 data. Those use cases are probably more common than we may think. On Sat., May 30, 2020, 21:03 Joris Van den Bossche, < jorisvandenbossche at gmail.com> wrote: > Hi Maarten, > > Thanks a lot for the feedback! > > On Fri, 29 May 2020 at 20:31, Maarten Ballintijn > wrote: > >> >> Hi Joris, >> >> You said: >> >> But I also deliberately choose a dataframe where n_rows >> n_columns, >> because I personally would be fine if operations on wide dataframes (n_rows >> < n_columns) show a slowdown. But that is of course something to discuss / >> agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we >> care about a performance degradation?). >> >> >> This is an (the) important use case for us and probably for a lot of use >> in finance in general. I can easily imagine many other >> areas where storing data for 1000's of elements (sensors, items, people) >> on grid of time scales of minutes or more. >> (n*1000 x m*1000 data with n, m ~ 10 .. 100) >> >> Why do you think this use case is no longer important? >> > > To be clear up front: I think wide dataframes are still an important use > case. > > But to put my comment from above in more context: we had a performance > regression reported (#24990 > , which Brock > referenced in his last mail) which was about a DataFrame with 1 row and > 5000 columns. > And yes, for *such* a case, I think it will basically be impossible to > preserve exact performance, even with a lot of optimizations, compared to > storing this as a single, consolidated (1, 5000) array as is done now.
And > it is for such a case, that I indeed say: I am willing to accept a limited > slowdown for this, *if* it at the same time gives us improved memory > usage, performance improvements for more common cases, simplified internals > making it easier to contribute to and further optimize pandas, etc. > > But, I am also quite convinced that, with some optimization effort, we can > at least preserve the current performance even for relatively wide > dataframes (see eg this > > notebook > > for some quick experiments). > And to be clear: doing such optimizations to ensure good performance for a > variety of use cases is part of the proposal. Also, I think that having a > simplified pandas internals should actually also make it easier to further > explore ways to specifically optimize the "homogeneous-dtype wide > dataframe" use case. > > Now, it is always difficult to make such claims in the abstract. > So what I personally think would be very valuable, is if you could give > some example use cases that you care about (eg a notebook creating some > dummy data with similar characteristics as the data you are working with > (or using real data, if openly available, and a few typical operations you > do on those). > > Best, > Joris > > >> >> We already have to drop into numpy on occasion to make the performance >> sufficient. I would really prefer for Pandas to >> improve in this area not slide back. >> >> Have a great weekend, >> Maarten >> >> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Sat May 30 18:39:07 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sat, 30 May 2020 17:39:07 -0500 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: (sfid-20200529_183734_278427_78B1B51E) References: (sfid-20200529_183734_278427_78B1B51E) Message-ID: On Fri, 2020-05-29 at 09:37 -0700, Brock Mendel wrote: > This is a discussion of what it would take to support non-nanosecond > datetime64/timedelta64 dtypes and what decisions would need to be > made > along the way. > > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back to > other > unit > - longer-term: update existing code to support non-nano units > directly > - comb through the code for all the places where we implicitly assume > nano > units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant with > Timedelta, and Timestamp[H] would render Period[H] redundant. With > appropriate deprecation cycle, we could rip out a bunch of code. > > Another possibility is to try to upstream some code to numpy, which > they > have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells me > that > trying to implement a tz-aware datetime64 dtype in numpy would be > "folly, > that way madness lies", but that it might be more feasible once > @seberg's > dtype refactor lands. Timezones do seem like too much complexity to add to numpy. And with the dtype refactor, new dtypes should hopefully soon not actually be required to live within NumPy. The more likely discussion would be to go the opposite direction :).
Since: np.array([datetime.datetime(2019, 1, 1)]) gives an object array, NumPy datetimes should not have any long-term advantage over an externally developed datetime (except living in the prominent numpy namespace). Having a new datetime dtype external to NumPy and with tz-info indeed seems very desirable. And I would be happy to have you in the loop, so we could maybe even use it as an early test balloon by including it as a test in NumPy, with the idea to later cut it out as a stand-alone package. But that would be mostly useful if you are excited about getting a small head-start. In the end, it would likely help me/NumPy more than you in terms of time-investment. > More realistically short-term, if we convinced numpy > to update NPY_DATETIMEUNIT to include the anchored quarter/year/week > units > we use for Period, we could condense a lot of confusing enum-like > code. On first sight, that does sound reasonable and probably only depends on the complexity. If it does not increase numpy's code complexity too much (and obviously it decreases pandas' quite a bit more). I assume that this would mainly move some fairly straightforward and thoroughly tested code from pandas into NumPy? Can't say I am excited about reviewing datetime code, but upstreaming seems much better for the community than band-aids in pandas... - Sebastian > > Tangentially related: with zoneinfo (PEP 615) we should consider > making > those our canonical tzinfos and converting any dateutil/pytz tzinfos > we > encounter to those. They are implemented in C, so I'm _hopeful_ we > can > make some of our vectorized tzconversion code unnecessary. @pganssle > has > suggested we implement our own tzinfos, but I'm holding out hope we > can > keep that upstream. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL:
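To make the unit discussion above concrete, the following shows what plain NumPy already does today (a sketch, no pandas changes assumed); the short-term plan of casting to nanoseconds and back essentially wraps these conversions, and Tom's bounds question is exactly the case where the nanosecond cast runs out of range:

import numpy as np

# NumPy already supports several resolutions for datetime64.
arr = np.array(["1490-01-01", "2200-06-01"], dtype="datetime64[s]")
print(arr.dtype)  # datetime64[s]

# Values inside the ~584-year nanosecond range round-trip fine:
ns = arr[1:].astype("datetime64[ns]")
print(ns.astype("datetime64[s]"))  # ['2200-06-01T00:00:00']

# Values outside that range (here the year 1490) can overflow silently in
# NumPy when cast to nanoseconds, which is why pandas raises
# OutOfBoundsDatetime for such data today.
print(arr[:1].astype("datetime64[ns]"))  # wrapped value, no error raised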