From stefan.pankoke at googlemail.com Fri May 15 16:17:57 2020
From: stefan.pankoke at googlemail.com (Dr. Leo)
Date: Fri, 15 May 2020 22:17:57 +0200
Subject: [Pandas-dev] [ANN] pandaSDMX 1.0.0 released
In-Reply-To: 
References: 
Message-ID: 

Hi,

Two years after the 0.9 release I am pleased to announce the availability of pandaSDMX 1.0.0. This is a major feature release including rewrites in virtually all areas. Certain backwards-incompatible API changes appeared inevitable, but they are largely outweighed by a host of new and enhanced features.

Highlights include:

* a more complete and accurate implementation of the SDMX information model, including hierarchical code lists and facets. This long-overdue feature will considerably facilitate the interpretation and representation of data and metadata.
* better handling of data-source idiosyncrasies, which should ease data acquisition in corner cases
* a streamlined API and more informative string representations for interactive data acquisition and exploration
* the information model has been decoupled from the XML and JSON readers, so that arbitrary data sources outside the SDMX ecosystem can be embedded programmatically. This shift in architecture could eventually seed the transformation of pandaSDMX from a pure client library into an end-to-end SDMX platform for the generation of SDMX files served over HTTP.
* easier and more flexible configuration of HTTP connections through user-provided requests Sessions
* a vastly extended test suite and streamlined documentation
* a modern code base leveraging typing and pydantic

Quick start and links
-------------------------

* Installation (requires Python 3.7): $ pip install pandasdmx
* Documentation: https://pandasdmx.readthedocs.io/
* GitHub: https://github.com/dr-leo/pandaSDMX

Roadmap
---------

* add an intake driver/plugin exposing SDMX datasets and metadata
* provide a conda package
* InternationalStrings: re-implement support for locale selection
* support SDMX-JSON structure messages, which have recently been added to the SDMX standard
* fix a few known issues

Help wanted!

Credits
-----------

Many great people have generously contributed to this release, even though the lion's share of the development was temporarily shouldered by a single collaborator. A big thanks to all of them!

What is pandaSDMX?
----------------------

pandaSDMX is an Apache 2.0-licensed Python library that implements SDMX 2.1 (ISO 17369:2013), a format for the exchange of statistical data and metadata used by national statistical agencies, central banks, and international organisations.

pandaSDMX can be used to:

* explore the data and metadata available from many data providers such as the World Bank, International Monetary Fund, Eurostat, the ECB, OECD, and United Nations;
* parse data and metadata in SDMX-ML (XML) or SDMX-JSON formats, either:
  o from local files, or
  o retrieved from SDMX web services, with query validation and caching;
* convert data and metadata into pandas objects, for use with the analysis, plotting, and other tools in the Python data ecosystem.
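A minimal sketch of the workflow described above, assuming the Request/to_pandas API documented for the 1.0 release; the dataflow id, key, and parameters below are illustrative assumptions, not taken from the announcement:

```python
# Hedged sketch of a pandaSDMX 1.0 session: query a data provider and convert
# the response to pandas objects. The dataflow id ("EXR"), key, and params are
# assumptions chosen for illustration; adjust them to the provider you query.
import pandasdmx as sdmx

ecb = sdmx.Request("ECB")                       # client for the ECB SDMX web service
msg = ecb.data(
    "EXR",                                      # exchange-rate dataflow (assumed id)
    key={"CURRENCY": "USD"},                    # filter on one dimension (assumed key)
    params={"startPeriod": "2019"},
)
data = sdmx.to_pandas(msg)                      # pandas object(s) with a MultiIndex
print(data.head())
```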
From jorisvandenbossche at gmail.com Mon May 25 17:39:13 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 25 May 2020 23:39:13 +0200
Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks
Message-ID: 

Hi list,

Rewriting the BlockManager based on a simpler collection of 1D arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here).

But since the topic came up again recently at the last online dev meeting (and Uwe Korn also wrote a nice blog post about it yesterday), I thought I would do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.
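For context, a small sketch of what "consolidation" means in the current BlockManager. This pokes at the private ._data attribute of pandas ~1.0, so it is an illustration against internal API: attribute names and exact block counts may differ between versions.

```python
# Sketch only: inspect how the current (consolidating) BlockManager groups
# columns into 2D blocks. Uses private internals (._data), which may change.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3, dtype="float64"),
    "b": np.arange(3, dtype="float64"),
    "c": np.arange(3, dtype="int64"),
})

# The two float64 columns are consolidated into a single 2D block, while the
# int64 column gets its own block -> typically 2 blocks in total.
for blk in df._data.blocks:
    print(blk.dtype, blk.shape)

# Inserting a column adds a new, unconsolidated block; a later operation may
# trigger re-consolidation, which copies the float data into one big block.
df["d"] = np.arange(3, dtype="float64")
print(df._data.nblocks)
```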
*Simplification of the internals*

It's regularly brought up as a reason to have 2D ExtensionArrays (EAs) that right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks.
Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks.
Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.

Moreover, when going with all 1D blocks, we can not only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will *also* be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than one with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.

I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.

*Performance*

Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood.
However, based on recent experimentation with this (e.g. triggered by the block-wise frame ops PR, and see also some benchmarks I just posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.

Further, there are also operations that will *benefit* from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown doing frame/frame operations column-wise is currently due to the consolidation at the end). And operations like adding a column, concatenating (with axis=1) or merging dataframes will be much faster when no consolidation is needed.

Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about:

- With limited effort optimizing the column-wise code paths in the internals, we can get a long way.
- After that, if needed, we can still consider whether parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager).

*Possibility to get better copy/view semantics*

Pandas is notorious for how much it copies ("you need 10x your dataframe's size in available memory"), and having 1D blocks will allow us to address part of those concerns.

*No consolidation = less copying.* Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would make it possible to actually add a single column to a dataframe without having to copy the full dataframe.

*Copy / view semantics* Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy and unpredictable (meaning: it depends on the actual block layout, which in practice is unpredictable for a user unaware of the block layout).
Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this. ------------------------------ *So what are the reasons to have 2D blocks?* I personally don't directly see reasons to have 2D blocks *for pandas itself* (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up. But I think there are certainly use cases where 2D blocks can be useful, but typically "external" (but nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. However, I personally think there are possible ways that we can still accommodate for those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. (but will stop here, as this mail is getting already long ..). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Mon May 25 18:45:57 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Mon, 25 May 2020 15:45:57 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for writing this up, Joris. Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this wont just be one massive PR On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi list, > > Rewriting the BlockManager based on a simpler collection of 1D-arrays is > actually on our roadmap (see here > ), > and I also touched on it in a mailing list discussion about pandas 2.0 > earlier this year (see here > ). > > But since the topic came up again recently at the last online dev meeting > (and also Uwe Korn who wrote a nice blog post > about this > yesterday), I thought to do a write-up of my thoughts on why I think we > should actually move towards a simpler, non-consolidating BlockManager with > 1D blocks. > > > *Simplication of the internals* > > It's regularly brought up as a reason to have 2D EextensionArrays (EAs) > because right now we have a lot of special cases for 1D EAs in the > internals. But to be clear: the additional complexity does not come from 1D > EAs in itself, it comes from the fact that we have a mixture of 2D and 1D > blocks. > Solving this would require a consistent block dimension, and thus removing > this added complexity can be done in two ways: have all 1D blocks, or have > all 2D blocks. > Just to say: IMO, this is not an argument in favor of 2D blocks / > consolidation. > > Moreover, when going with all 1D blocks, we cannot only remove the added > complexity from dealing with the mixture of 1D/2D blocks, we will *also* be > able to reduce the complexity of dealing with 2D blocks. A BlockManager > with 2D blocks is inherently more complex than with 1D blocks, as one needs > to deal with proper alignment of the blocks, a more complex "placement" > logic of the blocks, etc. > > I think we would be able to simplify the internals a lot by going with a > BlockManager as a store of 1D arrays. > > > *Performance* > > Performance is typically given as a reason to have consolidated, 2D > blocks. 
And of course, certain operations (especially row-wise operations, > or on dataframes with more columns as rows) will always be faster when done > on a 2D numpy array under the hood. > However, based on recent experimentation with this (eg triggered by the block-wise > frame ops PR , and see > also some benchmarks I justed posted in #10556 > > / this gist > ), > I also think that for many operations and with decent-sized dataframes, > this performance penalty is actually quite OK. > > Further, there are also operations that will *benefit* from 1D blocks. > First, operations that now involve aligning/splitting blocks, > re-consolidation, .. will benefit (e.g. a large part of the slowdown doing > frame/frame operations column-wise is currently due to the consolidation in > the end). And operations like adding a column, concatting (with axis=1) or > merging dataframes will be much faster when no consolidation is needed. > > Personally, I am convinced that with some effort, we can get on-par or > sometimes even better performance with 1D blocks compared to the > performance we have now for those cases that 90+% of our users care about: > > - With limited effort optimizing the column-wise code paths in the > internals, we can get a long way. > - After that, if needed, we can still consider if parts of the > internals could be cythonized to further improve certain bottlenecks (and > actually cythonizing this will also be simpler for a simpler > non-consolidating block manager). > > > *Possibility to get better copy/view semantics* > > Pandas is badly known for how much it copies ("you need 10x the memory > available as the size of your dataframe"), and having 1D blocks will allow > us to address part of those concerns. > > *No consolidation = less copying.* Regularly consolidating introduces > copies, and thus removing consolidation will mean less copies. For example, > this would enable that you can actually add a single column to a dataframe > without having to copy to the full dataframe. > > *Copy / view semantics* Recently there has been discussion again around > whether selecting columns should be a copy or a view, and some other issues > were opened with questions about views/copies when slicing columns. In the > consolidated 2D block layout this will always be inherently messy, and > unpredictable (meaning: depending on the actual block layout, which means > in practice unpredictable for the user unaware of the block layout). > Going with a non-consolidated BlockManager should at least allow us to get > better / more understandable semantics around this. > > > ------------------------------ > > *So what are the reasons to have 2D blocks?* > > I personally don't directly see reasons to have 2D blocks *for pandas > itself* (apart from performance in certain row-wise use cases, and except > for the fact that we have "always done it like this"). But quite likely I > am missing reasons, so please bring them up. > > But I think there are certainly use cases where 2D blocks can be useful, > but typically "external" (but nonetheless important) use cases: conversion > to/from numpy, xarray, etc. A typical example that has recently come up is > scikit-learn, where they want to have a cheap dataframe <-> numpy array > roundtrip for use in their pipelines. > However, I personally think there are possible ways that we can still > accommodate for those use cases, with some effort, while still having 1D > Blocks in pandas itself. 
So IMO this is not sufficient to warrant the > complexity of 2D blocks in pandas. > (but will stop here, as this mail is getting already long ..). > > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Tue May 26 03:50:37 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 26 May 2020 09:50:37 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Hi Joris, Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 What do you think of these usecases? Thanks, Adrin On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi list, > > Rewriting the BlockManager based on a simpler collection of 1D-arrays is > actually on our roadmap (see here > ), > and I also touched on it in a mailing list discussion about pandas 2.0 > earlier this year (see here > ). > > But since the topic came up again recently at the last online dev meeting > (and also Uwe Korn who wrote a nice blog post > about this > yesterday), I thought to do a write-up of my thoughts on why I think we > should actually move towards a simpler, non-consolidating BlockManager with > 1D blocks. > > > *Simplication of the internals* > > It's regularly brought up as a reason to have 2D EextensionArrays (EAs) > because right now we have a lot of special cases for 1D EAs in the > internals. But to be clear: the additional complexity does not come from 1D > EAs in itself, it comes from the fact that we have a mixture of 2D and 1D > blocks. > Solving this would require a consistent block dimension, and thus removing > this added complexity can be done in two ways: have all 1D blocks, or have > all 2D blocks. > Just to say: IMO, this is not an argument in favor of 2D blocks / > consolidation. > > Moreover, when going with all 1D blocks, we cannot only remove the added > complexity from dealing with the mixture of 1D/2D blocks, we will *also* be > able to reduce the complexity of dealing with 2D blocks. A BlockManager > with 2D blocks is inherently more complex than with 1D blocks, as one needs > to deal with proper alignment of the blocks, a more complex "placement" > logic of the blocks, etc. > > I think we would be able to simplify the internals a lot by going with a > BlockManager as a store of 1D arrays. > > > *Performance* > > Performance is typically given as a reason to have consolidated, 2D > blocks. And of course, certain operations (especially row-wise operations, > or on dataframes with more columns as rows) will always be faster when done > on a 2D numpy array under the hood. 
> However, based on recent experimentation with this (eg triggered by the block-wise > frame ops PR , and see > also some benchmarks I justed posted in #10556 > > / this gist > ), > I also think that for many operations and with decent-sized dataframes, > this performance penalty is actually quite OK. > > Further, there are also operations that will *benefit* from 1D blocks. > First, operations that now involve aligning/splitting blocks, > re-consolidation, .. will benefit (e.g. a large part of the slowdown doing > frame/frame operations column-wise is currently due to the consolidation in > the end). And operations like adding a column, concatting (with axis=1) or > merging dataframes will be much faster when no consolidation is needed. > > Personally, I am convinced that with some effort, we can get on-par or > sometimes even better performance with 1D blocks compared to the > performance we have now for those cases that 90+% of our users care about: > > - With limited effort optimizing the column-wise code paths in the > internals, we can get a long way. > - After that, if needed, we can still consider if parts of the > internals could be cythonized to further improve certain bottlenecks (and > actually cythonizing this will also be simpler for a simpler > non-consolidating block manager). > > > *Possibility to get better copy/view semantics* > > Pandas is badly known for how much it copies ("you need 10x the memory > available as the size of your dataframe"), and having 1D blocks will allow > us to address part of those concerns. > > *No consolidation = less copying.* Regularly consolidating introduces > copies, and thus removing consolidation will mean less copies. For example, > this would enable that you can actually add a single column to a dataframe > without having to copy to the full dataframe. > > *Copy / view semantics* Recently there has been discussion again around > whether selecting columns should be a copy or a view, and some other issues > were opened with questions about views/copies when slicing columns. In the > consolidated 2D block layout this will always be inherently messy, and > unpredictable (meaning: depending on the actual block layout, which means > in practice unpredictable for the user unaware of the block layout). > Going with a non-consolidated BlockManager should at least allow us to get > better / more understandable semantics around this. > > > ------------------------------ > > *So what are the reasons to have 2D blocks?* > > I personally don't directly see reasons to have 2D blocks *for pandas > itself* (apart from performance in certain row-wise use cases, and except > for the fact that we have "always done it like this"). But quite likely I > am missing reasons, so please bring them up. > > But I think there are certainly use cases where 2D blocks can be useful, > but typically "external" (but nonetheless important) use cases: conversion > to/from numpy, xarray, etc. A typical example that has recently come up is > scikit-learn, where they want to have a cheap dataframe <-> numpy array > roundtrip for use in their pipelines. > However, I personally think there are possible ways that we can still > accommodate for those use cases, with some effort, while still having 1D > Blocks in pandas itself. So IMO this is not sufficient to warrant the > complexity of 2D blocks in pandas. > (but will stop here, as this mail is getting already long ..). 
> > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 04:35:17 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 10:35:17 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for those links! Personally, I see the "roundtrip conversion to/from sparse matrices" a bit as in the same bucket as conversion to/from a 2D numpy array. Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas' internals and block several other improvements? It's a trade-off that we need to make. Moreover, I think that we could accommodate the important part of those use cases also with a column-store DataFrame, with some effort (but with less complexity as a consolidated BlockManager). Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block). Thinking out loud here, didn't try anything in code: - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. - We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so this would need some more exploration to see if how this would be possible (without duplicating a lot), and some buy-in from people interested in this. I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case. Joris On Tue, 26 May 2020 at 09:50, Adrin wrote: > Hi Joris, > > Thanks for the summary. I think another missing point is the roundtrip > conversion to/from sparse matrices. 
> There are some benchmarks and discussion here; > https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 > and here's some discussion on the pandas issue tracker: > https://github.com/pandas-dev/pandas/issues/33182 > and some benchmark by Tom, assuming pandas would accept a 2D sparse array: > https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 > > What do you think of these usecases? > > Thanks, > Adrin > > On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi list, >> >> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >> actually on our roadmap (see here >> ), >> and I also touched on it in a mailing list discussion about pandas 2.0 >> earlier this year (see here >> >> ). >> >> But since the topic came up again recently at the last online dev meeting >> (and also Uwe Korn who wrote a nice blog post >> about this >> yesterday), I thought to do a write-up of my thoughts on why I think we >> should actually move towards a simpler, non-consolidating BlockManager with >> 1D blocks. >> >> >> *Simplication of the internals* >> >> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >> because right now we have a lot of special cases for 1D EAs in the >> internals. But to be clear: the additional complexity does not come from 1D >> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >> blocks. >> Solving this would require a consistent block dimension, and thus >> removing this added complexity can be done in two ways: have all 1D blocks, >> or have all 2D blocks. >> Just to say: IMO, this is not an argument in favor of 2D blocks / >> consolidation. >> >> Moreover, when going with all 1D blocks, we cannot only remove the added >> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be >> able to reduce the complexity of dealing with 2D blocks. A BlockManager >> with 2D blocks is inherently more complex than with 1D blocks, as one needs >> to deal with proper alignment of the blocks, a more complex "placement" >> logic of the blocks, etc. >> >> I think we would be able to simplify the internals a lot by going with a >> BlockManager as a store of 1D arrays. >> >> >> *Performance* >> >> Performance is typically given as a reason to have consolidated, 2D >> blocks. And of course, certain operations (especially row-wise operations, >> or on dataframes with more columns as rows) will always be faster when done >> on a 2D numpy array under the hood. >> However, based on recent experimentation with this (eg triggered by the block-wise >> frame ops PR , and see >> also some benchmarks I justed posted in #10556 >> >> / this gist >> ), >> I also think that for many operations and with decent-sized dataframes, >> this performance penalty is actually quite OK. >> >> Further, there are also operations that will *benefit* from 1D blocks. >> First, operations that now involve aligning/splitting blocks, >> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >> frame/frame operations column-wise is currently due to the consolidation in >> the end). And operations like adding a column, concatting (with axis=1) or >> merging dataframes will be much faster when no consolidation is needed. 
>> >> Personally, I am convinced that with some effort, we can get on-par or >> sometimes even better performance with 1D blocks compared to the >> performance we have now for those cases that 90+% of our users care about: >> >> - With limited effort optimizing the column-wise code paths in the >> internals, we can get a long way. >> - After that, if needed, we can still consider if parts of the >> internals could be cythonized to further improve certain bottlenecks (and >> actually cythonizing this will also be simpler for a simpler >> non-consolidating block manager). >> >> >> *Possibility to get better copy/view semantics* >> >> Pandas is badly known for how much it copies ("you need 10x the memory >> available as the size of your dataframe"), and having 1D blocks will allow >> us to address part of those concerns. >> >> *No consolidation = less copying.* Regularly consolidating introduces >> copies, and thus removing consolidation will mean less copies. For example, >> this would enable that you can actually add a single column to a dataframe >> without having to copy to the full dataframe. >> >> *Copy / view semantics* Recently there has been discussion again around >> whether selecting columns should be a copy or a view, and some other issues >> were opened with questions about views/copies when slicing columns. In the >> consolidated 2D block layout this will always be inherently messy, and >> unpredictable (meaning: depending on the actual block layout, which means >> in practice unpredictable for the user unaware of the block layout). >> Going with a non-consolidated BlockManager should at least allow us to >> get better / more understandable semantics around this. >> >> >> ------------------------------ >> >> *So what are the reasons to have 2D blocks?* >> >> I personally don't directly see reasons to have 2D blocks *for pandas >> itself* (apart from performance in certain row-wise use cases, and >> except for the fact that we have "always done it like this"). But quite >> likely I am missing reasons, so please bring them up. >> >> But I think there are certainly use cases where 2D blocks can be useful, >> but typically "external" (but nonetheless important) use cases: conversion >> to/from numpy, xarray, etc. A typical example that has recently come up is >> scikit-learn, where they want to have a cheap dataframe <-> numpy array >> roundtrip for use in their pipelines. >> However, I personally think there are possible ways that we can still >> accommodate for those use cases, with some effort, while still having 1D >> Blocks in pandas itself. So IMO this is not sufficient to warrant the >> complexity of 2D blocks in pandas. >> (but will stop here, as this mail is getting already long ..). >> >> Joris >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 04:55:08 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 10:55:08 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 00:46, Brock Mendel wrote: > Thanks for writing this up, Joris. Assuming we go down this path, do you > have an idea of how we get from here to there incrementally? i.e. 
> presumably this wont just be one massive PR > Yes, this is certainly not a one-PR change. I think there are multiple options for working towards this, that are worth discussing. But personally, I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions in an agreed-upon roadmap. (and of course, it's not because something is on our roadmap that it can't be questioned and discussed again in the future, as we are also doing now). --- Some thoughts on possible options: - We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing some code paths (eg constructing a new block) specifically for the case of 1D ExtensionBlocks. - A "consolidation policy" option similarly as in the branch discussed in https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So we could add 1D versions of our numeric blocks as well. But that would probably add a lot of complexity, although temporary, to the Blocks, so maybe not an ideal path forward. - Add a version of the ExtensionBlock but that can work with numpy arrays instead of extension arrays, or actually use the "PandasArrays" to store it them in the existing ExtensionBlock (so to already start using the existing 1D blocks without requiring all dtypes to be extension dtypes). Those are all about BlockManager with 1D blocks. Once we only have 1D Blocks, I suppose there are many things we could simplify in the current BlockManager. The intermediate step of the current BlockManager with 1D blocks might not be an optimal situation, but seems the easiest as intermediate goal in practice. It probably also depends on how much "backwards compatibility" or "transition period" we want to provide. > On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi list, >> >> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >> actually on our roadmap (see here >> ), >> and I also touched on it in a mailing list discussion about pandas 2.0 >> earlier this year (see here >> >> ). >> >> But since the topic came up again recently at the last online dev meeting >> (and also Uwe Korn who wrote a nice blog post >> about this >> yesterday), I thought to do a write-up of my thoughts on why I think we >> should actually move towards a simpler, non-consolidating BlockManager with >> 1D blocks. >> >> >> *Simplication of the internals* >> >> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >> because right now we have a lot of special cases for 1D EAs in the >> internals. But to be clear: the additional complexity does not come from 1D >> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >> blocks. >> Solving this would require a consistent block dimension, and thus >> removing this added complexity can be done in two ways: have all 1D blocks, >> or have all 2D blocks. >> Just to say: IMO, this is not an argument in favor of 2D blocks / >> consolidation. 
>> >> Moreover, when going with all 1D blocks, we cannot only remove the added >> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be >> able to reduce the complexity of dealing with 2D blocks. A BlockManager >> with 2D blocks is inherently more complex than with 1D blocks, as one needs >> to deal with proper alignment of the blocks, a more complex "placement" >> logic of the blocks, etc. >> >> I think we would be able to simplify the internals a lot by going with a >> BlockManager as a store of 1D arrays. >> >> >> *Performance* >> >> Performance is typically given as a reason to have consolidated, 2D >> blocks. And of course, certain operations (especially row-wise operations, >> or on dataframes with more columns as rows) will always be faster when done >> on a 2D numpy array under the hood. >> However, based on recent experimentation with this (eg triggered by the block-wise >> frame ops PR , and see >> also some benchmarks I justed posted in #10556 >> >> / this gist >> ), >> I also think that for many operations and with decent-sized dataframes, >> this performance penalty is actually quite OK. >> >> Further, there are also operations that will *benefit* from 1D blocks. >> First, operations that now involve aligning/splitting blocks, >> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >> frame/frame operations column-wise is currently due to the consolidation in >> the end). And operations like adding a column, concatting (with axis=1) or >> merging dataframes will be much faster when no consolidation is needed. >> >> Personally, I am convinced that with some effort, we can get on-par or >> sometimes even better performance with 1D blocks compared to the >> performance we have now for those cases that 90+% of our users care about: >> >> - With limited effort optimizing the column-wise code paths in the >> internals, we can get a long way. >> - After that, if needed, we can still consider if parts of the >> internals could be cythonized to further improve certain bottlenecks (and >> actually cythonizing this will also be simpler for a simpler >> non-consolidating block manager). >> >> >> *Possibility to get better copy/view semantics* >> >> Pandas is badly known for how much it copies ("you need 10x the memory >> available as the size of your dataframe"), and having 1D blocks will allow >> us to address part of those concerns. >> >> *No consolidation = less copying.* Regularly consolidating introduces >> copies, and thus removing consolidation will mean less copies. For example, >> this would enable that you can actually add a single column to a dataframe >> without having to copy to the full dataframe. >> >> *Copy / view semantics* Recently there has been discussion again around >> whether selecting columns should be a copy or a view, and some other issues >> were opened with questions about views/copies when slicing columns. In the >> consolidated 2D block layout this will always be inherently messy, and >> unpredictable (meaning: depending on the actual block layout, which means >> in practice unpredictable for the user unaware of the block layout). >> Going with a non-consolidated BlockManager should at least allow us to >> get better / more understandable semantics around this. 
>> >> >> ------------------------------ >> >> *So what are the reasons to have 2D blocks?* >> >> I personally don't directly see reasons to have 2D blocks *for pandas >> itself* (apart from performance in certain row-wise use cases, and >> except for the fact that we have "always done it like this"). But quite >> likely I am missing reasons, so please bring them up. >> >> But I think there are certainly use cases where 2D blocks can be useful, >> but typically "external" (but nonetheless important) use cases: conversion >> to/from numpy, xarray, etc. A typical example that has recently come up is >> scikit-learn, where they want to have a cheap dataframe <-> numpy array >> roundtrip for use in their pipelines. >> However, I personally think there are possible ways that we can still >> accommodate for those use cases, with some effort, while still having 1D >> Blocks in pandas itself. So IMO this is not sufficient to warrant the >> complexity of 2D blocks in pandas. >> (but will stop here, as this mail is getting already long ..). >> >> Joris >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhochy at gmail.com Tue May 26 06:28:16 2020 From: xhochy at gmail.com (Uwe L. Korn) Date: Tue, 26 May 2020 12:28:16 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Hello all, thanks Joris for starting this thread. For myself, I struggle a bit to understand the cases that are made for the BlockManager benefits. The examples are mostly operations that act on two full DataFrames like "df1 + df2" or come from the fact that one wants to keep a single-type 2D matrix together with column labels but not acutally make use of pandas functionality afterwards. In the code I write on a day-to-day basis, we don't have these use cases thus I'm struggling to understand the real-world benefit of having these operations supported as efficiently as possible in pandas. Even when using scikit-learn pipelines, we have for as long as possible heterogeneously typed DataFrames and only convert to a single-type matrix as late as possible. Thus can anyone enlighten me in which real-world use cases this needs to supported in pandas? Best Uwe Am Di., 26. Mai 2020 um 10:55 Uhr schrieb Joris Van den Bossche < jorisvandenbossche at gmail.com>: > On Tue, 26 May 2020 at 00:46, Brock Mendel wrote: > >> Thanks for writing this up, Joris. Assuming we go down this path, do you >> have an idea of how we get from here to there incrementally? i.e. >> presumably this wont just be one massive PR >> > > Yes, this is certainly not a one-PR change. I think there are multiple > options for working towards this, that are worth discussing. > > But personally, I would first like to focus on the "assuming we go down > this path" part. Let's discuss the pros and cons and trade-offs, and try to > turn assumptions in an agreed-upon roadmap. > (and of course, it's not because something is on our roadmap that it can't > be questioned and discussed again in the future, as we are also doing now). > > --- > > Some thoughts on possible options: > > - We briefly discussed before the idea of using (nullable) extension > dtypes for all dtypes by default in pandas 2.0. 
If we strive towards that, > and assuming we keep the current 1D-restriction on ExtensionBlock, then we > would "automatically" get a BlockManager with 1D blocks. And we could then > focus on optimizing some code paths (eg constructing a new block) > specifically for the case of 1D ExtensionBlocks. > - A "consolidation policy" option similarly as in the branch discussed in > https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch > still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) > and not actually 1D blocks. So we could add 1D versions of our numeric > blocks as well. But that would probably add a lot of complexity, although > temporary, to the Blocks, so maybe not an ideal path forward. > - Add a version of the ExtensionBlock but that can work with numpy arrays > instead of extension arrays, or actually use the "PandasArrays" to store it > them in the existing ExtensionBlock (so to already start using the existing > 1D blocks without requiring all dtypes to be extension dtypes). > > Those are all about BlockManager with 1D blocks. Once we only have 1D > Blocks, I suppose there are many things we could simplify in the current > BlockManager. The intermediate step of the current BlockManager with 1D > blocks might not be an optimal situation, but seems the easiest as > intermediate goal in practice. > > It probably also depends on how much "backwards compatibility" or > "transition period" we want to provide. > > >> On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi list, >>> >>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >>> actually on our roadmap (see here >>> ), >>> and I also touched on it in a mailing list discussion about pandas 2.0 >>> earlier this year (see here >>> >>> ). >>> >>> But since the topic came up again recently at the last online dev >>> meeting (and also Uwe Korn who wrote a nice blog post >>> about >>> this yesterday), I thought to do a write-up of my thoughts on why I think >>> we should actually move towards a simpler, non-consolidating BlockManager >>> with 1D blocks. >>> >>> >>> *Simplication of the internals* >>> >>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>> because right now we have a lot of special cases for 1D EAs in the >>> internals. But to be clear: the additional complexity does not come from 1D >>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>> blocks. >>> Solving this would require a consistent block dimension, and thus >>> removing this added complexity can be done in two ways: have all 1D blocks, >>> or have all 2D blocks. >>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>> consolidation. >>> >>> Moreover, when going with all 1D blocks, we cannot only remove the added >>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* >>> be able to reduce the complexity of dealing with 2D blocks. A >>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>> as one needs to deal with proper alignment of the blocks, a more complex >>> "placement" logic of the blocks, etc. >>> >>> I think we would be able to simplify the internals a lot by going with a >>> BlockManager as a store of 1D arrays. >>> >>> >>> *Performance* >>> >>> Performance is typically given as a reason to have consolidated, 2D >>> blocks. 
And of course, certain operations (especially row-wise operations, >>> or on dataframes with more columns as rows) will always be faster when done >>> on a 2D numpy array under the hood. >>> However, based on recent experimentation with this (eg triggered by the block-wise >>> frame ops PR , and see >>> also some benchmarks I justed posted in #10556 >>> >>> / this gist >>> ), >>> I also think that for many operations and with decent-sized dataframes, >>> this performance penalty is actually quite OK. >>> >>> Further, there are also operations that will *benefit* from 1D blocks. >>> First, operations that now involve aligning/splitting blocks, >>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>> frame/frame operations column-wise is currently due to the consolidation in >>> the end). And operations like adding a column, concatting (with axis=1) or >>> merging dataframes will be much faster when no consolidation is needed. >>> >>> Personally, I am convinced that with some effort, we can get on-par or >>> sometimes even better performance with 1D blocks compared to the >>> performance we have now for those cases that 90+% of our users care about: >>> >>> - With limited effort optimizing the column-wise code paths in the >>> internals, we can get a long way. >>> - After that, if needed, we can still consider if parts of the >>> internals could be cythonized to further improve certain bottlenecks (and >>> actually cythonizing this will also be simpler for a simpler >>> non-consolidating block manager). >>> >>> >>> *Possibility to get better copy/view semantics* >>> >>> Pandas is badly known for how much it copies ("you need 10x the memory >>> available as the size of your dataframe"), and having 1D blocks will allow >>> us to address part of those concerns. >>> >>> *No consolidation = less copying.* Regularly consolidating introduces >>> copies, and thus removing consolidation will mean less copies. For example, >>> this would enable that you can actually add a single column to a dataframe >>> without having to copy to the full dataframe. >>> >>> *Copy / view semantics* Recently there has been discussion again around >>> whether selecting columns should be a copy or a view, and some other issues >>> were opened with questions about views/copies when slicing columns. In the >>> consolidated 2D block layout this will always be inherently messy, and >>> unpredictable (meaning: depending on the actual block layout, which means >>> in practice unpredictable for the user unaware of the block layout). >>> Going with a non-consolidated BlockManager should at least allow us to >>> get better / more understandable semantics around this. >>> >>> >>> ------------------------------ >>> >>> *So what are the reasons to have 2D blocks?* >>> >>> I personally don't directly see reasons to have 2D blocks *for pandas >>> itself* (apart from performance in certain row-wise use cases, and >>> except for the fact that we have "always done it like this"). But quite >>> likely I am missing reasons, so please bring them up. >>> >>> But I think there are certainly use cases where 2D blocks can be useful, >>> but typically "external" (but nonetheless important) use cases: conversion >>> to/from numpy, xarray, etc. A typical example that has recently come up is >>> scikit-learn, where they want to have a cheap dataframe <-> numpy array >>> roundtrip for use in their pipelines. 
>>> However, I personally think there are possible ways that we can still >>> accommodate for those use cases, with some effort, while still having 1D >>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>> complexity of 2D blocks in pandas. >>> (but will stop here, as this mail is getting already long ..). >>> >>> Joris >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Tue May 26 07:21:33 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 06:21:33 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Thanks for those links! > > Personally, I see the "roundtrip conversion to/from sparse matrices" a bit > as in the same bucket as conversion to/from a 2D numpy array. > Yes, both are important use cases. But the question we need to ask > ourselves is still: is this important enough to hugely complicate the > pandas' internals and block several other improvements? It's a trade-off > that we need to make. > > Moreover, I think that we could accommodate the important part of those > use cases also with a column-store DataFrame, with some effort (but with > less complexity as a consolidated BlockManager). > > Focusing on scikit-learn: in the end, you mostly care about cheap > roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame > to carry feature labels in between steps of a pipeline, correct? > Such cheap roundtripping is only possible anyway if you have a single > dtype for all columns (which is typically the case after some > transformation step). So you don't necessarily need consolidated blocks > specifically, but rather the ability to store a *single* 2D array/matrix in > a DataFrame (so kind of a single 2D block). > > Thinking out loud here, didn't try anything in code: > > - We could make the DataFrame construction from a 2D array/matrix kind of > "lazy" (or have an option to do it like this): upon construction just store > the 2D array as is, and only once you perform an actual operation on it, > convert to a columnar store. And that would make it possible to still get > the 2D array back with zero-copy, if all you did was passing this DataFrame > to the next step of the pipeline. > - We could take the above a step further and try to preserve the 2D array > under the hood in some "easy" operations (but again, limited to a single 2D > block/array, not multiple consolidated blocks). This is actually similar to > the DataMatrix that pandas had a very long time ago. Of course this adds > back complexity, so this would need some more exploration to see if how > this would be possible (without duplicating a lot), and some buy-in from > people interested in this. > > I think the first option should be fairly easy to do, and should solve a > large part of the concerns for scikit-learn (I think?). > I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. 
> I think the second idea is also interesting: IMO such a data structure > would be useful to have somewhere in the PyData ecosystem, and a worthwhile > discussion to think about where this could fit. Maybe the answer is simply: > use xarray for this use case (although there are still differences) ? That > are interesting discussions, but personally I would not complicate the core > pandas data model for heterogeneous dataframes to accommodate the > single-dtype + fixed number of columns use case. > The current prototype[1] accepts preserves both xarray and pandas data structures. [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > Joris > > On Tue, 26 May 2020 at 09:50, Adrin wrote: > >> Hi Joris, >> >> Thanks for the summary. I think another missing point is the roundtrip >> conversion to/from sparse matrices. >> There are some benchmarks and discussion here; >> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >> and here's some discussion on the pandas issue tracker: >> https://github.com/pandas-dev/pandas/issues/33182 >> and some benchmark by Tom, assuming pandas would accept a 2D sparse >> array: >> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >> >> What do you think of these usecases? >> >> Thanks, >> Adrin >> >> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi list, >>> >>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is >>> actually on our roadmap (see here >>> ), >>> and I also touched on it in a mailing list discussion about pandas 2.0 >>> earlier this year (see here >>> >>> ). >>> >>> But since the topic came up again recently at the last online dev >>> meeting (and also Uwe Korn who wrote a nice blog post >>> about >>> this yesterday), I thought to do a write-up of my thoughts on why I think >>> we should actually move towards a simpler, non-consolidating BlockManager >>> with 1D blocks. >>> >>> >>> *Simplication of the internals* >>> >>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>> because right now we have a lot of special cases for 1D EAs in the >>> internals. But to be clear: the additional complexity does not come from 1D >>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>> blocks. >>> Solving this would require a consistent block dimension, and thus >>> removing this added complexity can be done in two ways: have all 1D blocks, >>> or have all 2D blocks. >>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>> consolidation. >>> >>> Moreover, when going with all 1D blocks, we cannot only remove the added >>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* >>> be able to reduce the complexity of dealing with 2D blocks. A >>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>> as one needs to deal with proper alignment of the blocks, a more complex >>> "placement" logic of the blocks, etc. >>> >>> I think we would be able to simplify the internals a lot by going with a >>> BlockManager as a store of 1D arrays. >>> >>> >>> *Performance* >>> >>> Performance is typically given as a reason to have consolidated, 2D >>> blocks. And of course, certain operations (especially row-wise operations, >>> or on dataframes with more columns as rows) will always be faster when done >>> on a 2D numpy array under the hood. 
>>> However, based on recent experimentation with this (eg triggered by the block-wise >>> frame ops PR , and see >>> also some benchmarks I justed posted in #10556 >>> >>> / this gist >>> ), >>> I also think that for many operations and with decent-sized dataframes, >>> this performance penalty is actually quite OK. >>> >>> Further, there are also operations that will *benefit* from 1D blocks. >>> First, operations that now involve aligning/splitting blocks, >>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>> frame/frame operations column-wise is currently due to the consolidation in >>> the end). And operations like adding a column, concatting (with axis=1) or >>> merging dataframes will be much faster when no consolidation is needed. >>> >>> Personally, I am convinced that with some effort, we can get on-par or >>> sometimes even better performance with 1D blocks compared to the >>> performance we have now for those cases that 90+% of our users care about: >>> >>> - With limited effort optimizing the column-wise code paths in the >>> internals, we can get a long way. >>> - After that, if needed, we can still consider if parts of the >>> internals could be cythonized to further improve certain bottlenecks (and >>> actually cythonizing this will also be simpler for a simpler >>> non-consolidating block manager). >>> >>> >>> *Possibility to get better copy/view semantics* >>> >>> Pandas is badly known for how much it copies ("you need 10x the memory >>> available as the size of your dataframe"), and having 1D blocks will allow >>> us to address part of those concerns. >>> >>> *No consolidation = less copying.* Regularly consolidating introduces >>> copies, and thus removing consolidation will mean less copies. For example, >>> this would enable that you can actually add a single column to a dataframe >>> without having to copy to the full dataframe. >>> >>> *Copy / view semantics* Recently there has been discussion again around >>> whether selecting columns should be a copy or a view, and some other issues >>> were opened with questions about views/copies when slicing columns. In the >>> consolidated 2D block layout this will always be inherently messy, and >>> unpredictable (meaning: depending on the actual block layout, which means >>> in practice unpredictable for the user unaware of the block layout). >>> Going with a non-consolidated BlockManager should at least allow us to >>> get better / more understandable semantics around this. >>> >>> >>> ------------------------------ >>> >>> *So what are the reasons to have 2D blocks?* >>> >>> I personally don't directly see reasons to have 2D blocks *for pandas >>> itself* (apart from performance in certain row-wise use cases, and >>> except for the fact that we have "always done it like this"). But quite >>> likely I am missing reasons, so please bring them up. >>> >>> But I think there are certainly use cases where 2D blocks can be useful, >>> but typically "external" (but nonetheless important) use cases: conversion >>> to/from numpy, xarray, etc. A typical example that has recently come up is >>> scikit-learn, where they want to have a cheap dataframe <-> numpy array >>> roundtrip for use in their pipelines. >>> However, I personally think there are possible ways that we can still >>> accommodate for those use cases, with some effort, while still having 1D >>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>> complexity of 2D blocks in pandas. 
>>> (but will stop here, as this mail is getting already long ..). >>> >>> Joris >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue May 26 08:16:53 2020 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 26 May 2020 08:16:53 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> A little historical perspective 10 years ago the standard input to a Dataframe was a single dtype 2D numpy array. This provides the following nice properties: - 0 cost construction, you can simply wrap Dataframe around the input with very little overhead. This provides a labeled array interface, gaining pandas users - very fast reductions; the block is passed to numpy directly for the reductions; numpy can then reduce with aligned memory access - almost all operations in pandas coerced to float64 on operations The block manager is optimized for this case as this was the original DataMatrix. It serves its purpose pretty well. In the last few years things have changed in the following ways: - dict of 1D numpy arrays is by far the most common construction - heterogenous dtypes have grown quite a bit, eg it?s now very common to use int8, float32; these are also preserved pretty well by pandas operations - non numpy backed dtypes are increasingly common To me removing the block manager is not about performance, rather about simplifying the code and mental model, though we should be mindful of construction from 2D inputs will require splitting and thus be not cheap (note that you can view the 1D slices but these are not memory aligned); this is a typical trap that folks get into; 1D looks all rosy but it all depends on usecase. I think it would be ok for pandas to move to dict of columns and simply document the non performing cases (eg very wide single dtypes or 2D construction); I suppose it?s also possible to reinvent the DataMatrix in a limited form but that of course adds complexity and would like to see that after a refactor. my 3c Jeff On May 26, 2020, at 7:22 AM, Tom Augspurger wrote: > > ? > > >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche wrote: >> Thanks for those links! >> >> Personally, I see the "roundtrip conversion to/from sparse matrices" a bit as in the same bucket as conversion to/from a 2D numpy array. >> Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas' internals and block several other improvements? It's a trade-off that we need to make. >> >> Moreover, I think that we could accommodate the important part of those use cases also with a column-store DataFrame, with some effort (but with less complexity as a consolidated BlockManager). >> >> Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? 
>> Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block). >> >> Thinking out loud here, didn't try anything in code: >> >> - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. >> - We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so this would need some more exploration to see if how this would be possible (without duplicating a lot), and some buy-in from people interested in this. >> >> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). > > I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. > >> I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case. > > The current prototype[1] accepts preserves both xarray and pandas data structures. > > [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > >> Joris >> >>> On Tue, 26 May 2020 at 09:50, Adrin wrote: >>> Hi Joris, >>> >>> Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. >>> There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >>> and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 >>> and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >>> >>> What do you think of these usecases? >>> >>> Thanks, >>> Adrin >>> >>>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche wrote: >>>> Hi list, >>>> >>>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here). >>>> >>>> But since the topic came up again recently at the last online dev meeting (and also Uwe Korn who wrote a nice blog post about this yesterday), I thought to do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks. 
>>>> >>>> >>>> >>>> Simplication of the internals >>>> >>>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) because right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks. >>>> Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks. >>>> Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation. >>>> >>>> Moreover, when going with all 1D blocks, we cannot only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will also be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc. >>>> >>>> I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays. >>>> >>>> >>>> >>>> Performance >>>> >>>> Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or on dataframes with more columns as rows) will always be faster when done on a 2D numpy array under the hood. >>>> However, based on recent experimentation with this (eg triggered by the block-wise frame ops PR, and see also some benchmarks I justed posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK. >>>> >>>> Further, there are also operations that will benefit from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, .. will benefit (e.g. a large part of the slowdown doing frame/frame operations column-wise is currently due to the consolidation in the end). And operations like adding a column, concatting (with axis=1) or merging dataframes will be much faster when no consolidation is needed. >>>> >>>> Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about: >>>> >>>> With limited effort optimizing the column-wise code paths in the internals, we can get a long way. >>>> After that, if needed, we can still consider if parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager). >>>> >>>> Possibility to get better copy/view semantics >>>> >>>> Pandas is badly known for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns. >>>> >>>> No consolidation = less copying. Regularly consolidating introduces copies, and thus removing consolidation will mean less copies. For example, this would enable that you can actually add a single column to a dataframe without having to copy to the full dataframe. >>>> >>>> Copy / view semantics Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. 
In the consolidated 2D block layout this will always be inherently messy, and unpredictable (meaning: depending on the actual block layout, which means in practice unpredictable for the user unaware of the block layout). >>>> Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this. >>>> >>>> >>>> >>>> So what are the reasons to have 2D blocks? >>>> >>>> I personally don't directly see reasons to have 2D blocks for pandas itself (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up. >>>> >>>> But I think there are certainly use cases where 2D blocks can be useful, but typically "external" (but nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. >>>> However, I personally think there are possible ways that we can still accommodate for those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. >>>> (but will stop here, as this mail is getting already long ..). >>>> >>>> >>>> Joris >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Tue May 26 10:13:44 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Tue, 26 May 2020 07:13:44 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> References: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> Message-ID: >> Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this wont just be one massive PR > [...] I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions in an agreed-upon roadmap. [...] I think understanding the difficulty/feasibility of the implementation is a pretty important part of the pros/cons. Looking back at #10556, I'm wondering if we could disable _most_ consolidation, e.g. only consolidate when making copies anyway, which might be a never-break-views policy. From a user standpoint would that achieve much/most of th benefits here? On Tue, May 26, 2020 at 5:17 AM Jeff Reback wrote: > A little historical perspective > > 10 years ago the standard input to a Dataframe was a single dtype 2D numpy > array. This provides the following nice properties: > > - 0 cost construction, you can simply wrap Dataframe around the input with > very little overhead. 
This provides a labeled array interface, gaining > pandas users > - very fast reductions; the block is passed to numpy directly for the > reductions; numpy can then reduce with aligned memory access > - almost all operations in pandas coerced to float64 on operations > > The block manager is optimized for this case as this was the original > DataMatrix. It serves its purpose pretty well. > > In the last few years things have changed in the following ways: > > - dict of 1D numpy arrays is by far the most common construction > - heterogenous dtypes have grown quite a bit, eg it?s now very common to > use int8, float32; these are also preserved pretty well by pandas > operations > - non numpy backed dtypes are increasingly common > > To me removing the block manager is not about performance, rather about > simplifying the code and mental model, though we should be mindful of > construction from 2D inputs will require splitting and thus be not cheap > (note that you can view the 1D slices but these are not memory aligned); > this is a typical trap that folks get into; 1D looks all rosy but it all > depends on usecase. > > I think it would be ok for pandas to move to dict of columns and simply > document the non performing cases (eg very wide single dtypes or 2D > construction); > > I suppose it?s also possible to reinvent the DataMatrix in a limited form > but that of course adds complexity and would like to see that after a > refactor. > > my 3c > > Jeff > > On May 26, 2020, at 7:22 AM, Tom Augspurger > wrote: > > > ? > > > On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Thanks for those links! >> >> Personally, I see the "roundtrip conversion to/from sparse matrices" a >> bit as in the same bucket as conversion to/from a 2D numpy array. >> Yes, both are important use cases. But the question we need to ask >> ourselves is still: is this important enough to hugely complicate the >> pandas' internals and block several other improvements? It's a trade-off >> that we need to make. >> >> Moreover, I think that we could accommodate the important part of those >> use cases also with a column-store DataFrame, with some effort (but with >> less complexity as a consolidated BlockManager). >> >> Focusing on scikit-learn: in the end, you mostly care about cheap >> roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame >> to carry feature labels in between steps of a pipeline, correct? >> Such cheap roundtripping is only possible anyway if you have a single >> dtype for all columns (which is typically the case after some >> transformation step). So you don't necessarily need consolidated blocks >> specifically, but rather the ability to store a *single* 2D array/matrix in >> a DataFrame (so kind of a single 2D block). >> >> Thinking out loud here, didn't try anything in code: >> >> - We could make the DataFrame construction from a 2D array/matrix kind of >> "lazy" (or have an option to do it like this): upon construction just store >> the 2D array as is, and only once you perform an actual operation on it, >> convert to a columnar store. And that would make it possible to still get >> the 2D array back with zero-copy, if all you did was passing this DataFrame >> to the next step of the pipeline. >> - We could take the above a step further and try to preserve the 2D array >> under the hood in some "easy" operations (but again, limited to a single 2D >> block/array, not multiple consolidated blocks). 
This is actually similar to >> the DataMatrix that pandas had a very long time ago. Of course this adds >> back complexity, so this would need some more exploration to see if how >> this would be possible (without duplicating a lot), and some buy-in from >> people interested in this. >> >> I think the first option should be fairly easy to do, and should solve a >> large part of the concerns for scikit-learn (I think?). >> > > I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > > >> I think the second idea is also interesting: IMO such a data structure >> would be useful to have somewhere in the PyData ecosystem, and a worthwhile >> discussion to think about where this could fit. Maybe the answer is simply: >> use xarray for this use case (although there are still differences) ? That >> are interesting discussions, but personally I would not complicate the core >> pandas data model for heterogeneous dataframes to accommodate the >> single-dtype + fixed number of columns use case. >> > > The current prototype[1] accepts preserves both xarray and pandas data > structures. > > [1]: https://github.com/scikit-learn/scikit-learn/pull/16772 > > >> Joris >> >> On Tue, 26 May 2020 at 09:50, Adrin wrote: >> >>> Hi Joris, >>> >>> Thanks for the summary. I think another missing point is the roundtrip >>> conversion to/from sparse matrices. >>> There are some benchmarks and discussion here; >>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097 >>> and here's some discussion on the pandas issue tracker: >>> https://github.com/pandas-dev/pandas/issues/33182 >>> and some benchmark by Tom, assuming pandas would accept a 2D sparse >>> array: >>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896 >>> >>> What do you think of these usecases? >>> >>> Thanks, >>> Adrin >>> >>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> Hi list, >>>> >>>> Rewriting the BlockManager based on a simpler collection of 1D-arrays >>>> is actually on our roadmap (see here >>>> ), >>>> and I also touched on it in a mailing list discussion about pandas 2.0 >>>> earlier this year (see here >>>> >>>> ). >>>> >>>> But since the topic came up again recently at the last online dev >>>> meeting (and also Uwe Korn who wrote a nice blog post >>>> about >>>> this yesterday), I thought to do a write-up of my thoughts on why I think >>>> we should actually move towards a simpler, non-consolidating BlockManager >>>> with 1D blocks. >>>> >>>> >>>> *Simplication of the internals* >>>> >>>> It's regularly brought up as a reason to have 2D EextensionArrays (EAs) >>>> because right now we have a lot of special cases for 1D EAs in the >>>> internals. But to be clear: the additional complexity does not come from 1D >>>> EAs in itself, it comes from the fact that we have a mixture of 2D and 1D >>>> blocks. >>>> Solving this would require a consistent block dimension, and thus >>>> removing this added complexity can be done in two ways: have all 1D blocks, >>>> or have all 2D blocks. >>>> Just to say: IMO, this is not an argument in favor of 2D blocks / >>>> consolidation. >>>> >>>> Moreover, when going with all 1D blocks, we cannot only remove the >>>> added complexity from dealing with the mixture of 1D/2D blocks, we will >>>> *also* be able to reduce the complexity of dealing with 2D blocks. 
A >>>> BlockManager with 2D blocks is inherently more complex than with 1D blocks, >>>> as one needs to deal with proper alignment of the blocks, a more complex >>>> "placement" logic of the blocks, etc. >>>> >>>> I think we would be able to simplify the internals a lot by going with >>>> a BlockManager as a store of 1D arrays. >>>> >>>> >>>> *Performance* >>>> >>>> Performance is typically given as a reason to have consolidated, 2D >>>> blocks. And of course, certain operations (especially row-wise operations, >>>> or on dataframes with more columns as rows) will always be faster when done >>>> on a 2D numpy array under the hood. >>>> However, based on recent experimentation with this (eg triggered by the >>>> block-wise frame ops PR >>>> , and see also some >>>> benchmarks I justed posted in #10556 >>>> >>>> / this gist >>>> ), >>>> I also think that for many operations and with decent-sized dataframes, >>>> this performance penalty is actually quite OK. >>>> >>>> Further, there are also operations that will *benefit* from 1D blocks. >>>> First, operations that now involve aligning/splitting blocks, >>>> re-consolidation, .. will benefit (e.g. a large part of the slowdown doing >>>> frame/frame operations column-wise is currently due to the consolidation in >>>> the end). And operations like adding a column, concatting (with axis=1) or >>>> merging dataframes will be much faster when no consolidation is needed. >>>> >>>> Personally, I am convinced that with some effort, we can get on-par or >>>> sometimes even better performance with 1D blocks compared to the >>>> performance we have now for those cases that 90+% of our users care about: >>>> >>>> - With limited effort optimizing the column-wise code paths in the >>>> internals, we can get a long way. >>>> - After that, if needed, we can still consider if parts of the >>>> internals could be cythonized to further improve certain bottlenecks (and >>>> actually cythonizing this will also be simpler for a simpler >>>> non-consolidating block manager). >>>> >>>> >>>> *Possibility to get better copy/view semantics* >>>> >>>> Pandas is badly known for how much it copies ("you need 10x the memory >>>> available as the size of your dataframe"), and having 1D blocks will allow >>>> us to address part of those concerns. >>>> >>>> *No consolidation = less copying.* Regularly consolidating introduces >>>> copies, and thus removing consolidation will mean less copies. For example, >>>> this would enable that you can actually add a single column to a dataframe >>>> without having to copy to the full dataframe. >>>> >>>> *Copy / view semantics* Recently there has been discussion again >>>> around whether selecting columns should be a copy or a view, and some other >>>> issues were opened with questions about views/copies when slicing columns. >>>> In the consolidated 2D block layout this will always be inherently messy, >>>> and unpredictable (meaning: depending on the actual block layout, which >>>> means in practice unpredictable for the user unaware of the block layout). >>>> Going with a non-consolidated BlockManager should at least allow us to >>>> get better / more understandable semantics around this. >>>> >>>> >>>> ------------------------------ >>>> >>>> *So what are the reasons to have 2D blocks?* >>>> >>>> I personally don't directly see reasons to have 2D blocks *for pandas >>>> itself* (apart from performance in certain row-wise use cases, and >>>> except for the fact that we have "always done it like this"). 
But quite >>>> likely I am missing reasons, so please bring them up. >>>> >>>> But I think there are certainly use cases where 2D blocks can be >>>> useful, but typically "external" (but nonetheless important) use cases: >>>> conversion to/from numpy, xarray, etc. A typical example that has recently >>>> come up is scikit-learn, where they want to have a cheap dataframe <-> >>>> numpy array roundtrip for use in their pipelines. >>>> However, I personally think there are possible ways that we can still >>>> accommodate for those use cases, with some effort, while still having 1D >>>> Blocks in pandas itself. So IMO this is not sufficient to warrant the >>>> complexity of 2D blocks in pandas. >>>> (but will stop here, as this mail is getting already long ..). >>>> >>>> Joris >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 15:34:52 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 21:34:52 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 13:21, Tom Augspurger wrote: > > On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> - We could make the DataFrame construction from a 2D array/matrix kind of >> "lazy" (or have an option to do it like this): upon construction just store >> the 2D array as is, and only once you perform an actual operation on it, >> convert to a columnar store. And that would make it possible to still get >> the 2D array back with zero-copy, if all you did was passing this DataFrame >> to the next step of the pipeline. >> >> I think the first option should be fairly easy to do, and should solve a >> large part of the concerns for scikit-learn (I think?). >> > > I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > > A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) In [2]: df._mgr_data Out[2]: (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), RangeIndex(start=0, stop=4, step=1), RangeIndex(start=0, stop=3, step=1)) And once you do something with the dataframe, such as printing or calculating something, the BlockManager gets only created at this step: In [3]: df Out[3]: Initializing !!! 
0 1 2 0 0.152972 -0.569205 0.554430 1 -1.099161 -1.163154 -1.510711 2 0.705185 -0.001530 1.542603 3 -0.460590 -0.385364 1.807601 In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) In [5]: df.mean() Initializing !!! Out[5]: 0 0.397243 1 0.269996 2 -0.454929 dtype: float64 There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what with pickling?, ...) But just to illustrate the idea. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Tue May 26 15:42:44 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 14:42:44 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Thanks for verifying the feasibility. Validation is a bit tricky, but I'd hope that we can delay everything except the splitting / forming of blocks. That may result in some non-obvious performance quirks, but at least of the simple case of `data` being an ndarray and index / columns not forcing any reindexing, I'm hopeful that it's not too bad. On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Tue, 26 May 2020 at 13:21, Tom Augspurger > wrote: > >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> - We could make the DataFrame construction from a 2D array/matrix kind >>> of "lazy" (or have an option to do it like this): upon construction just >>> store the 2D array as is, and only once you perform an actual operation on >>> it, convert to a columnar store. And that would make it possible to still >>> get the 2D array back with zero-copy, if all you did was passing this >>> DataFrame to the next step of the pipeline. >>> >>> I think the first option should be fairly easy to do, and should solve a >>> large part of the concerns for scikit-learn (I think?). >>> >> >> I think the first option would solve that use case for scikit-learn. It >> sounds feasible, but I'm not sure how easy it would be. >> >> > A quick, ugly proof-of-concept: > https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [2]: df._mgr_data > Out[2]: > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > RangeIndex(start=0, stop=4, step=1), > RangeIndex(start=0, stop=3, step=1)) > > And once you do something with the dataframe, such as printing or > calculating something, the BlockManager gets only created at this step: > > In [3]: df > Out[3]: Initializing !!! > > 0 1 2 > 0 0.152972 -0.569205 0.554430 > 1 -1.099161 -1.163154 -1.510711 > 2 0.705185 -0.001530 1.542603 > 3 -0.460590 -0.385364 1.807601 > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [5]: df.mean() > Initializing !!! 
> Out[5]: > 0 0.397243 > 1 0.269996 > 2 -0.454929 > dtype: float64 > > There are of course many things missing (validation of the input to > init_lazy, potentially being able to access df.index/df.columns without > initializing the block manager, hooking this up in __array__, what with > pickling?, ...) > But just to illustrate the idea. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 26 15:44:43 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 May 2020 21:44:43 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <51BBA340-82F5-4F65-A51E-6527893678E0@gmail.com> Message-ID: On Tue, 26 May 2020 at 16:14, Brock Mendel wrote: > >> Assuming we go down this path, do you have an idea of how we get from > here to there incrementally? i.e. presumably this wont just be one massive > PR > > [...] I would first like to focus on the "assuming we go down this > path" part. Let's discuss the pros and cons and trade-offs, and try to turn > assumptions in an agreed-upon roadmap. [...] > > I think understanding the difficulty/feasibility of the implementation is > a pretty important part of the pros/cons. > That's true. Personally I think there are enough options to do it to not have to worry about the "how" too much, but for sure it will be a lot of work to do it properly (so rather the "who is going to do this"). > Looking back at #10556, I'm wondering if we could disable _most_ > consolidation, e.g. only consolidate when making copies anyway, which might > be a never-break-views policy. From a user standpoint would that achieve > much/most of th benefits here? > That could certainly alleviate some of the drawbacks of the consolidated BlockManager regarding its copying behaviour (but not necessarily regarding the transparency / understandability of it, I would say). But for example for the "complexity of the internals" argument, I think this would rather make it worse. Now, you at least know (after ensuring consolidation) that you have only a single block for a certain dtype. Still having many, potentially-but-not-always consolidated 2D blocks will make it more difficult to optimize the situation of non-consolidated / 1D blocks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue May 26 15:48:39 2020 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 26 May 2020 14:48:39 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: Something to add here (in favor of removing the BM) -- and apologies if it's already mentioned in a different form: It is very, very difficult for third party code to construct heterogeneously-typed DataFrames without triggering a memory doubling. To give you an example what I mean, in Apache Arrow, we painstakingly implemented block consolidation in C++ [1] so that we can construct a DataFrame that won't suddenly double memory the first time that a user interacts with it. So the possibility of users having an OOM on their first interaction with an object they created is not great. If avoiding it for library developers were easy then perhaps it would be less of an issue, but avoiding the doubling requires advanced knowledge of pandas's internals. 
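As a concrete sketch of that doubling (behavior as observed with the consolidating BlockManager in the pandas 1.x line; the exact copying rules can vary between versions):

import numpy as np
import pandas as pd

a = np.arange(1_000_000, dtype="float64")
b = np.arange(1_000_000, dtype="float64")

# With a consolidating BlockManager the two float64 columns end up in a single
# 2D block, so the input arrays are copied at construction time: for a moment
# both the originals and the consolidated block are alive.
df = pd.DataFrame({"a": a, "b": b})

print(np.shares_memory(a, df["a"].to_numpy()))   # False: the data was copied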
Looking back 9-10 years, the primary motivations I had for creating the BlockManager in the first place don't persuade me anymore: * pandas's success was still very much coupled to vectorized operations on wide row-major data (e.g. as present in certain sectors of the financial industry). I don't think this represents the majority of pandas users now * In 2011 I was uncomfortable writing significant compiled code. Many of the performance issues that the BM tried to ameliorate are non-issues if you're OK writing non-trivial C/C++ code to deal with row-level interactions. Even if there were a 50% performance regression on some of these operations that are faster with 2D blocks because of row-major vs. column-major memory layout, that still seems worth it for the vast code simplification and the memory-use-predictability benefits that others have articulated already. - Wes [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche wrote: > > On Tue, 26 May 2020 at 13:21, Tom Augspurger wrote: >> >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche wrote: >>> >>> - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline. >>> >>> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). >> >> >> I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be. >> > > A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [2]: df._mgr_data > Out[2]: > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > RangeIndex(start=0, stop=4, step=1), > RangeIndex(start=0, stop=3, step=1)) > > And once you do something with the dataframe, such as printing or calculating something, the BlockManager gets only created at this step: > > In [3]: df > Out[3]: Initializing !!! > > 0 1 2 > 0 0.152972 -0.569205 0.554430 > 1 -1.099161 -1.163154 -1.510711 > 2 0.705185 -0.001530 1.542603 > 3 -0.460590 -0.385364 1.807601 > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3))) > > In [5]: df.mean() > Initializing !!! > Out[5]: > 0 0.397243 > 1 0.269996 > 2 -0.454929 > dtype: float64 > > There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what with pickling?, ...) > But just to illustrate the idea. 
> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From jbrockmendel at gmail.com Tue May 26 16:49:41 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Tue, 26 May 2020 13:49:41 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: > It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray: This is a neat proof of concept, but it cuts against the "decreases complexity" argument. Is there a viable way to quantify (even very roughly) the complexity effect of going all-1D? A couple ideas for ways to simplify this decision-making problem: 1) ATM there are a handful of places outside of core.internals where we call consolidate/consolidate_inplace. If we can refactor those away, we can focus on the BlockManager in (closer-to-)isolation. 2) IIUC going all-1D will cause column indexing to always return views. Elsewhere you have noted that this is a breaking API change which merited discussion in its own right. xref #33780 . My takeaway from this part of the last dev call was that people were generally positive on the all-views idea, but were wary of how to handle the potential deprecation. On Tue, May 26, 2020 at 12:49 PM Wes McKinney wrote: > Something to add here (in favor of removing the BM) -- and apologies > if it's already mentioned in a different form: > > It is very, very difficult for third party code to construct > heterogeneously-typed DataFrames without triggering a memory doubling. > To give you an example what I mean, in Apache Arrow, we painstakingly > implemented block consolidation in C++ [1] so that we can construct a > DataFrame that won't suddenly double memory the first time that a user > interacts with it. So the possibility of users having an OOM on their > first interaction with an object they created is not great. If > avoiding it for library developers were easy then perhaps it would be > less of an issue, but avoiding the doubling requires advanced > knowledge of pandas's internals. > > Looking back 9-10 years, the primary motivations I had for creating > the BlockManager in the first place don't persuade me anymore: > > * pandas's success was still very much coupled to vectorized > operations on wide row-major data (e.g. as present in certain sectors > of the financial industry). I don't think this represents the majority > of pandas users now > * In 2011 I was uncomfortable writing significant compiled code. Many > of the performance issues that the BM tried to ameliorate are > non-issues if you're OK writing non-trivial C/C++ code to deal with > row-level interactions. Even if there were a 50% performance > regression on some of these operations that are faster with 2D blocks > because of row-major vs. column-major memory layout, that still seems > worth it for the vast code simplification and the > memory-use-predictability benefits that others have articulated > already. 
> > - Wes > > [1]: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc > > On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche > wrote: > > > > On Tue, 26 May 2020 at 13:21, Tom Augspurger > wrote: > >> > >> > >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>> > >>> - We could make the DataFrame construction from a 2D array/matrix kind > of "lazy" (or have an option to do it like this): upon construction just > store the 2D array as is, and only once you perform an actual operation on > it, convert to a columnar store. And that would make it possible to still > get the 2D array back with zero-copy, if all you did was passing this > DataFrame to the next step of the pipeline. > >>> > >>> I think the first option should be fairly easy to do, and should solve > a large part of the concerns for scikit-learn (I think?). > >> > >> > >> I think the first option would solve that use case for scikit-learn. It > sounds feasible, but I'm not sure how easy it would be. > >> > > > > A quick, ugly proof-of-concept: > https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 > > > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > > > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > > > In [2]: df._mgr_data > > Out[2]: > > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], > > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], > > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], > > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), > > RangeIndex(start=0, stop=4, step=1), > > RangeIndex(start=0, stop=3, step=1)) > > > > And once you do something with the dataframe, such as printing or > calculating something, the BlockManager gets only created at this step: > > > > In [3]: df > > Out[3]: Initializing !!! > > > > 0 1 2 > > 0 0.152972 -0.569205 0.554430 > > 1 -1.099161 -1.163154 -1.510711 > > 2 0.705185 -0.001530 1.542603 > > 3 -0.460590 -0.385364 1.807601 > > > > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), > (pd.RangeIndex(4), pd.RangeIndex(3))) > > > > In [5]: df.mean() > > Initializing !!! > > Out[5]: > > 0 0.397243 > > 1 0.269996 > > 2 -0.454929 > > dtype: float64 > > > > There are of course many things missing (validation of the input to > init_lazy, potentially being able to access df.index/df.columns without > initializing the block manager, hooking this up in __array__, what with > pickling?, ...) > > But just to illustrate the idea. > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom.augspurger88 at gmail.com Tue May 26 16:58:17 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Tue, 26 May 2020 15:58:17 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, May 26, 2020 at 3:50 PM Brock Mendel wrote: > > It allows to create a "DataFrame" from an ndarray without creating a > BlockManager, and it allows accessing this original ndarray: > > This is a neat proof of concept, but it cuts against the "decreases > complexity" argument. Is there a viable way to quantify (even very > roughly) the complexity effect of going all-1D? > That complexity is at least localized to a single attribute. That's quite different from the 1D & 2D blocks situation, where many methods (though fewer than a year ago) need to be concerned with whether the array in a block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... > A couple ideas for ways to simplify this decision-making problem: > > 1) ATM there are a handful of places outside of core.internals where we > call consolidate/consolidate_inplace. If we can refactor those away, we > can focus on the BlockManager in (closer-to-)isolation. > If possible, isolating consolidation to `core.internals` sounds like a generally useful cleanup, regardless of whether we pursue the larger changes. > 2) IIUC going all-1D will cause column indexing to always return views. > Elsewhere you have noted that this is a breaking API change which merited > discussion in its own right. xref #33780 > . My takeaway from > this part of the last dev call was that people were generally positive on > the all-views idea, but were wary of how to handle the potential > deprecation. > This type of change would merit a major version bump. If possible, we'd ideally have some kind of option to disable consolidation / enable splitting, which would allow for users to test their code on older versions. > On Tue, May 26, 2020 at 12:49 PM Wes McKinney wrote: > >> Something to add here (in favor of removing the BM) -- and apologies >> if it's already mentioned in a different form: >> >> It is very, very difficult for third party code to construct >> heterogeneously-typed DataFrames without triggering a memory doubling. >> To give you an example what I mean, in Apache Arrow, we painstakingly >> implemented block consolidation in C++ [1] so that we can construct a >> DataFrame that won't suddenly double memory the first time that a user >> interacts with it. So the possibility of users having an OOM on their >> first interaction with an object they created is not great. If >> avoiding it for library developers were easy then perhaps it would be >> less of an issue, but avoiding the doubling requires advanced >> knowledge of pandas's internals. >> >> Looking back 9-10 years, the primary motivations I had for creating >> the BlockManager in the first place don't persuade me anymore: >> >> * pandas's success was still very much coupled to vectorized >> operations on wide row-major data (e.g. as present in certain sectors >> of the financial industry). I don't think this represents the majority >> of pandas users now >> * In 2011 I was uncomfortable writing significant compiled code. Many >> of the performance issues that the BM tried to ameliorate are >> non-issues if you're OK writing non-trivial C/C++ code to deal with >> row-level interactions. 
Even if there were a 50% performance >> regression on some of these operations that are faster with 2D blocks >> because of row-major vs. column-major memory layout, that still seems >> worth it for the vast code simplification and the >> memory-use-predictability benefits that others have articulated >> already. >> >> - Wes >> >> [1]: >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc >> >> On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche >> wrote: >> > >> > On Tue, 26 May 2020 at 13:21, Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> >> >> >> >> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> >> >>> - We could make the DataFrame construction from a 2D array/matrix >> kind of "lazy" (or have an option to do it like this): upon construction >> just store the 2D array as is, and only once you perform an actual >> operation on it, convert to a columnar store. And that would make it >> possible to still get the 2D array back with zero-copy, if all you did was >> passing this DataFrame to the next step of the pipeline. >> >>> >> >>> I think the first option should be fairly easy to do, and should >> solve a large part of the concerns for scikit-learn (I think?). >> >> >> >> >> >> I think the first option would solve that use case for scikit-learn. >> It sounds feasible, but I'm not sure how easy it would be. >> >> >> > >> > A quick, ugly proof-of-concept: >> https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188 >> > >> > It allows to create a "DataFrame" from an ndarray without creating a >> BlockManager, and it allows accessing this original ndarray: >> > >> > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), >> (pd.RangeIndex(4), pd.RangeIndex(3))) >> > >> > In [2]: df._mgr_data >> > Out[2]: >> > (array([[ 1.52971972e-01, -5.69204971e-01, 5.54430115e-01], >> > [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00], >> > [ 7.05185110e-01, -1.53009348e-03, 1.54260335e+00], >> > [-4.60590231e-01, -3.85364427e-01, 1.80760103e+00]]), >> > RangeIndex(start=0, stop=4, step=1), >> > RangeIndex(start=0, stop=3, step=1)) >> > >> > And once you do something with the dataframe, such as printing or >> calculating something, the BlockManager gets only created at this step: >> > >> > In [3]: df >> > Out[3]: Initializing !!! >> > >> > 0 1 2 >> > 0 0.152972 -0.569205 0.554430 >> > 1 -1.099161 -1.163154 -1.510711 >> > 2 0.705185 -0.001530 1.542603 >> > 3 -0.460590 -0.385364 1.807601 >> > >> > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), >> (pd.RangeIndex(4), pd.RangeIndex(3))) >> > >> > In [5]: df.mean() >> > Initializing !!! >> > Out[5]: >> > 0 0.397243 >> > 1 0.269996 >> > 2 -0.454929 >> > dtype: float64 >> > >> > There are of course many things missing (validation of the input to >> init_lazy, potentially being able to access df.index/df.columns without >> initializing the block manager, hooking this up in __array__, what with >> pickling?, ...) >> > But just to illustrate the idea. 
>> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 27 15:57:28 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 27 May 2020 21:57:28 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Tue, 26 May 2020 at 23:00, Tom Augspurger wrote: > > On Tue, May 26, 2020 at 3:50 PM Brock Mendel > wrote: > >> > It allows to create a "DataFrame" from an ndarray without creating a >> BlockManager, and it allows accessing this original ndarray: >> >> This is a neat proof of concept, but it cuts against the "decreases >> complexity" argument. Is there a viable way to quantify (even very >> roughly) the complexity effect of going all-1D? >> > > That complexity is at least localized to a single attribute. That's quite > different from the 1D & 2D blocks situation, where many methods (though > fewer than a year ago) need to be concerned with whether the array in a > block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... > > I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager. Furthermore: it's targeted to a very specific and limited use case (and eg also doesn't need to be the default, I think). Now, exactly quantifying the effect of going all-1D, that's of course hard. But just one example: all code that deals with blknos/blklocs (the mapping between the position in the consolidated blocks and the position in the dataframe), which is a significant part of managers.py, could be simplified considerably. But anyway: I think it clear that a BlockManager with only 1D arrays/blocks *can* be simpler as one with interleaved/consolidated blocks. But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it. > A couple ideas for ways to simplify this decision-making problem: >> > > >> 2) IIUC going all-1D will cause column indexing to always return views. >> Elsewhere you have noted that this is a breaking API change which merited >> discussion in its own right. xref #33780 >> . My takeaway from >> this part of the last dev call was that people were generally positive on >> the all-views idea, but were wary of how to handle the potential >> deprecation. >> > > This type of change would merit a major version bump. If possible, we'd > ideally have some kind of option to disable consolidation / enable > splitting, which would allow for users to test their code on older versions. > Yes, going to an all-1D-BlockManager would be something for a major version bump, eg pandas 2.0. So I think that is the perfect opportunity to do such a change of making column selections always views. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jbrockmendel at gmail.com Wed May 27 17:07:41 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Wed, 27 May 2020 14:07:41 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: > I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager Not on its own, no. But my prior is that this isn't the last thing that will merit its own special case. > I think it clear that a BlockManager with only 1D arrays/blocks *can* be simpler as one with interleaved/consolidated blocks. Absolutely agree. I've spent a big chunk of the last year dealing with BlockManager code and have no great love for it. > But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it. The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model. The main downside is potential performance degradation (at the extreme end e.g. 3000x for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction. I am looking for ways to quantify these tradeoffs so we can make an informed decision. On Wed, May 27, 2020 at 12:57 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Tue, 26 May 2020 at 23:00, Tom Augspurger > wrote: > >> >> On Tue, May 26, 2020 at 3:50 PM Brock Mendel >> wrote: >> >>> > It allows to create a "DataFrame" from an ndarray without creating a >>> BlockManager, and it allows accessing this original ndarray: >>> >>> This is a neat proof of concept, but it cuts against the "decreases >>> complexity" argument. Is there a viable way to quantify (even very >>> roughly) the complexity effect of going all-1D? >>> >> >> That complexity is at least localized to a single attribute. That's quite >> different from the 1D & 2D blocks situation, where many methods (though >> fewer than a year ago) need to be concerned with whether the array in a >> block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ... >> >> > I don't think this "lazy _mgr attribute" is comparable in complexity with > the consolidated BlockManager. Furthermore: it's targeted to a very > specific and limited use case (and eg also doesn't need to be the default, > I think). > Now, exactly quantifying the effect of going all-1D, that's of course > hard. But just one example: all code that deals with blknos/blklocs (the > mapping between the position in the consolidated blocks and the position in > the dataframe), which is a significant part of managers.py, could be > simplified considerably. > > But anyway: I think it clear that a BlockManager with only 1D > arrays/blocks *can* be simpler as one with interleaved/consolidated > blocks. But this is also only one of the arguments. Complexity alone is not > a reason to not do something; it's the general trade-off with what you gain > or lose with it. > > >> A couple ideas for ways to simplify this decision-making problem: >>> >> >> >>> 2) IIUC going all-1D will cause column indexing to always return views. >>> Elsewhere you have noted that this is a breaking API change which merited >>> discussion in its own right. xref #33780 >>> . 
My takeaway from >>> this part of the last dev call was that people were generally positive on >>> the all-views idea, but were wary of how to handle the potential >>> deprecation. >>> >> >> This type of change would merit a major version bump. If possible, we'd >> ideally have some kind of option to disable consolidation / enable >> splitting, which would allow for users to test their code on older versions. >> > > Yes, going to an all-1D-BlockManager would be something for a major > version bump, eg pandas 2.0. So I think that is the perfect opportunity to > do such a change of making column selections always views. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 27 17:15:32 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 27 May 2020 23:15:32 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Wed, 27 May 2020 at 23:07, Brock Mendel wrote: > > I don't think this "lazy _mgr attribute" is comparable in complexity > with the consolidated BlockManager > > Not on its own, no. But my prior is that this isn't the last thing that > will merit its own special case. > > > I think it clear that a BlockManager with only 1D arrays/blocks *can* be > simpler as one with interleaved/consolidated blocks. > > Absolutely agree. I've spent a big chunk of the last year dealing with > BlockManager code and have no great love for it. > > > But this is also only one of the arguments. Complexity alone is not a > reason to not do something; it's the general trade-off with what you gain > or lose with it. > > The main upsides I see are a) internal complexity reduction, b) downstream > library upsides, c) clearer view vs copy semantics, d) perf improvements > from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end > e.g. 3000x for > arithmetic). As Wes commented some of that can be ameliorated with > compiled code but that cuts against the complexity reduction. > That number is not correct. That was comparing the block-wise operation to a very inefficient convert-each-column-to-a-series operation. We can optimize this column-wise operation a lot (as I already did on master for some cases), and then a slowdown will still be present in such extreme cases, but *much* less. > > I am looking for ways to quantify these tradeoffs so we can make an > informed decision. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Fri May 29 12:37:19 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Fri, 29 May 2020 09:37:19 -0700 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 Message-ID: This is a discussion of what it would take to support non-nanosecond datetime64/timedelta64 dtypes and what decisions would need to be made along the way. The implementation would probably consist of: - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype - for timezone-related methods: - short-term: cast to nanosecond, use existing code, cast back to other unit - longer-term: update existing code to support non-nano units directly - comb through the code for all the places where we implicitly assume nano units and update - tests, so, so many tests We could then consider de-duplication. 
Tick is already redundant with Timedelta, and Timestamp[H] would render Period[H] redundant. With appropriate deprecation cycle, we could rip out a bunch of code. Another possibility is to try to upstream some code to numpy, which they have recently been receptive to (#16266 , #16363 , #16364 , #16352 , #16195 ). @rgommers tells me that trying to implement a tz-aware datetime64 dtype in numpy would be "folly, that way madness lies", but that it might be more feasible once @seberg's dtype refactor lands. More realistically short-term, if we convinced numpy to update NPY_DATETIMEUNIT to include the anchored quarter/year/week units we use for Period, we could condense a lot of confusing enum-like code. Tangentially related: with zoneinfo (PEP 615) we should consider making those our canonical tzinfos and converting any dateutil/pytz tzinfos we encounter to those. They are implemented in C, so I'm _hopeful_ we can make some of our vectorized tzconversion code unnecessary. @pganssle has suggested we implement our own tzinfos, but I'm holding out hope we can keep that upstream. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 29 13:34:01 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 29 May 2020 19:34:01 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: On Wed, 27 May 2020 at 23:07, Brock Mendel wrote: > > The main upsides I see are a) internal complexity reduction, b) downstream > library upsides, c) clearer view vs copy semantics, d) perf improvements > from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end > e.g. 3000x for > arithmetic). As Wes commented some of that can be ameliorated with > compiled code but that cuts against the complexity reduction. > > I am looking for ways to quantify these tradeoffs so we can make an > informed decision. > > Can you try to explain a bit more what kind of quantification you are looking for? - Complexity: I think we agree a non-consolidating block manager *can* be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number. - Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this probably also something that might only show effect after a refactor / implementation of new semantics. - Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 a few days ago). However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations. I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. 
My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations. But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Fri May 29 15:03:27 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Fri, 29 May 2020 14:03:27 -0500 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: Thanks for the update. On Fri, May 29, 2020 at 11:37 AM Brock Mendel wrote: > This is a discussion of what it would take to support non-nanosecond > datetime64/timedelta64 dtypes and what decisions would need to be made > along the way. > > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back to > other unit > Will this cause issues if the original datetime isn't in the bounds of a ns-precision timestamp? > - longer-term: update existing code to support non-nano units directly > - comb through the code for all the places where we implicitly assume nano > units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant with > Timedelta, and Timestamp[H] would render Period[H] redundant. With > appropriate deprecation cycle, we could rip out a bunch of code. > What would the user facing changes that warrant deprecation? For me, `Period` represents a span of time. It would make sense to implement something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", freq="H")`. But something checking whether that timestamp is in a `Timestamp[H]` doesn't seem natural, since it represents a point in time rather than a span. > Another possibility is to try to upstream some code to numpy, which they > have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells me that > trying to implement a tz-aware datetime64 dtype in numpy would be "folly, > that way madness lies", but that it might be more feasible once @seberg's > dtype refactor lands. More realistically short-term, if we convinced numpy > to update NPY_DATETIMEUNIT to include the anchored quarter/year/week units > we use for Period, we could condense a lot of confusing enum-like code. > Great to see this being pushed upstream! > Tangentially related: with zoneinfo (PEP 615) we should consider making > those our canonical tzinfos and converting any dateutil/pytz tzinfos we > encounter to those. They are implemented in C, so I'm _hopeful_ we can > make some of our vectorized tzconversion code unnecessary. @pganssle has > suggested we implement our own tzinfos, but I'm holding out hope we can > keep that upstream. > I'd be happy to see this as well, though implementing it in a way that's compatible with older Pythons seems a bit tricky. Perhaps we get the building blocks in place and then require it once we require Python 3.10+? 
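For reference, zoneinfo is in the standard library from Python 3.9, and the backport mentioned below covers older versions, so the building blocks can already be used today behind a conditional import (a sketch, not actual pandas code):

try:
    from zoneinfo import ZoneInfo  # standard library on Python 3.9+
except ImportError:
    from backports.zoneinfo import ZoneInfo  # PEP 615 backport for older Pythons

from datetime import datetime

dt = datetime(2020, 1, 1, tzinfo=ZoneInfo("America/New_York"))
print(dt.utcoffset())  # -1 day, 19:00:00, i.e. UTC-05:00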
> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul at ganssle.io Fri May 29 15:31:54 2020 From: paul at ganssle.io (Paul Ganssle) Date: Fri, 29 May 2020 15:31:54 -0400 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: <28577046-9ff1-1cd9-d64a-401962139200@ganssle.io> > Tangentially related: with zoneinfo (PEP 615) we should consider > making those our canonical tzinfos and converting any > dateutil/pytz tzinfos we encounter to those.? They are implemented > in C, so I'm _hopeful_ we can make some of our vectorized > tzconversion?code unnecessary.??@pganssle has suggested we > implement our own tzinfos, but I'm holding out hope we can keep > that upstream. > > > I'd be happy to see this as well, though implementing it in a way > that's compatible with older Pythons seems a bit tricky. Perhaps we > get the building blocks in place and then require it once we require > Python 3.10+? The reference implementation for PEP 615 has been converted to a backport for Python 3.6+ , so as long as you're willing to take on the dependency on the backport (which depends only on things that are in the standard library in Python 3.8+, but has transitive dependencies on backports on Python 3.6 and 3.7), you can just use that. The trickier thing, to me, is that there is a somewhat contrived workflow but definitely not an /implausible/ one, that would be broken by switching away from pytz. If someone constructs an aware Timestamp/Series/etc, then uses the tz attribute to get a time zone they can use for other stuff, they should currently be using the `localize`/`normalize` functions, like so: >>> from datetime import datetime >>> import pandas as pd >>> ts = pd.Timestamp("2020-01-01", tz="America/New_York") >>> ts.tz >>> ts.tz.localize(datetime.now()) datetime.datetime(2020, 5, 29, 15, 16, 56, 376299, tzinfo=) This is a pytz-specific idiom and won't work for zoneinfo or dateutil zones, but it may inadvertently be part of your public API, so it's up to you whether to consider it part of the public interface. In that case, I think the decision should be between a hard break and having `.tz` return a wrapper class that tries to more or less do what `pytz` does if you call `localize`/`normalize` with it. Best, Paul P.S. To clarify my position on "you should implement your own tzinfos": I think you should /start/ with adding support for generic time zones (not digging around into the internals to try to get speed improvements as is currently done) and see if zoneinfo is dramatically slower. If it is and you care about that, then a custom vectorized time zone should be the way to go. On the plus side, it's easy to test these things that are "rewritten for performance reasons only" using property testing; with the right set of property tests you can still get some of the "with enough eyes all bugs are shallow" benefits of using a standard library module. On 5/29/20 3:03 PM, Tom Augspurger wrote: > Thanks for the update. > > On Fri, May 29, 2020 at 11:37 AM Brock Mendel > wrote: > > This is a discussion of what it would take to support > non-nanosecond datetime64/timedelta64 dtypes and what decisions > would need to be made along the way. 
> > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back > to other unit > > > Will this cause issues if the original datetime isn't in the bounds of > a ns-precision timestamp? > > > - longer-term: update existing code to support non-nano units > directly > - comb through the code for all the places where we implicitly > assume nano units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant > with Timedelta, and Timestamp[H] would render Period[H] > redundant. With appropriate deprecation cycle, we could rip out a > bunch of code. > > > What would the user facing changes that warrant deprecation? For me, > `Period` represents a span of time. It would make sense to implement > something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", > freq="H")`. But something checking whether that timestamp is in a > `Timestamp[H]` doesn't seem natural, since it represents a point in > time rather than a span. > > > Another possibility is to try to upstream some code to numpy, > which they have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells > me that trying to implement a tz-aware datetime64 dtype in numpy > would be "folly, that way madness lies", but that it might be more > feasible once @seberg's dtype refactor lands. More realistically > short-term, if we convinced numpy to update NPY_DATETIMEUNIT to > include the anchored quarter/year/week units > we use for Period, we could condense a lot of confusing enum-like code. > > > Great to see this being pushed upstream! > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: From maartenb at xs4all.nl Fri May 29 14:31:44 2020 From: maartenb at xs4all.nl (Maarten Ballintijn) Date: Fri, 29 May 2020 14:31:44 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Hi Joris, You said: > But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other areas where storing data for 1000's of elements (sensors, items, people) on grid of time scales of minutes or more. (n*1000 x m*1000 data with n, m ~ 10 .. 100) Why do you think this use case is no longer important? We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to improve in this area not slide back.
Have a great weekend, Maarten > On May 29, 2020, at 1:34 PM, Joris Van den Bossche wrote: > > On Wed, 27 May 2020 at 23:07, Brock Mendel > wrote: > > The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model. > > The main downside is potential performance degradation (at the extreme end e.g. 3000x for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction. > > I am looking for ways to quantify these tradeoffs so we can make an informed decision. > > Can you try to explain a bit more what kind of quantification you are looking for? > > - Complexity: I think we agree a non-consolidating block manager can be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number. > - Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this probably also something that might only show effect after a refactor / implementation of new semantics. > - Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 a few days ago). > > However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations. > > I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. > My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations. > > But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). > > Joris > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Sat May 30 15:03:41 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 30 May 2020 21:03:41 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: Hi Maarten, Thanks a lot for the feedback! 
On Fri, 29 May 2020 at 20:31, Maarten Ballintijn wrote: > > Hi Joris, > > You said: > > But I also deliberately choose a dataframe where n_rows >> n_columns, > because I personally would be fine if operations on wide dataframes (n_rows > < n_columns) show a slowdown. But that is of course something to discuss / > agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we > care about a performance degradation?). > > > This is an (the) important use case for us and probably for a lot of use > in finance in general. I can easily imagine many other > areas where storing data for 1000?s of elements (sensors, items, people) > on grid of time scales of minutes or more. > (n*1000 x m*1000 data with n, m ~ 10 .. 100) > > Why do you think this use case is no longer important? > To be clear up front: I think wide dataframes are still an important use case. But to put my comment from above in more context: we had a performance regression reported (#24990 , which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. And yes, for *such* a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, *if* it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc. But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook for some quick experiments). And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having a simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case. Now, it is always difficult to make such claims in the abstract. So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with (or using real data, if openly available, and a few typical operations you do on those). Best, Joris > > We already have to drop into numpy on occasion to make the performance > sufficient. I would really prefer for Pandas to > improve in this area not slide back. > > Have a great weekend, > Maarten > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Sat May 30 15:17:56 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 30 May 2020 21:17:56 +0200 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: Thanks for starting this discussion, Brock! On Fri, 29 May 2020 at 21:03, Tom Augspurger wrote: > On Fri, May 29, 2020 at 11:37 AM Brock Mendel > wrote: > >> >> We could then consider de-duplication. Tick is already redundant with >> Timedelta, and Timestamp[H] would render Period[H] redundant. With >> appropriate deprecation cycle, we could rip out a bunch of code. >> > > What would the user facing changes that warrant deprecation? For me, > `Period` represents a span of time. 
It would make sense to implement > something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", > freq="H")`. But something checking whether that timestamp is in a > `Timestamp[H]` doesn't seem natural, since it represents a point in time > rather than a span. > > Personally, I don't think we necessarily need to add all units that are supported by numpy's datetime64/timedelta64 dtypes. First, because I don't think it is an important use case (people mostly want to be able to have dates outside of the range limits that nanosecond resolution gives us), and also because it makes it conceptually a lot more difficult. For example, what is a "Timestamp[H]" value? Does it represent the beginning or the end of the hour? Those are questions that are already handled by the Period dtype, and I think it is a good thing to keep those concepts separated (you can of course ask the same question with a millisecond resolution, but I think generally people don't do that). Further, all the resolutions from nanosecond up to second are "just" multiplications x1000, keeping the implementation simpler (compared to resolutions of hours, months, ..). So for a timestamp dtype, we could maybe only support ns / µs / ms / s resolutions? -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Sat May 30 17:54:59 2020 From: adrin.jalali at gmail.com (Adrin) Date: Sat, 30 May 2020 23:54:59 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: Although 1 x 5000 may sound like an edge case, my whole 4 years of research was on 500 x 450000 data. Those use cases are probably more common than we may think. On Sat., May 30, 2020, 21:03 Joris Van den Bossche, < jorisvandenbossche at gmail.com> wrote: > Hi Maarten, > > Thanks a lot for the feedback! > > On Fri, 29 May 2020 at 20:31, Maarten Ballintijn > wrote: > >> >> Hi Joris, >> >> You said: >> >> But I also deliberately choose a dataframe where n_rows >> n_columns, >> because I personally would be fine if operations on wide dataframes (n_rows >> < n_columns) show a slowdown. But that is of course something to discuss / >> agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we >> care about a performance degradation?). >> >> >> This is an (the) important use case for us and probably for a lot of use >> in finance in general. I can easily imagine many other >> areas where storing data for 1000's of elements (sensors, items, people) >> on grid of time scales of minutes or more. >> (n*1000 x m*1000 data with n, m ~ 10 .. 100) >> >> Why do you think this use case is no longer important? >> > > To be clear up front: I think wide dataframes are still an important use > case. > > But to put my comment from above in more context: we had a performance > regression reported (#24990 > , which Brock > referenced in his last mail) which was about a DataFrame with 1 row and > 5000 columns. > And yes, for *such* a case, I think it will basically be impossible to > preserve exact performance, even with a lot of optimizations, compared to > storing this as a single, consolidated (1, 5000) array as is done now.
And > it is for such a case, that I indeed say: I am willing to accept a limited > slowdown for this, *if* it at the same time gives us improved memory > usage, performance improvements for more common cases, simplified internals > making it easier to contribute to and further optimize pandas, etc. > > But, I am also quite convinced that, with some optimization effort, we can > at least preserve the current performance even for relatively wide > dataframes (see eg this > > notebook > > for some quick experiments). > And to be clear: doing such optimizations to ensure good performance for a > variety of use cases is part of the proposal. Also, I think that having a > simplified pandas internals should actually also make it easier to further > explore ways to specifically optimize the "homogeneous-dtype wide > dataframe" use case. > > Now, it is always difficult to make such claims in the abstract. > So what I personally think would be very valuable, is if you could give > some example use cases that you care about (eg a notebook creating some > dummy data with similar characteristics as the data you are working with > (or using real data, if openly available, and a few typical operations you > do on those). > > Best, > Joris > > >> >> We already have to drop into numpy on occasion to make the performance >> sufficient. I would really prefer for Pandas to >> improve in this area not slide back. >> >> Have a great weekend, >> Maarten >> >> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian at sipsolutions.net Sat May 30 18:39:07 2020 From: sebastian at sipsolutions.net (Sebastian Berg) Date: Sat, 30 May 2020 17:39:07 -0500 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: (sfid-20200529_183734_278427_78B1B51E) References: (sfid-20200529_183734_278427_78B1B51E) Message-ID: On Fri, 2020-05-29 at 09:37 -0700, Brock Mendel wrote: > This is a discussion of what it would take to support non-nanosecond > datetime64/timedelta64 dtypes and what decisions would need to be > made > along the way. > > The implementation would probably consist of: > - add a NPY_DATETIMEUNIT attribute to Timestamp and Datetime64TZDtype > - for timezone-related methods: > - short-term: cast to nanosecond, use existing code, cast back to > other > unit > - longer-term: update existing code to support non-nano units > directly > - comb through the code for all the places where we implicitly assume > nano > units and update > - tests, so, so many tests > > We could then consider de-duplication. Tick is already redundant with > Timedelta, and Timestamp[H] would render Period[H] redundant. With > appropriate deprecation cycle, we could rip out a bunch of code. > > Another possibility is to try to upstream some code to numpy, which > they > have recently been receptive to (#16266 > , #16363 > , #16364 > , #16352 > , > #16195 > ). @rgommers tells me > that > trying to implement a tz-aware datetime64 dtype in numpy would be > "folly, > that way madness lies", but that it might be more feasible once > @seberg's > dtype refactor lands. Timezones do seem like too much complexity to add to numpy. And with the dtype refactor, new dtypes should hopefully soon not actually be required to live within NumPy. The more likely discussion would be to go the opposite direction :).
Since: np.array([datetime.datetime(2019, 1, 1)]) gives an object array, NumPy datetimes should not have any long-term advantage over an externally developed datetime (except living in the prominent numpy namespace). Having a new datetime dtype external to NumPy and with tz-info indeed seems very desirable. And I would be happy to have you in the loop, so we could maybe even use it as an early test balloon by including it as a test in NumPy, with the idea to later cut it out as a stand-alone package. But that would be mostly useful if you are excited about getting a small head-start. In the end, it would likely help me/NumPy more than you in terms of time-investment. > More realistically short-term, if we convinced numpy > to update NPY_DATETIMEUNIT to include the anchored quarter/year/week > units > we use for Period, we could condense a lot of confusing enum-like > code. On first sight, that does sound reasonable and probably only depends on the complexity. If it does not increase numpy's code complexity too much (and obviously it decreases pandas' quite a bit more). I assume that this would mainly move some fairly straightforward and thoroughly tested code from pandas into NumPy? Can't say I am excited about reviewing datetime code, but upstreaming seems much better for the community than band-aids in pandas... - Sebastian > > Tangentially related: with zoneinfo (PEP 615) we should consider > making > those our canonical tzinfos and converting any dateutil/pytz tzinfos > we > encounter to those. They are implemented in C, so I'm _hopeful_ we > can > make some of our vectorized tzconversion code unnecessary. @pganssle > has > suggested we implement our own tzinfos, but I'm holding out hope we > can > keep that upstream. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part URL:
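To make the unit discussion above concrete, the following shows what plain NumPy already does today (a sketch, no pandas changes assumed); the short-term plan of casting to nanoseconds and back essentially wraps these conversions, and Tom's bounds question is exactly the case where the nanosecond cast runs out of range:

import numpy as np

# NumPy already supports several resolutions for datetime64.
arr = np.array(["1490-01-01", "2200-06-01"], dtype="datetime64[s]")
print(arr.dtype)  # datetime64[s]

# Values inside the ~584-year nanosecond range round-trip fine:
ns = arr[1:].astype("datetime64[ns]")
print(ns.astype("datetime64[s]"))  # ['2200-06-01T00:00:00']

# Values outside that range (here the year 1490) can overflow silently in
# NumPy when cast to nanoseconds, which is why pandas raises
# OutOfBoundsDatetime for such data today.
print(arr[:1].astype("datetime64[ns]"))  # wrapped value, no error raised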