[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Fri Dec 17 15:04:05 EST 2021

We have planned a video meeting about this topic for next Wednesday,
December 22, at 19:00 UTC.
The meeting has been added to the pandas development calendar visible at
https://pandas.pydata.org/docs/development/meeting.html, and the Zoom
meeting link is
https://us06web.zoom.us/j/81798190900?pwd=ZEo4SnlGMGZxZkVNRkpOLzg0dld3dz09

Joris

On Tue, 7 Dec 2021 at 19:01, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Another update on this topic: over the last few weeks I have been updating
> the status of this project (and fixing some regressions), and rerunning the
> benchmarks.
>
> You can find an overview of the results of our ASV benchmarks at
> https://github.com/pandas-dev/pandas/issues/39146#issuecomment-988002256.
> Some general points about those benchmark results:
>
> - The cases that show big slowdowns are mostly those where we do
> `df.values` or equivalent, i.e. converting the DataFrame to a single 2D
> array (`.values`, `to_numpy`, `transpose`, ..). Another subset of cases
> involves row-wise operations (reductions with axis=1, selecting a single row
> as a Series). I think those are the expected cases where a 1D-column store
> will always be slower.
> - Many of our ASV benchmarks use wide dataframes (e.g. an often-used shape
> is (1000, 1000), so a square dataframe). While it's of course important to
> cover this, I also think this is not the most common shape of dataframes,
> and in any case it gives a somewhat biased view.
> - Our ASV benchmarks are mostly micro-benchmarks, or at least benchmarks
> that in general take at most roughly 1 to 100 ms (by using small enough
> data to limit the runtime). While this is important to keep the benchmark
> suite usable, it also has the consequence that many of those benchmarks
> are partly or largely measuring "overhead" which doesn't necessarily grow
> with the data size (more rows). The ArrayManager will typically increase
> this overhead, but as long as this overhead stays in the "milliseconds"
> range, it does not necessarily have much influence on larger data
> workflows (depending on the exact workflow of course).
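
To make the first point above concrete, here is a minimal NumPy-only sketch
of why 2D conversion and row-wise access cost more with a 1D-column store.
It does not use pandas internals: the single 2D array `block` and the list
`columns` are simplified stand-ins (assumptions for illustration) for a
consolidated block and an ArrayManager-like list of 1D arrays, and the
actual memory layout inside pandas differs in its details.

```python
import numpy as np

n_rows, n_cols = 1_000, 4

# Stand-in for a consolidated block: all columns live in one 2D array.
block = np.arange(n_rows * n_cols, dtype="float64").reshape(n_rows, n_cols)

# Stand-in for a column store: one independent 1D array per column.
columns = [block[:, i].copy() for i in range(n_cols)]

# Equivalent of `df.values`: the 2D array already exists for the block,
# while the column store has to allocate a new array and copy every column.
values_from_block = block
values_from_columns = np.column_stack(columns)

# Selecting a single row: a view for the block, but one element lookup
# per column for the column store.
row_from_block = block[10]
row_from_columns = np.array([col[10] for col in columns])

# Reduction with axis=1: a single call for the block, versus a Python-level
# loop over the columns for the column store.
row_sums_block = block.sum(axis=1)
row_sums_columns = columns[0].copy()
for col in columns[1:]:
    row_sums_columns += col
```

Both layouts give the same results; the difference is only in how much
copying and per-column looping is needed to get there.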
>
> Overall, I find the results quite reassuring: they identify the cases
> where a slowdown is to be expected (and we will need to judge whether we
> find this acceptable), highlight some areas that can use improvement, and
> also show that many of the benchmarks are not (or not much) impacted.
> But I think it also shows that we will need to seek more real-world
> feedback, either by constructing some macro benchmarks, or by getting user
> feedback from their real-world workflows.
>
> For the first option (macro benchmarks), I quickly cleaned up and pushed
> an experiment I did over a year ago, which is to run one query of one of
> the industry-standard benchmark suites (TPC) using pandas (
> https://nbviewer.org/github/jorisvandenbossche/pandas-benchmarks/blob/main/tpc-ds/query-1.ipynb#Time-the-full-query).
> This shows basically no difference between BlockManager and ArrayManager.
> This is of course also only one single workflow (with narrow, long
> dataframes, doing mostly groupby and merge, and the overall time is
> dominated by e.g. the factorize algos, which aren't affected by the
> dataframe layout), but this is something we could maybe expand with other
> benchmark cases.
>
> ---
>
> We now have a prototype implementation people can experiment with, and we
> have an overview of ASV benchmark results. Given this, I think it is a good
> point to discuss again how we want to move forward with this, and whether
> we want to communicate the _intent_ to make this the default in a future
> pandas version (emphasizing "intent", since it will always depend on the
> feedback we get).
>
> Joris
>
>
> On Wed, 7 Apr 2021 at 16:28, Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> And to give another update on this topic: the development branch of
>> pandas now contains an experimental version of this "columnar store" (using
>> an ArrayManager class instead of the BlockManager under the hood, which
>> stores the columns as a list of 1D arrays), which is almost
>> feature-complete (the biggest missing links are JSON and PyTables IO).
>>
>> At the moment, there is an option to enable it for experimenting with it
>> (not yet documented, as it might still see behaviour changes):
>>
>> import pandas as pd
>>
>> # set the default manager to ArrayManager
>> pd.options.mode.data_manager = "array"
>>
>> # when creating a DataFrame, you will now get one with an ArrayManager
>> # instead of BlockManager
>> df = pd.DataFrame(...)
>> df = pd.read_csv(...)
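
For anyone trying the option, the following is a small sketch of one way to
check which manager a DataFrame actually ended up with. It relies on the
private `_mgr` attribute, which is an implementation detail and may change;
the helper name `manager_name` is hypothetical. It also assumes the
`mode.data_manager` option may be deprecated or absent in other pandas
versions, so it falls back to the default manager in that case.

```python
import warnings

import pandas as pd


def manager_name(frame):
    # `_mgr` is private pandas API; used here for inspection only.
    return type(frame._mgr).__name__


data = {"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}

try:
    with warnings.catch_warnings():
        # Silence any deprecation warning tied to this experimental option.
        warnings.simplefilter("ignore")
        # Enable the ArrayManager only inside this block.
        with pd.option_context("mode.data_manager", "array"):
            df = pd.DataFrame(data)
except Exception:
    # Assumption: the option does not exist in this pandas version,
    # so build the DataFrame with whatever manager is the default.
    df = pd.DataFrame(data)

print(manager_name(df))
```

Using `pd.option_context` instead of assigning to `pd.options` keeps the
setting from leaking into the rest of the session.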
>>
>> There are still some remaining work items (more IO, ironing out some
>> known bugs/TODOs, checking performance), see
>> https://github.com/pandas-dev/pandas/issues/39146 to keep track of this.
>>
>> Best,
>> Joris
>>
>> On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>>
>>> On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>> On Thu, 11 Jun 2020 at 23:35, Brock Mendel <jbrockmendel at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > We actually *have* prototypes: the prototype of the split-policy
>>>>>> discussed
>>>>>>
>>>>>> AFAICT that is a 5 year old branch.  Is there a version of this based
>>>>>> off of master that you can show asv results for?
>>>>>>
>>>>> A correction here: that branch has been updated several times over
>>>>> the last 5 years, most recently two weeks ago when I started this thread,
>>>>> as I explained in the github issue comment I linked to:
>>>>> https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160
>>>>>
>>>>>
>>>>>> > Also, if performance is in the end the decisive criterion, I repeat
>>>>>> my earlier remark in this thread: we need to be clearer about what we want
>>>>>> / expect.
>>>>>>
>>>>>> In principle, this is pretty much exactly what the asvs are supposed
>>>>>> to represent.
>>>>>>
>>>>>
>>>>> Well, I am repeating myself .. but I already mentioned that I am not
>>>>> sure ASV is fully useful for this, as that requires a complete working
>>>>> replacement, which is IMO too much to ask for an initial prototype.
>>>>>
>>>>> But OK, the message is clear: we need a more concrete implementation /
>>>>> prototype. So let's put this discussion aside for a moment, and focus on
>>>>> that instead. I will try to look at that in the coming weeks, but any help
>>>>> is welcome (and I will try to get it running with ASV, or at least a part
>>>>> of it).
>>>>>
>>>>>
>>>> To come back to this: I cleaned up a proof-of-concept implementation
>>>> that I started after the discussion above, and put it in a PR to
>>>> view/discuss: https://github.com/pandas-dev/pandas/pull/36010
>>>>
>>>>
>>>
>>> Another follow-up: the proof-of-concept is now merged in the master
>>> branch, and I am currently working on making it more feature complete (see
>>> https://github.com/pandas-dev/pandas/issues/39146 for an overview issue)
>>>
>>> Joris
>>>
>>