[Pandas-dev] Pandas Sprint Recap

Matthew Rocklin mrocklin at gmail.com
Tue Jul 17 18:46:45 EDT 2018


Has Pandas ever done a user survey?

I would be curious to know the answer to the question "do you make heavy
use of the Pandas index" among users, and how that correlates with
different domain/industry.

On Tue, Jul 17, 2018 at 6:29 PM Stephan Hoyer <shoyer at gmail.com> wrote:

> On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston <me at pietrobattiston.it>
> wrote:
>
>> First, because labels/indexes are in my experience the main reason why
>> people come to pandas (another important reason is having multiple
>> dtypes in a single data structure, but numpy structured arrays also do
>> this).
>>
>
> Certainly the functionality of indexes is valuable (especially for some
> use-cases), but I don't think the particular way we expose them is optimal.
> In my experience, the need to call reset_index() or assign directly to
> .index or .columns is a frequent source of annoyance.
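
A small illustration of the reset_index() round trip mentioned above. The
column names and values are made up for the example; only groupby,
reset_index and the as_index flag are real pandas API.

```python
import pandas as pd

# Illustrative data; "city" and "sales" are assumed names.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "sales": [10, 20, 30]})

# groupby moves the grouping key into the index of the result ...
totals = df.groupby("city")["sales"].sum()

# ... so getting "city" back as an ordinary column takes an extra call.
tidy = totals.reset_index()

# Asking groupby to keep the key as a column avoids the round trip.
tidy_direct = df.groupby("city", as_index=False)["sales"].sum()
```

Both tidy and tidy_direct end up with plain "city" and "sales" columns; the
difference is only whether the key takes a detour through the index.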
>
>
>> Second because supporting a DataFrame with no index would be pretty
>> easy in the current codebase/API (e.g. "index=False").
>> I know it would break some code, but it would be wrong code anyway
>> (that is, code that doesn't decouple indexes from data storage).
>>
>> Third, because now that the default index is RangeIndex(n) (which a
>> user is free not to rely on in any way), and as long as broken code is
>> fixed (see above), a DataFrame with no index wouldn't really be
>> "simpler". It would mostly amount to deciding whether to show the index
>> or not when printing to screen/doing IO.
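
To make the point about the default index concrete, here is a short sketch
with a made-up one-column frame. It only shows that the default is a
RangeIndex and that output can already omit it; it is not a proposal for
how an "index=False" DataFrame would be implemented.

```python
import pandas as pd

# A frame built without an explicit index gets a RangeIndex by default.
df = pd.DataFrame({"a": [1, 2, 3]})
default_is_range = isinstance(df.index, pd.RangeIndex)

# Whether the index shows up is already a choice at output time.
with_index = df.to_csv()               # first column holds 0, 1, 2
without_index = df.to_csv(index=False)  # only the "a" column is written
```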
>>
>
> Sure, you *could* fix all this on top of the current pandas data model.
> But it would be quite a challenging effort, and the full pandas data model
> would remain quite complex.
>
> The current pandas data model looks something like this:
>
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>
> The data model I'd like to work with in the future for most use-cases
> involving tabular data is something closer to:
>
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
>
> Conveniently, this looks very similar to the data model of Arrow or R.
> Optional indexes would provide fast reverse lookup for some subset of
> dataframe columns.
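
A minimal sketch of the second data model above, in plain Python. The class
name Table and its methods are hypothetical and exist only for this
illustration; lists stand in for arrays and a dict of row positions stands
in for an Index. It is not pandas or Arrow code.

```python
from collections import OrderedDict, defaultdict


class Table:
    """Hypothetical stand-in for the proposed model; not a pandas class."""

    def __init__(self, data):
        # data: column name -> array (plain lists in this sketch)
        self.data = OrderedDict(data)
        # indexes: column name -> optional reverse lookup, empty by default
        self.indexes = OrderedDict()

    def set_index(self, column):
        # Build a value -> row positions mapping for one column.
        lookup = defaultdict(list)
        for pos, value in enumerate(self.data[column]):
            lookup[value].append(pos)
        self.indexes[column] = dict(lookup)

    def rows_where(self, column, value):
        if column in self.indexes:
            # Fast path: use the reverse lookup.
            positions = self.indexes[column].get(value, [])
        else:
            # No index on this column: fall back to a linear scan.
            positions = [i for i, v in enumerate(self.data[column]) if v == value]
        return [{name: col[i] for name, col in self.data.items()} for i in positions]


t = Table({"city": ["Oslo", "Rome", "Oslo"], "sales": [10, 20, 30]})
t.set_index("city")
indexed_hit = t.rows_where("city", "Oslo")   # served by the index
scanned_hit = t.rows_where("sales", 20)      # served by a scan
```

The index here is purely an accelerator: dropping it changes how fast a
lookup runs, not what the table contains or returns.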
>
> This pretty obviously could not support everything pandas can do today.
> For example, you couldn't have a hierarchical index for column names. But
> in my experience, you're better off working with "tidy data" anyways, as
> popularized in R's tidyverse.
>
>
>> But again, I fail to see a new "scope". I don't see an analysis of
>> which (share) of the current pandas problems (=issues) would be solved.
>>
>
> One way in which we have reduced pandas' scope recently is the proposed
> deprecation of Panel.
>
> This is an example of focusing pandas on tabular data rather than
> N-dimensional arrays.
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>