[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Jeff Reback jeffreback at gmail.com
Mon Jan 11 19:19:29 EST 2016


Stephan

> Seaborn does use Series/DataFrame internally as first class data
> structures. But for xarray and statsmodels it is the other way around --
> pandas objects are accepted as input, but coerced into NumPy arrays
> internally for storage and manipulation. This presents issues for new types
> with metadata like categorical.
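
To make the quoted point concrete, here is a minimal sketch of what coercion to a NumPy array does to a categorical. The values and category names are made up for illustration, and this is not taken from the internals of xarray or statsmodels; it only shows the general effect of np.asarray on a categorical Series.

```python
import numpy as np
import pandas as pd

# A categorical Series carries metadata: the full set of categories
# (including ones that do not occur in the data) and their ordering.
s = pd.Series(
    pd.Categorical(
        ["low", "high", "low"],
        categories=["low", "mid", "high"],  # illustrative category names
        ordered=True,
    )
)

# Coercing to a plain NumPy array keeps the values only.
arr = np.asarray(s)

print(s.dtype)                  # the pandas categorical dtype
print(arr.dtype)                # a plain NumPy dtype
print(list(s.cat.categories))   # still knows about the unused "mid"
print(arr.tolist())             # just the observed values
```

After the coercion, the unused category and the ordering are no longer recoverable from arr alone, which is the kind of issue a library that stores NumPy arrays internally runs into.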



Care to elaborate on the xarray decision to keep data as NumPy arrays,
rather than Series, in DataArray (given that you do keep the Index objects
intact)?


On Mon, Jan 11, 2016 at 6:35 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>
>> I think the pandas efforts (and other libraries) can result in more
>> powerful fundamental libraries
>> that get pushed upstream. However, it would not benefit ANYONE to slow
>> down downstream efforts. I am not sure why you suggest that we WAIT for the
>> upstream libraries to change? We have been waiting forever for that. Now we
>> have a concrete implementation of certain data types that are useful. They
>> (upstream) can take
>> this and build on (or throw it away and make a better one or whatever).
>> But I don't think it benefits anyone to WAIT for someone to change numpy
>> first.
>> Look at how long it took them to (partially) fix datetimes.
>>
>
> I agree, it is insane to wait on upstream improvements to spontaneously
> happen on their own. We (interested downstream developers) would need to
> push them through. I started on this recently for making datetime64
> timezone naive (https://github.com/numpy/numpy/pull/6453) -- though of
> course, this is one of the easier issues.
>
> Of course, this being open source, my suggestions require someone
> interested in doing all the hard work. And given that that is not me,
> perhaps I should just shut up :).
>
> If the best we think we can realistically do is Wes writing our own data
> type system, then I'll be a little sad, but it would still be a win.
>
>
>> xarray in particular has done the same thing to pandas, e.g. you have
>> added additional selection operators and syntax (e.g. passing dicts of
>> named axes). These changes are in fact propagating to pandas. This has
>> taken time (but much, much less than it took for any of pandas' changes to
>> reach numpy). Further, look at how long you have advocated (correctly) for
>> labeled arrays in numpy (which we are still waiting for).
>>
>
> I'm actually not convinced NumPy needs labeled arrays. In my mind,
> libraries like pandas and xarray solve the labeled array problem very well
> downstream of NumPy. There are costs to making the basic libraries label
> aware.
>
>
>>> I'd like to see pandas itself focus more on the data-structures and less
>>> on the data types. This would let us share more work with the "general
>>> purpose array / scientific computing libraries".
>>>
>> Pandas IS about specifying the correct data types. It is simply
>> incorrect to decouple this problem from the data-structures. A lot of
>> effort over the years has gone into making all dtypes play nicely with
>> each other and within pandas.
>>
>
> Yes, a lot of effort has gone into dtypes in pandas. This is great! But
> wouldn't it be even better if we had a viable path for pushing this stuff
> upstream? ;)
>
>
>> Maybe, but DyND has to have full compat with what currently is out there
>> (soonish). Then I agree this could be possible. But wouldn't it be even
>> better for pandas to be able to swap back-ends? Why limit ourselves to a
>> particular backend if it's not that difficult?
>>
>
> Well, Irwin, what do you say? :)
>
> I'm just saying that in my ideal world, we would not invent a new dtype
> standard for pandas (insert obligatory xkcd reference here).
>
>> I disagree entirely here. I think that Series/DataFrame ARE becoming
>> primitive objects. Look at seaborn, statsmodels, and xarray. These are first
>> class users of these structures, who need the additional meta-data
>> attached.
>>
>
> Seaborn does use Series/DataFrame internally as first class data
> structures. But for xarray and statsmodels it is the other way around --
> pandas objects are accepted as input, but coerced into NumPy arrays
> internally for storage and manipulation. This presents issues for new types
> with metadata like categorical.
>
> Best,
> Stephan