From achabot at enthought.com  Sun Feb 2 15:53:24 2020
From: achabot at enthought.com (Alexandre Chabot-Leclerc)
Date: Sun, 2 Feb 2020 14:53:24 -0600
Subject: [Pandas-dev] SciPy 2020 Tutorials - Call for Papers
Message-ID:

Dear pandas developers,

The SciPy tutorial committee is interested in including a tutorial on advanced pandas this year, and we would like to encourage you to submit for SciPy 2020. Tutorials will be presented July 6-7 and, as usual, the conference is in Austin.

In recognition of the effort required to plan and prepare a high-quality tutorial, we pay a stipend of $1,000 to each instructor (or team of instructors) for each half-day session they lead.

You may find additional information at the link below; the deadline for submission is February 11.

https://www.scipy2020.scipy.org/tutorials

You can find examples of successful submissions on GitHub:

https://github.com/scipy-conference/scipy-conference

Please feel free to email with any questions.

Kind regards,
SciPy 2020 Tutorial Chairs
Serah Njambi Rono
Alexandre Chabot-Leclerc
Mike Hearne


From jorisvandenbossche at gmail.com  Mon Feb 10 12:43:21 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 10 Feb 2020 18:43:21 +0100
Subject: [Pandas-dev] What could a pandas 2.0 look like?
Message-ID:

pandas 1.0 is out, so time to start thinking about 2.0 ;)

In principle, pandas 2.0 will just be one of the next releases, cut when we decide we want to clean up deprecations or make a few changes that are hard to deprecate (following our new versioning policy). But nonetheless, I think it is worth asking whether it can be something more than that, with more specific goals in mind*.

Last year I made the pd.NA proposal, which resulted in using that for the nullable integer, boolean and string dtypes. In the proposal, pd.NA was described as something that "can be used consistently across all data types". For me, the aspirational end goal of the proposal is to *actually* have this for *all* dtypes, but we never really discussed this aspect explicitly.

So, for me, a possible future pandas 2.0:

- Uses "nullable dtypes" by default (i.e. dtypes that use pd.NA as the missing value indicator). That means we add a nullable version of all other dtypes (as we already did for int, boolean, string). End goal: a single missing value indicator with the same behavior for all dtypes.
- If we add such nullable dtypes using the extension dtypes/array mechanism (so they can first be opt-in in 1.x), this could "automatically" lead to a simplification of the internals / BlockManager (another aspirational goal that has been discussed before but never became concrete). With all columns backed by extension dtypes, we would only be using 1D blocks (avoiding the thorny 1D/2D cases in the internals), which simplifies the memory model, consolidation, etc.

Do you think this is a desirable goal? And realistic? Other aspirational goals?

Best,
Joris

*Agreeing on goals doesn't mean it will happen; that's open source (or at least community-based open source). But I think goals can still be useful to guide efforts where possible, or to get traction for certain issues from contributors. And then we can still see if it gets done in 2.0, 3.0, 4.0 or never ;)
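For context, a minimal example of the nullable dtypes that already exist in 1.0 and of how pd.NA behaves (Int64 shown; the boolean and string dtypes work the same way):

```python
import pandas as pd

# Nullable extension dtype: pd.NA is the missing value indicator,
# regardless of the underlying type (pandas >= 1.0).
s = pd.Series([1, None, 3], dtype="Int64")
print(s)
# 0       1
# 1    <NA>
# 2       3
# dtype: Int64

print(s[1] is pd.NA)   # True
print(pd.NA + 1)       # <NA> -- NA propagates through arithmetic
print(pd.NA == pd.NA)  # <NA> -- and through comparisons
```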
From tom.augspurger88 at gmail.com  Tue Feb 11 10:02:57 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 11 Feb 2020 09:02:57 -0600
Subject: [Pandas-dev] Monthly Dev Meeting - February 2020
Message-ID:

Hi all,

The next monthly dev call is tomorrow (Wednesday, February 12th) at 18:00 UTC. All are welcome.

https://dev.pandas.io/docs/development/meeting.html

We'll use this Zoom link: https://zoom.us/j/942410248

The agenda is at https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing if you have topics to discuss.

Tom


From irv at princeton.com  Tue Feb 11 12:13:35 2020
From: irv at princeton.com (Irv Lustig)
Date: Tue, 11 Feb 2020 12:13:35 -0500
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

Joris:

Another aspirational goal for pandas 2.0 would be to clean up the API so that index names and column names are treated equivalently throughout. I created a meta-issue for this 6+ months ago:

https://github.com/pandas-dev/pandas/issues/27652

Dr-Irv

On Mon, Feb 10, 2020 at 12:43 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> [quoted message trimmed]
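One concrete example of the kind of asymmetry the meta-issue collects (an illustration chosen here, not taken from the issue itself):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}).set_index("a")

# The same labeled data needs different spellings depending on
# whether it lives in a column or in the index:
df["b"]                         # select by column name
df.index.get_level_values("a")  # select by index name
```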
From tom.augspurger88 at gmail.com  Wed Feb 12 18:28:48 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Wed, 12 Feb 2020 17:28:48 -0600
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

Thanks Joris.

This was discussed on the call today: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing. I'll try to summarize the discussion here.

On NA by default, there were a few concerns, none of which is likely a blocker. Things like the memory overhead of masks can be improved by making them optional (relatively easy) and possibly by using a bitmask (probably harder).

I wondered if this was blocked by the BlockManager being written in Python. This change would imply that blockwise ops become columnwise ops, so we'll have more overhead for some operations. Joris cited a bit of work he did to make this not too bad, at least for tables that are not too wide.

I also wondered whether this would be inappropriate as long as NA lives only in pandas, rather than being something understood by the entire scientific Python ecosystem. It's worth thinking about and seeing how the community reacts to NA. Probably not a blocker.

This would also imply creating a nullable float dtype and making our datelikes use NA rather than NaT too. That seemed to be generally OK, but wasn't discussed too much.

---

Other "2.0" topics included rethinking our dependencies. It's possible Arrow could be added. Going nullable by default would make Arrow a pretty attractive option for storing arrays, but we would need to consult our downstream dependencies (like xarray) and users about that.

Fixing __getitem__ was also discussed. That will take someone writing up a detailed proposal about specific changes and possible deprecation paths.

Tom

On Mon, Feb 10, 2020 at 11:43 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> [quoted message trimmed]
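On the memory-overhead point, a sketch of the masked layout under discussion. The `_data`/`_mask` attributes are pandas 1.0 internals, shown only for illustration; they are private and subject to change:

```python
import pandas as pd

arr = pd.array([1, None, 3], dtype="Int64")

# A masked array stores the values and the missing-value mask separately:
arr._data  # int64 values (the slot behind <NA> holds a placeholder)
arr._mask  # array([False,  True, False]) -- a NumPy bool mask,
           # one full byte per element; a bitmask (as in Arrow)
           # would cut that to one bit per element
```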
From jbrockmendel at gmail.com  Thu Feb 13 17:23:20 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Thu, 13 Feb 2020 14:23:20 -0800
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> This would also imply creating a nullable float dtype and making our
> datelikes use NA rather than NaT too. That seemed to be generally OK, but
> wasn't discussed too much.

My understanding of the discussion is that using a mask on top of datetimelike arrays would not _replace_ NaT, but supplement it with something semantically different. Replacing NaT with NA breaks arithmetic consistency, as has been discussed ad nauseam.

On Wed, Feb 12, 2020 at 3:29 PM Tom Augspurger wrote:

> [quoted message trimmed]
From tom.augspurger88 at gmail.com  Fri Feb 14 16:02:19 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 14 Feb 2020 15:02:19 -0600
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> Replacing NaT with NA breaks arithmetic consistency

This means the result dtype of a Series-and-scalar op, right? If so, it's worth deciding whether that's more valuable than consistency in the behavior of missing values in arithmetic and comparison operations.

On Thu, Feb 13, 2020 at 4:23 PM Brock Mendel wrote:

> [quoted message trimmed]
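A sketch of the dtype question being raised. The NaT line reflects current (pandas 1.0) behavior; the NA line is the open design question, not released behavior:

```python
import pandas as pd

dti = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"]))

# NaT is typed as datetime-like, so the result dtype is determined:
(dti - pd.NaT).dtype  # timedelta64[ns]

# A typeless NA gives no such signal: what should `dti - pd.NA` be?
# datetime64, timedelta64, or an error? The scalar alone cannot say.
```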
From garcia.marc at gmail.com  Mon Feb 17 05:12:50 2020
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 17 Feb 2020 10:12:50 +0000
Subject: [Pandas-dev] Domain and hosting
Message-ID:

We've got a bit of a mess at the moment with the pandas domain and hosting. I'll try to leave things in a more reasonable state, but there are some decisions pending.

My understanding from a thread on this list was that everybody was happy with using pandas.io, so we set up that domain for the new hosting, plus dev.pandas.io for the development version of the website (and the blog and anything published in our GitHub organization pages).

From the discussion in https://github.com/pandas-dev/pandas/issues/28528 it seems the preference is to keep the old pandas.pydata.org instead. Given that, I think we can get rid of dev.pandas.io and point pandas.pydata.org to the new server once it's ready.

For the website, I think the agreement is to update it with the latest version from master. So, no dev.pandas.io. The development (master) documentation can live at pandas.pydata.org/docs/dev/.

For the blog, I think the best option is to have the posts as pages on the website, in a blog/ directory, so we don't need to maintain it separately, and it has the look and feel of the website.

For the new hosting, I'll move everything (all old documentation versions) from the current server to the new one, and then set up the website and the development docs to update automatically. Tom, can you give me access to the current web server so I can fetch the data, please?

Once everything is working on the new server, we'll be able to see it at pandas.io, and when we're happy we can change the domain pandas.pydata.org to point to it and disable pandas.io.

Please let me know if there are objections to any of the above; otherwise I'll move forward.

Cheers!


From jorisvandenbossche at gmail.com  Mon Feb 17 06:36:54 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 17 Feb 2020 12:36:54 +0100
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> > This would also imply creating a nullable float dtype and making our
> > datelikes use NA rather than NaT too. That seemed to be generally OK,
> > but wasn't discussed too much.
> My understanding of the discussion is that using a mask on top of
> datetimelike arrays would not _replace_ NaT, but supplement it with
> something semantically different.

Yes. If we see it similarly to NaN for floats (where NaN is a specific float value in the data array, while NAs are tracked in the mask array), then we can do something similar for datetimelike arrays. And the same discussions we are going to have for float dtypes (to what extent to distinguish NaN from NA, and whether we need to provide options) will also be relevant for datetimelike dtypes (but then for NaT and NA).

But note that in practice, I *think* the big majority of use cases will mostly have NA and not NaT in the data (e.g. when reading from files that have missing data).

> Replacing NaT with NA breaks arithmetic consistency, as has been
> discussed ad nauseam.

It's not fully clear to me what you want to say with this, so a more detailed clarification is welcome (I mean, I understand the sentence and remember the discussion, but I don't fully understand the point being made in context, or in what direction you think more discussion is needed).

Assume we introduce a new "nullable datetime" dtype that uses a mask to track NAs and can still have NaT in the values. In practice, this still means that we "replace NaT with NA" (because even though NaT is still possible, I think you would mostly get NAs, as mentioned above; e.g. reading a file would now give NA instead of NaT).

So do you mean "in my opinion, we should not do this" (what I just described above), because in practice it would mean breaking arithmetic consistency? Or that if we want to start using NA for datetimelike dtypes, you think "dtype-parametrized" NA values are necessary (so you can distinguish NA[datetime] and NA[timedelta])?

Joris
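A sketch of the data/mask layout described above. pandas 1.0 has no nullable float or datetime dtype, so the arrays below are a hypothetical illustration of where each sentinel would live:

```python
import numpy as np

# Hypothetical masked float column, illustrating the proposal:
data = np.array([1.5, np.nan, 3.0])    # NaN is a real float *value* in the data
mask = np.array([False, False, True])  # NA lives here, not in the data

# data[1] stays a visible NaN (e.g. the result of 0/0), while
# mask[2] marks position 2 as missing (NA), whatever data[2] holds.
```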
From andy.terrel at gmail.com  Mon Feb 17 08:09:16 2020
From: andy.terrel at gmail.com (Andy Ray Terrel)
Date: Mon, 17 Feb 2020 07:09:16 -0600
Subject: [Pandas-dev] Domain and hosting
In-Reply-To:
References:
Message-ID:

On Mon, Feb 17, 2020 at 4:13 AM Marc Garcia wrote:

> [quoted message trimmed]
> For the new hosting, I'll move everything (all old documentation
> versions) from the current server to the new one, and then set up the
> website and the development docs to update automatically. Tom, can you
> give me access to the current web server so I can fetch the data, please?

Marc, send me your ssh key and preferred login, and I can get you access. NumFOCUS just got a deal with AWS, and we have the Rackspace servers for another two years, so unless you are just dying to pay fees, let me get you free servers.


From jbrockmendel at gmail.com  Mon Feb 17 11:25:26 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Mon, 17 Feb 2020 08:25:26 -0800
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> It's not fully clear to me what you want to say with this, so a more
> detailed clarification is welcome (I mean, I understand the sentence and
> remember the discussion, but I don't fully understand the point being
> made in context, or in what direction you think more discussion is
> needed).

I don't particularly think more discussion is needed, as this is a rehash of #28095, where this horse has already been beaten to death.

As Tom noted there, using pd.NA in places where we currently use NaT breaks the usual identity (one that we rely on A LOT):

```(array + array)[0].dtype <=> (array + array[0]).dtype```

(Yes, this holds only imperfectly for NaT, because NaT serves as both NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse in #28095.)

Also from #28095: ```Series[timedelta64] * pd.NaT``` unambiguously raises, but ```Series[timedelta64] * pd.NA``` could be timedelta64.

> Assume we introduce a new "nullable datetime" dtype that uses a mask to
> track NAs and can still have NaT in the values. In practice, this still
> means that we "replace NaT with NA"

This strikes me as contradictory.

> So do you mean "in my opinion, we should not do this" [...] Or that if
> we want to start using NA for datetimelike dtypes, you think
> "dtype-parametrized" NA values are necessary (so you can distinguish
> NA[datetime] and NA[timedelta])?

I think:

1) pd.NA solves an _actual_ problem, which is that we used to use np.nan in places (categorical, object) where np.nan was semantically misleading.
   a) What these have in common is that they are in general non-arithmetic dtypes.
   b) This is an improvement, and I'm glad you put in the effort to make it happen.
   c) Trying to shoe-horn pd.NA into cases where it is semantically misleading, based on the Highlander Principle, is counter-productive.

2) The "only one NA value is simpler" argument strikes me as a solution in search of a problem.
   a) All the more so if you want this to supplement np.nan/pd.NaT instead of replacing them.
   b) *The idea of replacing vs. supplementing needs to be made much more explicit/clear.*

3) The "dtype-parametrized" NA did come up in #28095, but I never advocated it.
   a) I am open to separating out a NaTimedelta (xref #24983) from pd.NaT, and don't particularly care what it is called.
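To make the consistency argument concrete, a sketch of the two behaviors cited above. The NaT lines reflect pandas 1.0 behavior as described in the thread; the pd.NA lines state the open question, not released behavior:

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["1 day", "2 days"]))

# The identity being relied on: operate-then-index and index-then-operate
# agree on the result dtype.
(td + td)[0]   # Timedelta scalar taken from a timedelta64[ns] result
td + td[0]     # timedelta64[ns] Series: same dtype either way

# NaT carries type information (datetime-or-timedelta), so
# timedelta * NaT is a well-defined error:
# td * pd.NaT  # raises TypeError

# A typeless NA gives no such signal: should td * pd.NA be a
# timedelta64 series, or an error? The scalar alone cannot say.
```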
From tom.augspurger88 at gmail.com  Mon Feb 17 11:33:52 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 17 Feb 2020 10:33:52 -0600
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> 2) The "only one NA value is simpler" argument strikes me as a solution
> in search of a problem.

I don't think that's correct. I think consistently propagating NA in comparison operations is a worthwhile goal.

On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel wrote:

> [quoted message trimmed]
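For reference, the propagation Tom points to is already the released pd.NA behavior in 1.0: comparisons return NA instead of False, and boolean operations follow three-valued (Kleene) logic, so "unknown" only propagates when the known operand cannot decide the answer:

```python
import pandas as pd

pd.NA > 1      # <NA> -- comparisons propagate, unlike np.nan (False)

pd.NA | True   # True  -- True wins regardless of the unknown operand
pd.NA | False  # <NA>
pd.NA & False  # False -- False wins regardless of the unknown operand
pd.NA & True   # <NA>
```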
From jbrockmendel at gmail.com  Mon Feb 17 12:50:38 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Mon, 17 Feb 2020 09:50:38 -0800
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> I think consistently propagating NA in comparison operations is a
> worthwhile goal.

That's an argument for having a three-valued bool-dtype, not for replacing all other NA-like values.

On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <tom.augspurger88 at gmail.com> wrote:

> [quoted message trimmed]
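The three-valued bool-dtype Brock mentions already exists as the nullable "boolean" dtype in 1.0; a quick illustration:

```python
import pandas as pd

s = pd.Series([True, False, None], dtype="boolean")  # three-valued bool dtype
s & True
# 0     True
# 1    False
# 2     <NA>
# dtype: boolean
```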
From tom.augspurger88 at gmail.com  Mon Feb 17 12:55:23 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 17 Feb 2020 11:55:23 -0600
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

Is NaT defined to be unequal in all comparisons, just like NaN? I think the goal of propagating NA requires either using NA or changing the behavior of NaT in comparisons to be like NA.

On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel wrote:

> [quoted message trimmed]
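On the question above: NaN's comparison rule comes from IEEE 754, and NaT currently mimics it, while pd.NA propagates instead. The current behavior, side by side:

```python
import numpy as np
import pandas as pd

np.nan == np.nan  # False -- IEEE 754: NaN is unequal to everything
pd.NaT == pd.NaT  # False -- NaT currently follows the NaN rule
pd.NaT >= pd.NaT  # False

pd.NA == pd.NA    # <NA> -- NA propagates instead of answering False
```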
From jbrockmendel at gmail.com  Mon Feb 17 12:58:28 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Mon, 17 Feb 2020 09:58:28 -0800
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

> or changing the behavior of NaT in comparisons to be like NA.

Pending the kinks being worked out of pd.NA, I have no problem with that.

On Mon, Feb 17, 2020 at 9:55 AM Tom Augspurger wrote:

> [quoted message trimmed]
From tom.augspurger88 at gmail.com  Mon Feb 17 13:06:27 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 17 Feb 2020 12:06:27 -0600
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To:
References:
Message-ID:

On Mon, Feb 17, 2020 at 11:58 AM Brock Mendel wrote:

> > or changing the behavior of NaT in comparisons to be like NA.
> > Pending the kinks being worked out of pd.NA, I have no problem with that. > You have no problem with changing the behavior of NaT, or changing to use pd.NA? Is changing the defined behavior of NaT even an option? Is it defined in a spec like NaN, or did NumPy just choose that behavior? Assuming NaT had NA-like behavior in comparisons, what's remaining arguments for keeping NaT? Preserving dtypes in scalar - array ops? Anything else? On Mon, Feb 17, 2020 at 9:55 AM Tom Augspurger > wrote: > >> Is NaT defined to be unequal in all comparisons, just like NaN? I think >> the goal of propagating NA >> requires either using NA or changing the behavior of NaT in comparisons >> to be like NA. >> >> On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel >> wrote: >> >>> > I think consistently propagating NA in comparison operations is a >>> worthwhile goal. >>> >>> That's an argument for having a three-valued bool-dtype, not for >>> replacing all other NA-like values. >>> >>> On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger < >>> tom.augspurger88 at gmail.com> wrote: >>> >>>> > 2) The "only one NA value is simpler" argument strikes me as a >>>> solution in search of a problem. >>>> >>>> I don't think that's correct. I think consistently propagating NA in >>>> comparison operations is a worthwhile goal. >>>> >>>> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel >>>> wrote: >>>> >>>>> > It's not fully clear to me what you want to say with this, so a more >>>>> detailed clarification is welcome (I mean, I understand the sentence and >>>>> remember the discussion, but don't fully understand the point being made in >>>>> context, or in what direction you think more discussion is needed). >>>>> >>>>> I don't particularly think more discussion is needed, as this is a >>>>> rehash of #28095, where this horse has already been beaten to death. >>>>> >>>>> As Tom noted here >>>>> , >>>>> using pd.NA in places where we currently use NaT breaks the usual identity >>>>> (that we rely on A LOT) >>>>> >>>>> ```(array + array)[0].dtype <=> (array + array[0]).dtype``` >>>>> >>>>> (Yes, this holds only imperfectly for NaT because NaT serves as both >>>>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse >>>>> in #28095.) >>>>> >>>>> Also from #28095: >>>>> >>>>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but >>>>> ```Series[timedelta64] * pd.NA``` could be timedelta64 >>>>> >>>>> > Assume we introduce a new "nullable datetime" dtype that uses a mask >>>>> to track NAs, and can still have NaT in the values. In practice, this still >>>>> means that we "replace NaT with NA" >>>>> >>>>> This strikes me as contradictory. >>>>> >>>>> > So do you mean: "in my opinion, we should not do this" (what I just >>>>> described above), because in practice that would mean breaking arithmetic >>>>> consistency? Or that if we want to start using NA for datetimelike dtypes, >>>>> you think "dtype-parametrized" NA values are necessary (so you can >>>>> distinguish NA[datetime] and NA[timedelta] ?) >>>>> >>>>> I think: >>>>> >>>>> 1) pd.NA solves an _actual_ problem which is that we used to use >>>>> np.nan in places (categorical, object) where np.nan was semantically >>>>> misleading. >>>>> a) What these have in common is that they are in general >>>>> non-arithmetic dtypes. >>>>> b) This is an improvement, and I'm glad you put in the effort to >>>>> make it happen. >>>>> c) Trying to shoe-horn pd.NA into cases where it is semantically >>>>> misleading based on the Highlander Principle is counter-productive. 
>>>>> >>>>> 2) The "only one NA value is simpler" argument strikes me as a >>>>> solution in search of a problem. >>>>> a) All the more so if you want this to supplement np.nan/pd.NaT >>>>> instead of replace them. >>>>> b) *the idea of replacing vs supplementing needs to be made much >>>>> more explicit/clear* >>>>> >>>>> 3) The "dtype-parametrized" NA did come up in #28095, but I never >>>>> advocated it. >>>>> a) I am open to separating out a NaTimedelta (xref #24983) from >>>>> pd.NaT, and don't particularly care what it is called. >>>>> >>>>> >>>>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche < >>>>> jorisvandenbossche at gmail.com> wrote: >>>>> >>>>>> > This would also imply creating a nullable float dtype and making >>>>>>> our datelikes use NA rather than NaT too. That seemed to be generally OK, >>>>>>> but wasn't discussed too much. >>>>>>> >>>>>>> My understanding of the discussion is that using a mask on top of >>>>>>> datetimelike arrays would not _replace_ NaT, but supplement it with >>>>>>> something semantically different. >>>>>>> >>>>>> >>>>>> Yes, if we see it similar as NaNs for floats (where NaN is a specific >>>>>> float value in the data array, while NAs are tracked in the mask array), >>>>>> then for datetimelike arrays we can do something similar. And the same >>>>>> discussions about to what extent to distinguish NaN and NA or whether we >>>>>> need to provide options that we are going to have for float dtypes, will >>>>>> also be relevant for datetimelike dtypes (but then for NaT and NA). >>>>>> >>>>>> But note that in practice, I *think* that the big majority of use >>>>>> cases will mostly use NA and not NaT in the data (eg when reading from >>>>>> files that have missing data). >>>>>> >>>>>> Replacing NaT with NA breaks arithmetic consistency, as has been >>>>>>> discussed ad nauseum. >>>>>>> >>>>>> >>>>>> It's not fully clear to me what you want to say with this, so a more >>>>>> detailed clarification is welcome (I mean, I understand the sentence and >>>>>> remember the discussion, but don't fully understand the point being made in >>>>>> context, or in what direction you think more discussion is needed). >>>>>> >>>>>> Assume we introduce a new "nullable datetime" dtype that uses a mask >>>>>> to track NAs, and can still have NaT in the values. In practice, this still >>>>>> means that we "replace NaT with NA" (because even though NaT is still >>>>>> possible, I think you would mostly get NAs as mentioned above; eg reading a >>>>>> file would now give NA instaed of NaT). >>>>>> So do you mean: "in my opinion, we should not do this" (what I just >>>>>> described above), because in practice that would mean breaking arithmetic >>>>>> consistency? Or that if we want to start using NA for datetimelike dtypes, >>>>>> you think "dtype-parametrized" NA values are necessary (so you can >>>>>> distinguish NA[datetime] and NA[timedelta] ?) >>>>>> >>>>>> Joris >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Pandas-dev mailing list >>>>>> Pandas-dev at python.org >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>>> >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... 
From garcia.marc at gmail.com Mon Feb 17 13:24:34 2020
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 17 Feb 2020 18:24:34 +0000
Subject: [Pandas-dev] Domain and hosting
In-Reply-To: References: Message-ID:

The OVH hosting is also free. I thought Rackspace had started sending
invoices now; I saw comments on Twitter from some other projects about it.

Part of the idea of moving to OVH was that they would provide us with the
Binder infrastructure if we make the examples in the docs runnable.

Not sure how we should move forward then. Should we simply disable or
redirect pandas.io, and set up the CI to update the website and the dev
docs there? The current server seems to work well enough, and it is
probably not worth moving things for now if we still have it for two more
years. Does anyone have a different idea or a preference?

On Mon, Feb 17, 2020 at 1:09 PM Andy Ray Terrel wrote:

> On Mon, Feb 17, 2020 at 4:13 AM Marc Garcia wrote:
>
>> We've got a bit of a mess at the moment with the pandas domain and
>> hosting. I'll try to leave things in a more reasonable way, but there
>> are some decisions pending.
>>
>> My understanding from a thread in this list was that everybody was happy
>> with using pandas.io, and we set up the domain for the new hosting, and
>> also dev.pandas.io for the development version of the website (and the
>> blog and anything published in our GitHub organization pages).
>>
>> From the discussion in https://github.com/pandas-dev/pandas/issues/28528
>> it seems like the preference is to keep the old pandas.pydata.org
>> instead. Given that, I think we can get rid of dev.pandas.io, and point
>> pandas.pydata.org to the new server once it's ready.
>>
>> For the website, I think the agreement is to update it with the latest
>> version from master. So, no dev.pandas.io. For the development (master)
>> documentation, I think it can live in pandas.pydata.org/docs/dev/.
>>
>> For the blog, I think the best option is to have the posts as pages on
>> the website, in a directory blog/, so we don't need to maintain it
>> separately, and it has the look and feel of the website.
>>
>> For the new hosting, I'll move everything (all old documentation
>> versions) from the current server to the new one, and then set up that
>> the website and the development docs are automatically updated. Tom, can
>> you give me access to the current web server so I can fetch the data
>> please?
>
> Marc, send me your ssh-key and preferred login, I can get you access.
> NumFOCUS just got a deal with AWS and we have the Rackspace servers for
> another two years, so unless you are just dying to pay fees, let me get
> you free servers.
>
>> Once everything is working in the new server, we'll be able to see it in
>> pandas.io, and when we're happy we can change the domain
>> pandas.pydata.org to point to it, and disable pandas.io.
>>
>> Please let me know if there are objections to any of the above,
>> otherwise I'll move forward.
>>
>> Cheers!

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From andy.terrel at gmail.com Mon Feb 17 14:16:35 2020
From: andy.terrel at gmail.com (Andy Ray Terrel)
Date: Mon, 17 Feb 2020 13:16:35 -0600
Subject: [Pandas-dev] Domain and hosting
In-Reply-To: References: Message-ID:

On Mon, Feb 17, 2020 at 12:24 PM Marc Garcia wrote:

> The OVH hosting is also free. I thought Rackspace had started sending
> invoices now; I saw comments on Twitter from some other projects about it.

NumFOCUS has negotiated a 2-year extension. The Twitter comments seem to be
mostly tied to personal accounts.

> Part of the idea of moving to OVH was that they would provide us with the
> Binder infrastructure if we make the examples in the docs runnable.

OVH is great, but as far as I know there is no formal relationship. Sorry
if there are details elsewhere that I've missed. Anywho, I just wanted to
make sure you know I had resources to help if needed.

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From tom.augspurger88 at gmail.com Tue Feb 18 12:20:01 2020
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Tue, 18 Feb 2020 11:20:01 -0600
Subject: [Pandas-dev] Fwd: What could a pandas 2.0 look like?
In-Reply-To: References: Message-ID:

(Accidentally dropped the mailing list)

On Mon, Feb 17, 2020 at 7:17 PM Brock Mendel wrote:

> > You have no problem with changing the behavior of NaT, or changing to
> > use pd.NA?
>
> If/when we get to a point where we propagate NAs in all other comparisons,
> I would have no problem with editing `NaT.__richcmp__` to match that
> convention.

What are the advantages of a NaT with NA-like comparison semantics over
using NA (or NA[datetime])?

1. Retain dtype in array - scalar ops with a scalar NA
2. ...
3. Less disruptive than changing to NA

My ... could include things like `isinstance(NaT, Timestamp)` being true
and `NaT.<attr>` for Timestamp attributes. But those don't strike me as
necessarily good things. They seem sometimes useful and sometimes harmful.

The downsides of changing NaT in comparison operations are

1. We're diverging from `np.NaT`. I don't know how problematic this
   actually is.
2. It's a special case. Should users need to know that datelikes use their
   own NA value because the underlying storage is able to store them
   "in-band" rather than as a mask? My gut reaction is "no, users shouldn't
   be exposed to this."
3. Changing NaT would leave just NaN with the "always unequal in
   comparisons" behavior.

Thus far, I see three options going forward:

1. Use NaN for floats, NaT for datelikes, NA for other.
   1-a: Leave NaT with always unequal
   1-b: Change NaT to have NA-like comparison behavior
2. Use NA everywhere (no NaN for float, no NaT for datelike)
3. Implement a typed `NA`, where we have an `NA` per dtype.

Option 3 I think solves the array - scalar op issue. It's more complex for
users, though hopefully not too complex? My biggest worry is that it makes
the implementation much more complex, though perhaps I'm being pessimistic.

On balance, I'm not sure where I come down yet. Good news: we can take time
to figure this out :)

On Mon, Feb 17, 2020 at 10:06 AM Tom Augspurger wrote:

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Wed Feb 19 17:46:37 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 19 Feb 2020 23:46:37 +0100
Subject: [Pandas-dev] What could a pandas 2.0 look like?
In-Reply-To: References: Message-ID:

Some answers to previous mails first:

On Mon, 17 Feb 2020 at 17:34, Tom Augspurger wrote:

> > 2) The "only one NA value is simpler" argument strikes me as a solution
> > in search of a problem.
>
> I don't think that's correct. I think consistently propagating NA in
> comparison operations is a worthwhile goal.

Having a single, consistent missing value indicator across all dtypes is
*for me* one of the main drivers that led me to make the pd.NA proposal.
From my personal experience (eg when teaching pandas to beginners), this is
an existing problem that complicates things, not one that is being
invented.

On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel wrote:

> > Assume we introduce a new "nullable datetime" dtype that uses a mask to
> > track NAs, and can still have NaT in the values. In practice, this
> > still means that we "replace NaT with NA"
>
> This strikes me as contradictory.

I tried to explain this in the next sentence from the original text:
"(because even though NaT is still possible, I think you would mostly get
NAs as mentioned above; eg reading a file would now give NA instead of
NaT)."

So assuming you have a masked-array approach for datetimes, you can have
NaT as a valid datetime value in the values part, or NA due to the mask
part of the array. In such a case (but this is only an assumption about how
the extension array *could* work!), it's the NA that is the main missing
value indicator. So if you are creating such a masked datetime-like array
with missing values (eg from reading a file), you will get NAs as missing
values, in contrast to NaTs right now. Hence, in practice we would "replace
NaT with NA", although you can still have NaT in the values.

Note I only started to explain this in response to your initial "using a
mask on top of datetimelike arrays would not _replace_ NaT, but supplement
it with something semantically different", but maybe I misunderstood your
initial comment.

> I think:
>
> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan
> in places (categorical, object) where np.nan was semantically misleading.
>     a) What these have in common is that they are in general
>        non-arithmetic dtypes.
>     b) This is an improvement, and I'm glad you put in the effort to make
>        it happen.
>     c) Trying to shoe-horn pd.NA into cases where it is semantically
>        misleading based on the Highlander Principle is counter-productive.

With "semantically misleading", I suppose you mean that "Series[Timedelta]
+ pd.NA" could result in either timedelta or datetime64? Personally, I
don't think this is a big problem (or at least I think a single pd.NA
brings bigger benefits), but this has indeed already been discussed.

> 2) The "only one NA value is simpler" argument strikes me as a solution
> in search of a problem.
>     a) All the more so if you want this to supplement np.nan/pd.NaT
>        instead of replace them.
>     b) *the idea of replacing vs supplementing needs to be made much more
>        explicit/clear*

I thought you were actually advocating for supplementing instead of
replacing in your first email ;) (but maybe you were rather trying to
summarize the discussion, and not giving an opinion?)

Anyway, I will open a GitHub issue for float dtypes to further discuss this
(I think the float case is easier to discuss, while the issues are similar
to datetime with NaT).

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
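To make the values-plus-mask layout concrete: the first half of the sketch below peeks at the existing nullable integer dtype (`_data` and `_mask` are private internals, shown purely to illustrate the layout); the second half is a hypothetical mock-up of the masked datetime idea, not an implemented pandas API.

```python
import numpy as np
import pandas as pd

# Existing nullable dtypes already store a payload array plus a mask:
arr = pd.array([1, None], dtype="Int64")
print(arr._data)  # payload; the value behind an NA slot is arbitrary
print(arr._mask)  # [False  True] -- True marks a missing (NA) entry

# A hypothetical masked datetime array would use the same layout. NaT could
# still appear in the values (a "real" not-a-time), while NA comes from the
# mask, so the two stay distinguishable:
values = np.array(["2020-01-01", "NaT", "2020-01-03"], dtype="datetime64[ns]")
mask = np.array([False, False, True])  # only the last entry is missing

is_na = mask                       # missing, in the masked-array sense
is_nat = np.isnat(values) & ~mask  # an actual NaT stored in the values
print(is_na)   # [False False  True]
print(is_nat)  # [False  True False]

# Ingestion paths like read_csv would set the mask, so in practice users
# would see NA where they see NaT today: "replace in practice", even
# though NaT remains representable.
```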
From jorisvandenbossche at gmail.com Wed Feb 19 17:55:27 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 19 Feb 2020 23:55:27 +0100
Subject: [Pandas-dev] Fwd: What could a pandas 2.0 look like?
In-Reply-To: References: Message-ID:

On Tue, 18 Feb 2020 at 18:20, Tom Augspurger wrote:

> [...]
>
> The downsides of changing NaT in comparison operations are
>
> 1. We're diverging from `np.NaT`. I don't know how problematic this
>    actually is.
> 2. It's a special case. Should users need to know that datelikes use
>    their own NA value because the underlying storage is able to store
>    them "in-band" rather than as a mask? My gut reaction is "no, users
>    shouldn't be exposed to this."
> 3. Changing NaT would leave just NaN with the "always unequal in
>    comparisons" behavior.

Personally, I think changing the behaviour of NaT in pandas, and thus
deviating from the behaviour of the same value in numpy, is not a good
idea. For me, that seems more confusing than having a clearly distinct
value (pd.NA) that has the different behaviour.

> Thus far, I see three options going forward:
>
> 1. Use NaN for floats, NaT for datelikes, NA for other.
>    1-a: Leave NaT with always unequal
>    1-b: Change NaT to have NA-like comparison behavior
> 2. Use NA everywhere (no NaN for float, no NaT for datelike)
> 3. Implement a typed `NA`, where we have an `NA` per dtype.
>
> Option 3 I think solves the array - scalar op issue. It's more complex
> for users, though hopefully not too complex? My biggest worry is that it
> makes the implementation much more complex, though perhaps I'm being
> pessimistic.
>
> On balance, I'm not sure where I come down yet. Good news: we can take
> time to figure this out :)

Thanks for the summary!
Personally, I don't like the first option *long term*, as it keeps
different missing values (eg NaN) with different behaviours for some dtypes
as the default, while I would like to see us moving to a consistent missing
value indicator.
And I think we can take a similar approach as we somewhat decided on in the
original discussion on pd.NA: let's start with a single pd.NA, and we can
see later if there is a need to make it typed.

Joris

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jbrockmendel at gmail.com Wed Feb 19 18:52:10 2020
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 19 Feb 2020 15:52:10 -0800
Subject: [Pandas-dev] Fwd: What could a pandas 2.0 look like?
In-Reply-To: References: Message-ID:

Pivoting: Joris, on the call you mentioned a TimestampArray. Can you expand
on that a bit?

On Wed, Feb 19, 2020 at 2:55 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Wed Feb 26 05:40:46 2020
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 26 Feb 2020 11:40:46 +0100
Subject: [Pandas-dev] Fwd: What could a pandas 2.0 look like?
In-Reply-To: References: Message-ID:

On Thu, 20 Feb 2020 at 00:52, Brock Mendel wrote:

> Pivoting: Joris, on the call you mentioned a TimestampArray. Can you
> expand on that a bit?
Basically what I mentioned before in this thread: a new ExtensionArray that
uses pd.NA as the missing value indicator instead of pd.NaT, and where the
NAs are potentially tracked in a mask (as is done for the nullable integer
dtypes). There are more things to discuss about it (like allowing more
resolutions? a single dtype for tz-naive/tz-aware? ...), but I will try to
open a separate discussion going more in depth about this shortly.

But the issue of NA vs NaT is somewhat similar to NA vs NaN, for which I
just opened an issue to discuss in more detail:
https://github.com/pandas-dev/pandas/issues/32265

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
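To give the TimestampArray idea described above a rough shape, here is a minimal, self-contained sketch: a datetime64 payload plus a boolean mask, surfacing pd.NA (not pd.NaT) as the missing-value scalar. The class name and methods are invented for illustration; a real implementation would be a full pandas ExtensionArray with the complete interface.

```python
import numpy as np
import pandas as pd

class MaskedTimestampArray:
    """Sketch of the concept only, NOT the actual pandas proposal:
    a datetime64 payload plus a boolean mask, where missing entries
    surface as pd.NA rather than pd.NaT."""

    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype="datetime64[ns]")
        self.mask = np.asarray(mask, dtype=bool)

    @classmethod
    def from_sequence(cls, data):
        # None becomes NA (mask=True); everything else lives in the values,
        # so an explicit NaT would remain representable alongside NA.
        values = np.array(
            [d if d is not None else "NaT" for d in data],
            dtype="datetime64[ns]",
        )
        mask = np.array([d is None for d in data])
        return cls(values, mask)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        if self.mask[i]:
            return pd.NA                     # NA comes from the mask...
        return pd.Timestamp(self.values[i])  # ...everything else from values

    def isna(self):
        return self.mask.copy()

arr = MaskedTimestampArray.from_sequence(["2020-01-01", None, "2020-01-03"])
print(arr[1])      # <NA>
print(arr.isna())  # [False  True False]
```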