[Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64

Joris Van den Bossche jorisvandenbossche at gmail.com
Tue Jun 2 15:42:30 EDT 2020


On Tue, 2 Jun 2020 at 01:36, Brock Mendel <jbrockmendel at gmail.com> wrote:

> Before responding to questions, one topic I forgot to include in the OP:
>
> The performance of Timestamp, Timedelta, and Period could be improved (I
> do not have an estimate of how much) if they were cdef (Cython) classes.
> This is not viable at the moment because they each have `__new__` methods,
> which are needed because the constructors can return pd.NaT.  If we had
> dtype-specific NaTs (xref #24983
> <https://github.com/pandas-dev/pandas/issues/24983>) that would allow us
> to make these cdef classes.
>
> ---------
> > Will this [casting non-nano timestamps to nano to use existing
> tz-conversion code] cause issues if the original datetime isn't in the
> bounds of a ns-precision timestamp?
>
> Both technically and conceptually, yes.  [note to self, expand on this
> before hitting send]
>
> > [...] since it represents a point in time rather than a span.
>
> From an implementation standpoint, that distinction is meaningless; the
> same conversion code (the hard part) is used for both.  Conceptually, I
> think of `datetime64[minute]` as representing the same thing as
> `Period[minute]` (both can be used to represent the "4:32" in the corner of
> my screen).
>

Implementation-wise it may be the same, but I think it's useful to keep
those concepts separate for users in the API: Timestamps are points in
time, Periods are time spans.

And for timestamps, I think users should mostly not need to care or think
about the resolution. The main reason they need to care now is that their
dates might not fit in the range supported by nanoseconds, and with a
different default resolution that issue should mostly go away. In this
light, I don't think we need to support resolutions coarser than seconds.
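
To make the range argument concrete, here is a rough stdlib-only sketch
(not pandas code; the helper names `ns_bounds` and `span_in_years` are made
up for illustration). It assumes timestamps are stored as a signed 64-bit
count of ticks since 1970-01-01, with the minimum value reserved for NaT:

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1  # largest int64; the minimum is assumed reserved for NaT
EPOCH = datetime(1970, 1, 1)


def ns_bounds():
    """Approximate earliest/latest datetimes reachable with int64 nanoseconds."""
    # datetime only has microsecond precision, so truncate the ns count to us.
    span = timedelta(microseconds=INT64_MAX // 1000)
    return EPOCH - span, EPOCH + span


def span_in_years(ticks_per_second):
    """Approximate one-sided range, in years, of an int64 tick count."""
    seconds = INT64_MAX / ticks_per_second
    return seconds / (365.2425 * 86400)  # mean Gregorian year in seconds


lo, hi = ns_bounds()
print(lo, hi)
for name, ticks in [("ns", 10**9), ("us", 10**6), ("ms", 10**3), ("s", 1)]:
    print(name, f"about {span_in_years(ticks):.3g} years either side of 1970")
```

With nanoseconds the range is roughly 1677 to 2262; each coarser step
multiplies it by 1000, so already at microseconds the limit stops being a
practical concern for most data.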


>
> Or for Timestamp[D] we can just call that a Date dtype instead of
> re-implementing it (xref #34441
> <https://github.com/pandas-dev/pandas/pull/34441>)
>
> ---------
> > Personally, I don't think we necessarily need to add all units that are
> supported by numpy's datetime64/timedelta64 dtypes.
>
> I have a strong preference against using the Year or Month units, as the
> conversions of those to/from the others is not just
> multiplication/division.  The others I don't feel as strongly about; once
> nanos is no longer hard-coded, the marginal cost of adding more should be
> relatively small.
>
>
> On Sat, May 30, 2020 at 12:18 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Thanks for starting this discussion, Brock!
>>
>> On Fri, 29 May 2020 at 21:03, Tom Augspurger <tom.augspurger88 at gmail.com>
>> wrote:
>>
>>> On Fri, May 29, 2020 at 11:37 AM Brock Mendel <jbrockmendel at gmail.com>
>>> wrote:
>>>
>>>>
>>>> We could then consider de-duplication. Tick is already redundant with
>>>> Timedelta, and Timestamp[H] would render Period[H] redundant.  With
>>>> appropriate deprecation cycle, we could rip out a bunch of code.
>>>>
>>>
>>> What would be the user-facing changes that warrant deprecation? For me,
>>> `Period` represents a span of time. It would make sense to implement
>>> something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01",
>>> freq="H")`. But something checking whether that timestamp is in a
>>> `Timestamp[H]` doesn't seem natural, since it represents a point in time
>>> rather than a span.
>>>
>>>
>> Personally, I don't think we necessarily need to add all units that are
>> supported by numpy's datetime64/timedelta64 dtypes. First, because I don't
>> think it is an important use case (people mostly want to be able to have
>> dates outside of the range limits that nanosecond resolution gives us), and
>> also because it makes it conceptually a lot more difficult. For example,
>> what is a "Timestamp[H]" value? Does it represent the beginning or the end
>> of the hour? Those are questions that are already handled by the Period
>> dtype, and I think it is a good thing to keep those concepts separated (you
>> can of course ask the same question with a millisecond resolution, but I
>> think generally people don't do that).
>> Further, all the resolutions from nanosecond up to second are "just"
>> multiplications x1000, keeping the implementation more simple (compared to
>> resolutions of hours, months, ..).
>>
>> So for a timestamp dtype, we could maybe only support ns / µs / ms / s
>> resolutions?
>>
>>
>
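
As an illustration of the conversion point quoted above, a small sketch
(plain Python, not pandas internals; `FACTORS`, `convert` and
`months_to_days` are hypothetical names) of why ns / us / ms / s are cheap
to support while month-like units are not. It assumes values are integer
offsets from the 1970-01-01 epoch:

```python
from datetime import date

# Ticks per second for the fixed-width units under discussion.
FACTORS = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def convert(value, src, dst):
    """Convert an integer epoch offset between fixed-width units."""
    if FACTORS[dst] >= FACTORS[src]:
        # Going to a finer unit is a plain multiplication.
        return value * (FACTORS[dst] // FACTORS[src])
    # Going to a coarser unit is a floor division and drops precision.
    return value // (FACTORS[src] // FACTORS[dst])


def months_to_days(months):
    """Months since 1970-01 to days since 1970-01-01; needs the calendar."""
    year, month = divmod(months, 12)
    return (date(1970 + year, month + 1, 1) - date(1970, 1, 1)).days


print(convert(1_591_126_950, "s", "ns"))
# Consecutive months do not have a constant length in days.
print([months_to_days(m + 1) - months_to_days(m) for m in range(3)])
```

The first conversion is a single constant factor; the second has no such
factor and has to consult the calendar for every value.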