[Numpy-discussion] datetime64: Remove deprecation warning when constructing with timezone

Noam Yorav-Raphael noamraph at gmail.com
Sat Nov 7 15:22:20 EST 2020


On Fri, Nov 6, 2020 at 5:58 PM Brock Mendel <jbrockmendel at gmail.com> wrote:

> > I find the whole notion of a "timezone naive timestamp" to be nearly
> meaningless
>
> From the perspective of, say, the dateutil parser, what would you do with
> "2020-11-06 07:48"?  If you assume it's UTC you'll be wrong in this case.
> If you assume it is in your local timezone, you'll be wrong in Europe.
> Timezone-naive datetimes are an abstraction for exactly this case.
>
I'm not sure what you mean by "the perspective of the dateutil parser".
Indeed, "2020-11-06 07:48" is not a well-defined timestamp, since it
doesn't define a specific moment in time. If you ask what a timestamp type
should do when constructed from such a string, then I can think of two
reasonable alternatives. One is to just not allow it, and perhaps provide a
.from_local() method which makes it explicit. The other is to allow it, and
make it clear that when an offset is not given, it uses the environment's
timezone to convert the string to a timestamp. I wouldn't use the third
alternative, which is to parse it as UTC: it adds little convenience,
because it's easy to append a "Z" to the string.
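
A minimal sketch of the "just not allow it" alternative, written against the standard library rather than any NumPy or pandas API. The name parse_instant is hypothetical; it rejects strings that carry no offset and otherwise normalizes the value to UTC.

```python
from datetime import datetime, timezone


def parse_instant(text):
    """Parse an ISO-like string into an aware UTC datetime.

    Hypothetical helper: a string without an explicit offset is rejected
    instead of being silently read as UTC or as local time.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError(f"{text!r} has no UTC offset, so it does not name a moment")
    return parsed.astimezone(timezone.utc)


# A string with an offset names a moment: 07:48 at +02:00 is 05:48 UTC.
moment = parse_instant("2020-11-06 07:48+02:00")

# The same wall-clock reading without an offset is refused.
try:
    parse_instant("2020-11-06 07:48")
    rejected = False
except ValueError:
    rejected = True
```

The second alternative would replace the ValueError branch with a conversion through the environment's time zone, which makes the result depend on where the code runs.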


> >>> t0 = pd.Timestamp.now()
>
> You can use `pd.Timestamp.now("UTC")`.  See also
> https://mail.python.org/archives/list/datetime-sig@python.org/thread/PT4JWJLYBE5R2QASVBPZLHH37ULJQR43/
> , https://github.com/pandas-dev/pandas/issues/22451
>
Thanks for pointing this out. However, this doesn't work:

>>> pd.Timestamp.fromtimestamp(time.time(), 'UTC')
Traceback (most recent call last):
...
TypeError: fromtimestamp() takes exactly 2 positional arguments (3 given)

Also, this doesn't work:

>>> t0 = pd.Timestamp.now('UTC')
>>> t1 = pd.Timestamp.now('Asia/Jerusalem')
>>> t1 - t0
Traceback (most recent call last):
...
TypeError: Timestamp subtraction must have the same timezones or no
timezones
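
For comparison, the standard library's aware datetime objects can be subtracted across zones, because the difference is taken between the underlying instants. The sketch below uses a fixed +02:00 offset as a stand-in for Asia/Jerusalem in November; that offset and the sample readings are assumptions chosen to avoid depending on a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset standing in for Asia/Jerusalem (winter time).
jerusalem = timezone(timedelta(hours=2))

utc_reading = datetime(2020, 11, 7, 20, 18, 38, tzinfo=timezone.utc)
local_reading = datetime(2020, 11, 7, 22, 18, 39, tzinfo=jerusalem)

# Both operands are aware, so the offsets are taken into account
# and the result is the real elapsed time between the two moments.
elapsed = local_reading - utc_reading
```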

Also, this doesn't do what it probably should:

>>> pd.Timestamp.now('UTC'), pd.Timestamp.now().tz_localize('UTC')
(Timestamp('2020-11-07 20:18:38.719603+0000', tz='UTC'),
 Timestamp('2020-11-08 01:18:38.719701+0000', tz='UTC'))

(I have no idea how the second result was calculated, but it's wrong. It
should have been equal to the first)
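
One plausible reading of the second result: Timestamp.now() returns the naive local wall-clock time, and tz_localize('UTC') labels that wall-clock reading as UTC without shifting it, so a machine running 5 hours ahead of UTC produces a value 5 hours off. The sketch below shows the same distinction with the standard library (relabelling versus converting); the +05:00 local offset is an assumption inferred from the two outputs above.

```python
from datetime import datetime, timedelta, timezone

# Assumed local zone of the machine, inferred from the 5-hour gap above.
local_zone = timezone(timedelta(hours=5))

instant = datetime(2020, 11, 7, 20, 18, 38, tzinfo=timezone.utc)

# What a naive "now" would read on that machine: local wall clock, no zone.
wall_clock = instant.astimezone(local_zone).replace(tzinfo=None)

# Relabelling: attach UTC to the wall-clock digits without shifting them.
relabelled = wall_clock.replace(tzinfo=timezone.utc)

# Converting: say which zone the digits were in, then move to UTC.
converted = wall_clock.replace(tzinfo=local_zone).astimezone(timezone.utc)
```

Only the converted value names the original moment; the relabelled one is off by exactly the local offset.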

So, pd.Timestamp is crap. I think that adding np.timestamp64 may finally
bring a sane timestamp type to Python.

Thanks,
Noam


>
>
>
> On Fri, Nov 6, 2020 at 2:48 AM Noam Yorav-Raphael <noamraph at gmail.com>
> wrote:
>
>> Hi,
>>
>> I actually arrived at this by first trying to use pandas.Timestamp and
>> getting very frustrated about it. With pandas, I get:
>>
>> >>> pd.Timestamp.now()
>> Timestamp('2020-11-06 09:45:24.249851')
>>
>> I find the whole notion of a "timezone naive timestamp" to be nearly
>> meaningless. A timestamp should mean a moment in time (as the current numpy
>> documentation defines very well). A "naive timestamp" doesn't mean
>> anything. It's exactly like a "unit naive length". I can have a Length type
>> which just takes a number, and be very happy that it works both if my "unit
>> zone" is inches or centimeters. So "Length(3)" will mean 3 cm in most of
>> the world and 3 inches in the US. But then, if I get "Length(3)" from
>> someone, I can't be sure what length it refers to.
>>
>> So currently, this happens with pandas timestamps:
>>
>> >>> import os, time
>> >>> import pandas as pd
>> >>> os.environ['TZ'] = 'UTC'; time.tzset()
>> >>> t0 = pd.Timestamp.now()
>> >>> time.sleep(1)
>> >>> os.environ['TZ'] = 'EST-5'; time.tzset()  # POSIX TZ syntax: 5 hours ahead of UTC
>> >>> t1 = pd.Timestamp.now()
>> >>> t1 - t0
>> Timedelta('0 days 05:00:01.001583')
>>
>> This is not just theoretical - I actually need to work with data from
>> several devices, each in its own time zone. And I need to know that I won't
>> get such meaningless results.
>>
>> And you can even get something like this:
>>
>> >>> t0 = pd.Timestamp.now()
>> >>> time.sleep(10)
>> >>> t1 = pd.Timestamp.now()
>> >>> t1 - t0
>> Timedelta('0 days 01:00:10.001583')
>>
>> if the first measurement happened to be in winter time and the second
>> measurement happened to be in daylight saving time.
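
The daylight-saving case can be reproduced without waiting for a clock change. This is a sketch under stated assumptions: the two fixed offsets (+02:00 and +03:00) and the sample date stand in for a zone's winter and summer time, and are not taken from the thread.

```python
from datetime import datetime, timedelta, timezone

winter = timezone(timedelta(hours=2))  # assumed standard-time offset
summer = timezone(timedelta(hours=3))  # assumed daylight-saving offset

# Two moments 10 seconds apart, straddling the assumed switch.
first = datetime(2020, 3, 27, 1, 59, 55, tzinfo=winter)
second = (first + timedelta(seconds=10)).astimezone(summer)

# Dropping the zone keeps only the wall-clock digits, as a naive "now" would.
naive_elapsed = second.replace(tzinfo=None) - first.replace(tzinfo=None)

# Keeping the zone gives the time that actually passed.
true_elapsed = second - first
```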
>>
>> The solution is simple, and is what datetime64 used to do before the
>> change - have a type that just represents a moment in time. It's not "in
>> UTC" - it just stores the number of seconds that passed since an agreed
>> moment in time (which is usually 1970-01-01 02:00+0200, which is more
>> commonly referred to as 1970-01-01 00:00Z - it's the exact same moment).
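
A small standard-library check of the point about the epoch: the two spellings name the same moment, and that moment is zero seconds since the epoch.

```python
from datetime import datetime

local_spelling = datetime.fromisoformat("1970-01-01 02:00+02:00")
utc_spelling = datetime.fromisoformat("1970-01-01 00:00+00:00")

# Aware datetimes compare by the moment they name, not by their digits.
same_moment = local_spelling == utc_spelling

# Seconds since the epoch for an aware datetime.
seconds_since_epoch = local_spelling.timestamp()
```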
>>
>> I think it would make things clearer if I mention that there are
>> operations that do not deal with timestamps. For example, it's
>> meaningless to ask what the year of a timestamp is - it may depend on the
>> time zone. These are always *human* related questions, that depend on
>> certain human conventions. We can call them "calendar questions". For these
>> types of questions, a type that includes both a timestamp and a timezone
>> offset (in minutes from UTC) can be useful. Some questions even require
>> full timezone information, meaning a function that defines the
>> timezone offset for each moment. However, I don't think numpy should deal
>> with those calendar issues. As a very simple example, even for
>> "timestamp+offset" types, it's not clear how to compare them - should
>> values with the same timestamp and different offsets be considered equal or
>> not? And in virtually all of my data analysis, this calendar aspect has
>> nothing to do with the questions I'm trying to answer.
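
A short illustration of why the year is a "calendar question", and of one possible answer to the equality question. The +02:00 offset is an assumption for the example; the equality behaviour shown is that of Python's standard library, not a statement about what numpy should do.

```python
from datetime import datetime, timedelta, timezone

instant = datetime(2020, 12, 31, 23, 30, tzinfo=timezone.utc)

# The same moment, viewed with an assumed +02:00 offset.
shifted = instant.astimezone(timezone(timedelta(hours=2)))

year_utc = instant.year      # calendar answer under UTC
year_shifted = shifted.year  # calendar answer under +02:00

# The standard library treats same moment, different offset as equal...
equal_as_instants = instant == shifted

# ...even though the calendar fields differ.
same_fields = (instant.year, instant.hour) == (shifted.year, shifted.hour)
```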
>>
>> I have a suggestion. Instead of changing datetime64 (which I consider to
>> be ill-defined, but never mind), add a new type called "timestamp64". It
>> will have the exact same behavior as datetime64 had before the change,
>> except that its only allowed units will be seconds, milliseconds,
>> microseconds and nanoseconds.  Removing the longer units will make it clear
>> that it doesn't deal with calendar and dates. Also, all the business day
>> functionality will not be applicable to timestamp64. In order to get
>> calendar information (such as the year) from timestamp64, you will have to
>> manually convert it to Python's datetime (or to np.datetime64) with an
>> explicit timezone (utc, local, an offset, or a timezone object).
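
To make the proposal concrete, here is a pure-Python sketch of the idea. Nothing in it is a NumPy API: the class name Instant, its micros field and the to_datetime method are all hypothetical, chosen only to show a type that stores a count since the epoch and hands out calendar fields only after an explicit timezone is supplied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True, order=True)
class Instant:
    """Hypothetical stand-in for the proposed type: microseconds since the epoch."""

    micros: int

    def __sub__(self, other):
        # Differences between moments never involve a time zone.
        return timedelta(microseconds=self.micros - other.micros)

    def to_datetime(self, tz):
        # Calendar fields (year, hour, ...) exist only after choosing a zone.
        return (EPOCH + timedelta(microseconds=self.micros)).astimezone(tz)


start = Instant(1_604_584_800_000_000)  # 2020-11-05 14:00:00 UTC
later = Instant(start.micros + 1_500_000)

elapsed = later - start
as_utc = start.to_datetime(timezone.utc)
as_plus_two = start.to_datetime(timezone(timedelta(hours=2)))
```

The type deliberately has no year or hour attribute of its own, so a calendar question cannot be asked without naming a zone.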
>>
>> What do you think?
>>
>> Thanks,
>> Noam
>>
>>
>>
>>
>>
>> On Fri, Nov 6, 2020 at 1:45 AM Stephan Hoyer <shoyer at gmail.com> wrote:
>>
>>> I can try to dig up the old discussions, but datetime64 used to
>>> implement both (1) and (3), and this was updated in a very intentional way.
>>> Datetime64 now works like Python's own time-zone naive datetime.datetime
>>> objects. The documentation referencing "Z" should be updated -- datetime64
>>> can be in any timezone you like.
>>>
>>> Timezone aware datetime objects are certainly useful, but NumPy's
>>> datetime64 was restricted to UTC. The consensus was that it was worse to
>>> have UTC-only rather than timezone-naive-only. NumPy's datetime64 is often
>>> used for data analysis purposes, for which automatic conversion to the
>>> local timezone of the computer running the analysis is often
>>> counter-productive.
>>>
>>> If you care about timezone conversions, I would highly recommend looking
>>> into pandas's Timestamp class for this purpose. In the future, this would
>>> be a good use-case for a new custom NumPy dtype. (The existing
>>> np.datetime64 code cannot easily handle multiple timezones.)
>>>
>>> On Thu, Nov 5, 2020 at 1:04 PM Eric Wieser <wieser.eric+numpy at gmail.com>
>>> wrote:
>>>
>>>> Without weighing in yet on how I feel about the deprecation, you can
>>>> see some discussion about why this was originally deprecated in the PR that
>>>> introduced the warning:
>>>>
>>>> https://github.com/numpy/numpy/pull/6453
>>>>
>>>> Eric
>>>>
>>>> On Thu, Nov 5, 2020, 20:13 Noam Yorav-Raphael <noamraph at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I suggest removing the deprecation warning when constructing a
>>>>> datetime64 with a timezone. For example, this is the current behavior:
>>>>>
>>>>> >>> np.datetime64('2020-11-05 16:00+0200')
>>>>> <stdin>:1: DeprecationWarning: parsing timezone aware datetimes is
>>>>> deprecated; this will raise an error in the future
>>>>> numpy.datetime64('2020-11-05T14:00')
>>>>>
>>>>> I suggest removing the deprecation warning because I find this to be a
>>>>> useful behavior, and because it is a correct behavior. The manual says:
>>>>> "The datetime object represents a single moment in time... Datetimes are
>>>>> always stored based on POSIX time, with an epoch of 1970-01-01T00:00Z."
>>>>> So 2020-11-05T16:00+0200 is indeed the moment in time represented by
>>>>> np.datetime64('2020-11-05T14:00').
>>>>>
>>>>> I just used this to restrict my data set to records created after a
>>>>> certain moment. It was easier for me to write the moment in my local time
>>>>> and add "+0200" than to figure out the moment representation in UTC.
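
The same convenience can be had without the deprecated string parsing by doing the offset arithmetic in the standard library first. A minimal sketch, assuming the input always carries an offset in the +HHMM form used above:

```python
from datetime import datetime, timezone

local_text = "2020-11-05 16:00+0200"

# %z parses the +0200 suffix and yields an aware datetime.
aware = datetime.strptime(local_text, "%Y-%m-%d %H:%M%z")

# Convert to UTC, then drop the zone to get the naive UTC reading.
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)
```

np.datetime64 accepts a datetime object, so np.datetime64(naive_utc) should give the 14:00 value shown above without triggering the warning.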
>>>>>
>>>>> So this is my simple suggestion: remove the deprecation warning.
>>>>>
>>>>>
>>>>> Beyond that, I have 3 ideas for changing the repr of datetime64 that I
>>>>> would like to discuss.
>>>>>
>>>>> 1. Add "Z" at the end, for example,
>>>>> numpy.datetime64('2020-11-05T14:00Z'). This will make it clear to which
>>>>> moment it refers. I think this is significant - I had to dig quite a bit to
>>>>> realize that datetime64('2020-11-05T14:00') means 14:00 UTC.
>>>>>
>>>>> 2. Replace the 'T' with a space. I just find it much easier to read
>>>>> '2020-11-05 14:00Z' than '2020-11-05T14:00Z'. The long sequence of
>>>>> characters makes it hard for my brain to parse.
>>>>>
>>>>> 3. This will require discussion, but will be very convenient: have the
>>>>> repr display the time using the environment time zone, including a time
>>>>> offset. So, in my specific time zone (+0200), I will have:
>>>>>
>>>>> repr(np.datetime64('2020-11-05 14:00Z')) ==
>>>>> "numpy.datetime64('2020-11-05T16:00+0200')"
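
A sketch of what idea 3 would print, using the standard library only. The function name local_repr is hypothetical and this is not how numpy builds its repr; the explicit tz argument is there so the example does not depend on the machine it runs on.

```python
from datetime import datetime, timedelta, timezone


def local_repr(instant, tz=None):
    """Format an aware datetime in `tz`, including the offset.

    Hypothetical helper: with tz=None, astimezone() falls back to the
    environment's time zone, which is the behaviour idea 3 describes.
    """
    shown = instant.astimezone(tz)
    return shown.strftime("%Y-%m-%dT%H:%M%z")


instant = datetime(2020, 11, 5, 14, 0, tzinfo=timezone.utc)

# With an explicit +02:00 zone the output is reproducible.
text = local_repr(instant, timezone(timedelta(hours=2)))
```

Calling local_repr(instant) with no zone gives a different string on machines in different zones, which is exactly the environment dependence that needs discussion.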
>>>>>
>>>>> I'm sure the pros and cons of having an environment-dependent repr
>>>>> should be discussed. But I will list some pros:
>>>>> 1. It's very convenient - it's immediately obvious to me to which
>>>>> moment 2020-11-05 16:00+0200 refers.
>>>>> 2. It's well defined - I may collect timestamps from machines with
>>>>> different time zones, and I will be able to know to which exact moment each
>>>>> timestamp refers.
>>>>> 3. It's very simple - I could compare any two timestamps, I don't have
>>>>> to worry about time zones.
>>>>>
>>>>> I would be happy to hear your thoughts.
>>>>>
>>>>> Thanks,
>>>>> Noam
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list
>>>>> NumPy-Discussion at python.org
>>>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>>>