[Datetime-SIG] PEP-431/495

Tim Peters tim.peters at gmail.com
Fri Aug 21 22:49:13 CEST 2015


[Stuart Bishop <stuart at stuartbishop.net>]
> Sorry I'm late. pytz author here.

Hi, Stuart!  Nice to see you.  Stay a while :-)

> Gosh you guys write a lot. I've tried to skim things, and will default
> to agreeing with Tim since it is usually the smart thing to do.

Excellent judgment.  Although agreeing with Guido is mandatory, and
he's wrong about some things here ;-)


> A few notes from my skimming:
>
> - I want a boolean added to datetime instances, even if I don't like
> the name, because I can then deprecate pytz and its confusing API and
> implementation. I'm happy to work on Python implementation and
> documentation. It will save me time and effort in the long run.

Later you seem to say you'd prefer a 3-state flag instead, so not sure
you really mean "boolean" here.


> - Most of my thoughts got encoded in PEP-431. This would give us a
> datetime module that operates exactly the way it does today,

No.  While 431 was highly obscure on this point, it turned out that
Lennart was determined to change arithmetic behavior.  That can't fly,
both for backward compatibility and because even "aware" datetimes were
intended to use a "naive time" model internally.

Specifically, if you add timedelta(days=1) to a datetime today, you
get "same time tomorrow" (day goes up by 1, but hour, minute, second
and microsecond remain the same) in all cases. Even if a DST
transition (or base-offset change, or leap-second change) occurred.
That's now called "classic" arithmetic.  The default behavior can't be
changed.

What you seem to have in mind (accounting for two of the three known
reasons why a local clock may jump:  DST and base-offset changes, but
not leap-second changes) is now called "timeline" (sometimes "strict")
arithmetic.

According to Lennart, under PEP 431 timeline arithmetic would always
be used.  Under PEP 495, nothing about arithmetic changes.  495 is
less ambitious, only intending to supply the bit(s) needed to _allow_
timeline arithmetic to be implemented as an option later.  PEP 500 is
about supplying different arithmetics, but Guido hates PEP 500.

In the end, I expect timezone wrappers will supply factory functions,
either separate functions for "give me such-and-such a timezone using
classic arithmetic" and "give me such-and-such a timezone using
timeline arithmetic", or a single function specifying the desired
timezone and an optional flag to specify the arithmetic desired.
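
Purely as a hypothetical sketch of that idea (none of these names
exist anywhere, and how a "timeline" zone would actually hook into
+ and - is exactly the open question PEP 500 poked at):

    # Hypothetical sketch only - get_timezone, load_zone_data, ClassicZone
    # and TimelineZone are all made-up names.
    def get_timezone(name, *, timeline=False):
        """Return a tzinfo for IANA zone `name`.

        timeline=False (the default) leaves datetime's classic arithmetic
        alone; timeline=True would return a wrapper intended for use with
        timeline-style (convert-to-UTC-first) arithmetic.
        """
        zone = load_zone_data(name)
        return TimelineZone(zone) if timeline else ClassicZone(zone)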

> but with the option of performing pytz style unambiguous datetime
> arithmetic

There was nothing optional about it in 431.  495 doesn't address
arithmetic, except to make it _possible_ to implement timeline
arithmetic.

> without pytz and its confusing API.

> If the developer explicity set the is_dst flag, then exceptions would
> be raised when trying to instantiate an ambiguous or invalid timestamp.
> For code that does not specify the new, optional flag things work as
> they do today and a best guess made when the localized datetime is
> constructed.

It's possible that 495 should do more in this direction.  For now, it
specifies enough that someone who cares can easily write a function to
distinguish among "ambiguous time" (in a fold), "invalid time" (in a
gap), and "happy time" ;-), and do whatever _they_ want (ignore some
subset, raise an exception, print a warning, supply a default, prompt
the user for more info, ...).
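
For instance, given the bit (spelled here as `fold`, the spelling that
eventually won), such a function could be as short as the following
sketch - "classify" is just a made-up name, and it assumes a
495-conformant tzinfo, one whose .utcoffset() consults the bit and
whose .fromutc() sets it:

    from datetime import timezone

    def classify(dt):
        """Return 'gap', 'fold', or 'happy' for an aware datetime dt."""
        # A time in a gap doesn't survive a round trip through UTC.
        if dt.astimezone(timezone.utc).astimezone(dt.tzinfo) != dt:
            return "gap"
        # A time in a fold maps to two distinct UTC times, one per flag value.
        if dt.replace(fold=0).utcoffset() != dt.replace(fold=1).utcoffset():
            return "fold"
        return "happy"

The caller can then raise, warn, ignore, or prompt, as it pleases.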

> - PEP-495 seems similar to PEP-431,

See above.  431 was about arithmetic, although it didn't say so
clearly.  495 is _only_ about adding a flag.

> except that it attempts to allow things continue in the face of
> an ambiguous or invalid localized datetime.
>
> The boolean flag is not tristate, so there is no way to have
> strict checking of input. It doesn't matter if the developer said
> 'whatever' and left the flag on the default, or cared enough to
> explicitly override it.

As above, it's possible 495 should do more.  But it's hard to know
when to stop.  For example, there are many ways of specifying a
datetime, including, e.g., using .combine() to paste a date and time
together.  It's generally impossible to make a fold/gap determination
on a time alone - that's only possible in combination with a date.  So
does .combine() also need to whine?  It's simpler overall to leave it
to those users who care to check when they do care.


> - The rules in PEP-495 for utcoffset() and dst() to deal with
> ambiguous times only work in simple cases, as there are dst offsets both
> more and less than 1 hour, and there is no stdoffset since the offset
> can change at the same time (eg. Europe/Vilnius 1941, where the clocks
> ended up going backwards for summer time instead of forwards).

495 couldn't care less what causes folds and gaps - it's equally
applicable to all causes, and whether in isolation or combination.
What it _does_ assume is that a single bit suffices to resolve
ambiguities:  that there is no case in which more than two UTC times
have the same spelling on a local clock.  The goal of the PEP is to
supply that bit.  The burden is on the tzinfo supplier to set and use
it correctly.  The burden is also on the tzinfo supplier to supply a
.utcoffset() "that works" to convert a local time to UTC, to supply a
.dst() that returns whatever the tzinfo supplier thinks it should
return, and to supply a .fromutc() that sets the bit correctly.

The default .fromutc() is indeed too weak to handle anything except
zones subject to nothing fancier than DST transitions alternating
between "zero" and "non-zero", and that's not changing either.
Neither will the default .fromutc() be changed to set
first/fold/later/is_dst - only a tzinfo implementer has enough info
about how the timezone works to set the bit correctly and
semi-efficiently in all cases (the default .fromutc() can only ask
what the total UTC, and DST, offsets are at specific microseconds in
local time - it has no knowledge deeper than that, because those are
the only questions the tzinfo interface _can_ be asked).
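
Concretely, setting the bit is .fromutc()'s job during conversion from
UTC.  Illustrated with the much-later zoneinfo module (again, just an
illustration - any conformant tzinfo behaves the same way) around the
US fall-back transition at 2015-11-01 06:00 UTC:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+; illustration only

    eastern = ZoneInfo("America/New_York")
    for hour in (5, 6):
        u = datetime(2015, 11, 1, hour, 30, tzinfo=timezone.utc)
        local = u.astimezone(eastern)  # astimezone() defers to the zone's fromutc()
        print(u, "->", local, "fold =", local.fold)
    # 2015-11-01 05:30:00+00:00 -> 2015-11-01 01:30:00-04:00 fold = 0  (first 01:30)
    # 2015-11-01 06:30:00+00:00 -> 2015-11-01 01:30:00-05:00 fold = 1  (second 01:30)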

As to "more and less than 1 hour", yes, the PEP hasn't been updated to
clarify that "hour" _means_ "some number of microseconds" ;-)


> - Other APIs I know of, including Python's time module, uses is_dst or
> isdst as the required boolean flag. As do the timezone databases
> containing the data we need. I think the argument against the is_dst
> flag name in PEP-495 is flaccid.

is_dst makes no sense for base-offset or leap-second transitions
either; "first"/"fold"/"later" make equally clear sense for all causes
of folds.  But Guido hates leap seconds, seemingly intending to make
it impossible for anyone to support them directly (via overloading
datetime arithmetic operators), and so the case against "is_dst" is
weaker now.


> - If there is an argument in favour of 'first' over 'is_dst', it is
> because occasionally there are timezone changes without a dst
> transition. If we call it is_dst, we agree that in a few rare
> historical cases we are going to have to lie.

There are only two tzinfo authors in the world ;-) (you and Gustavo),
and by all evidence you're both way more than bright enough to adapt
to any spelling ;-)


> - My argument in favour of 'is_dst' over 'first' is that this is what
> we have in the data we are trying to load.  You commonly have
> a timestamp with a timezone abbreviation and/or offset. This can
> easily be converted to an is_dst flag.

You mean by using platform C library functions (albeit perhaps wrapped
by Python)?

> To convert it to a 'first' flag, we need to first parse the datetime,

I'm unclear on this.  To get a datetime _at all_ the timestamp has to
be converted to calendar notation (year, month, ...).  Which is what
I'm guessing "parse" means here.  That much has to be done in any
case.

> determine the transition points that year, and then which side of
> the nearest transition point it lies.  Note that there can be more
> than 2 transition points in a year, and no api has been discussed for
> discovering them.

Python doesn't need such an API.  It needs the tzinfo author to
implement .utcoffset(), .dst(), and .fromutc() according to whatever
rules a timezone requires.  Python code would convert the timestamp to
UTC calendar notation first, then use .astimezone() to convert to
whatever "timezone abbreviation and/or offset" was specified.
astimezone() in turn gets everything it needs from the tzinfo's
.fromutc().
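
In today's spelling (zoneinfo again standing in for "some tzinfo
source", and the timestamp is arbitrary), that flow is just:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # stand-in for any tzinfo source

    ts = 1440190153  # an arbitrary POSIX timestamp
    utc_dt = datetime.fromtimestamp(ts, tz=timezone.utc)      # UTC calendar notation first
    local_dt = utc_dt.astimezone(ZoneInfo("Europe/Vilnius"))  # fromutc() does the work
    # fromutc() also sets the disambiguation bit; no external is_dst is consulted.
    print(local_dt, local_dt.tzname(), local_dt.fold)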

I'm unclear anyway on why you'd trust an external is_dst flag to be
reliable in the funky cases where, e.g., base-offset and DST
transitions coincide.  You either think it's important to handle such
cases or you don't.  If you do, what do _you_ think tm_isdst means in
such cases?  If you're relying on external code to compute is_dst for
you, then it doesn't matter what anyone in the Python world thinks it
should mean.  It only matters what the universe of C library authors
thought it should mean, assuming they were even aware of such cases.
The relevant standards are no help at all in such edge cases.

The web is filled with complaints about puzzling tm_isdst behavior in
edge cases, and no two implementations seem to agree on what -1
"really means" even in seemingly straightforward cases.  I'd rather
that Python tzinfo authors implement exactly what _they_ think a
timezone's rules really are - which indeed requires analyzing a time
using all the timezone's internal rules.


> - I think datetime should consider 1 day == 24 hours and not have
> concepts like years or months, just like it does today. As others
> suggested, a separate module dealing with leap years and variable
> length days may be useful to some people, as would leapsecond support
> for astronomers and astrologers. But if the default implementation
> gives different results to all the other tools on your system, people
> will think the default is wrong.

Not sure what you mean here without specific examples of what you have
in mind.  But, as above, classic arithmetic will remain the default
regardless - it's a dozen years too late to change that, even if
everyone wanted to (and - surprise - everyone doesn't ;-) ).


> - Offsets should ideally be declared in seconds. Last I looked, the
> current Python implementation rounds them to the nearest minute and it
> would be nice to fix that. These are almost always historical, dating
> from when noon was when the sun was at its highest point above the
> capital (eg. Europe/Amsterdam before 1938)

Offsets are currently required to be a whole number of minutes, with
magnitude less than 24*60 minutes (a day).  No rounding is done - an
exception is raised for any other offset.  That should change, and
Alexander has already done most of the work for it, but it's not in
the scope of this PEP.  "The flag" can be added with or without that
change.


> - There are cases where there are gaps at the end of DST, and folds at
> the beginning of DST, when the timezone offsets were changed
> simultaneously with the dst flag.

That's fine, provided again that a single bit suffices to resolve
ambiguous times on the local clock.  A fold is a fold and a gap is a
gap, regardless of cause.  It's only if we, e.g., _name_ the flag
"is_dst" that someone is likely to erroneously assume that the flag
always _means_ "and so there's a fold when it changes from True to
False, and a gap when it changes from False to True".


> - Microsoft's timezone database does not contain historical
> information, which is why databases that need support under Windows
> like PostgreSQL include the IANA/Olson database.
>
> - Thank you to everyone who has been working on this. I've wanted it
> for a long, long time but never got around to remembering how to write
> C.

Au contraire - thank _you_ for pytz!  That was such a heroic effort
to overcome the lack of a bit that it's legendary :-)  We'll get this
all to work cleanly in the end.

