[Numpy-discussion] fixing up datetime

Tue Jun 7 13:41:55 EDT 2011

Robert Kern <robert.kern <at> gmail.com> writes:

> 
> On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld <dave.hirschfeld <at> gmail.com> wrote:
> 
> > I'm not convinced about the events concept - it seems to add complexity
> > for something which could be accomplished better in other ways. A [Y]//4
> > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There
> > may well be a good reason for it however I can't see the need for it in my
> > own applications.
> 
> Well, [D/100] doesn't represent [864s]. It represents something that
> happens 100 times a day, but not necessarily at precise regular
> intervals. For example, suppose that I am representing payments that
> happen twice a month, say on the 1st and 15th of every month, or the
> 5th and 20th. I would use [M/2] to represent that. It's not [2W], and
> it's not [15D]. It's twice a month.
> 
> The default conversions may seem to imply that [D/100] is equivalent
> to [864s], but they are not intended to. They are just a starting
> point for one to write one's own, more specific conversions.
> Similarly, we have default conversions from low frequencies to high
> frequencies defaulting to representing the higher precision event at
> the beginning of the low frequency interval. E.g. for days->seconds,
> we assume that the day is representing the initial second at midnight
> of that day. We then use offsets to allow the user to add more
> information to specify it more precisely.
> 

That would be one way of dealing with irregularly spaced data. I would argue
that the example is somewhat back-to-front though. If something happens
twice a month it's not occurring at a monthly frequency, but at a higher
frequency. In this case the lowest frequency which can capture this data is
daily, so it should be stored at daily frequency and, if monthly
statistics are required, the series can be aggregated up after the fact, e.g.

In [2]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M')

In [3]: dates = dates.asfreq('D','START')

In [4]: dates = (dates[:,None] + np.tile(np.array([4, 19]), [12, 1])).ravel()

In [5]: data = 100 + 10*np.random.randn(12, 2)

In [6]: payments = ts.time_series(data.ravel(), dates)

In [7]: payments
Out[7]:
timeseries([ 103.76588849  101.29566771   91.10363573  101.90578443  102.12588909
   89.86413807   94.89200485   93.69989375  103.37375202  104.7628273
   97.45956699   93.39594431   94.79258639  102.90656477   87.42346985
   91.43556069   95.21947628   93.0671271   107.07400065   92.0835356
   94.11035154   86.66521318  109.36556861  101.69789341],
   dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011
 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011
 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011
 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011],
   freq  = D)

In [8]: payments.convert('M', ma.count)
Out[8]:
timeseries([2 2 2 2 2 2 2 2 2 2 2 2],
   dates = [Jan-2011 ... Dec-2011],
   freq  = M)

In [9]: payments.convert('M', ma.sum)
Out[9]:
timeseries([205.061556202 193.009420163 191.990027161 188.591898598 208.136579315
 190.855511303 197.699151161 178.859030538 188.286603379 199.157536259
 180.775564724 211.063462017],
   dates = [Jan-2011 ... Dec-2011],
   freq  = M)
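
For anyone without scikits.timeseries to hand, the same store-daily-then-aggregate
idea can be sketched with plain numpy datetime64 arrays. This is only an
illustration under my own assumptions: the variable names and the seeded random
amounts are made up, and np.unique/np.bincount stand in for convert('M', ...).

```python
import numpy as np

# Twelve monthly periods for 2011, then the first day of each month.
months = np.arange('2011-01', '2012-01', dtype='datetime64[M]')
month_starts = months.astype('datetime64[D]')

# Payments on the 5th and 20th: offsets of 4 and 19 days from each month start.
offsets = np.array([4, 19], dtype='timedelta64[D]')
dates = (month_starts[:, None] + offsets).ravel()

# Made-up payment amounts, one per date (seeded so the run is repeatable).
rng = np.random.default_rng(0)
data = 100 + 10 * rng.standard_normal(dates.size)

# Aggregate up to monthly frequency: truncate each date to its month,
# then count and sum the values that fall in each month.
labels, inverse, counts = np.unique(dates.astype('datetime64[M]'),
                                    return_inverse=True, return_counts=True)
sums = np.bincount(inverse, weights=data)
```

Here counts plays the role of convert('M', ma.count) and sums the role of
convert('M', ma.sum); every month should report two payments.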

Alternatively, for a fixed number of events per period, the values can just be
stored in a 2D array, e.g.

In [10]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M')

In [11]: payments = ts.time_series(data, dates)

In [12]: payments
Out[12]:
timeseries(
 [[ 103.76588849  101.29566771]
 [  91.10363573  101.90578443]
 [ 102.12588909   89.86413807]
 [  94.89200485   93.69989375]
 [ 103.37375202  104.7628273 ]
 [  97.45956699   93.39594431]
 [  94.79258639  102.90656477]
 [  87.42346985   91.43556069]
 [  95.21947628   93.0671271 ]
 [ 107.07400065   92.0835356 ]
 [  94.11035154   86.66521318]
 [ 109.36556861  101.69789341]],
    dates =
 [Jan-2011 ... Dec-2011],
    freq  = M)

In [13]: payments.sum(1)
Out[13]:
timeseries([ 205.0615562   193.00942016  191.99002716  188.5918986   208.13657931
  190.8555113   197.69915116  178.85903054  188.28660338  199.15753626
  180.77556472  211.06346202],
   dates = [Jan-2011 ... Dec-2011],
   freq  = M)
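
The 2D layout needs nothing beyond an ordinary array with one row per period.
A rough numpy-only sketch, with made-up values and hypothetical variable names,
might look like this:

```python
import numpy as np

# One label per monthly period in 2011.
months = np.arange('2011-01', '2012-01', dtype='datetime64[M]')

# Row i holds the two payments made in months[i] (made-up values).
rng = np.random.default_rng(0)
payments = 100 + 10 * rng.standard_normal((months.size, 2))

monthly_total = payments.sum(axis=1)   # one total per month
per_slot_mean = payments.mean(axis=0)  # mean of the first and of the second payment
```

The trade-off is that the day of the month on which each payment falls is no
longer recorded; only its position within the month is kept.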

It seems to me that either of these would satisfy the use-case with the added
benefit of simplifying the datetime implementation. That said, I'm not against
the proposal if it provides someone with some benefit...

Regarding the default conversions, the start of the interval is a perfectly
acceptable default (and my preferred choice). However, being able to specify
the end is also useful: for the month-to-day conversion, the end of each month
can't be specified by a fixed offset from the start because months vary in
length. Of course the end can be found by subtracting 1 from the start of the
next month:

(M + 1).asfreq('D',offset=0) - 1

but just as it's easier to write List[-1] rather than List[List.length - 1], it's
easier to write

M.asfreq('D',offset=-1)
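
To make the varying-length point concrete, the "start of next month minus one
day" arithmetic can be sketched with numpy datetime64 values. The offset keyword
above is only my suggested spelling, not an existing numpy API; the sketch below
uses ordinary datetime64/timedelta64 arithmetic and hypothetical variable names.

```python
import numpy as np

# Monthly periods for 2011 and the first day of each.
months = np.arange('2011-01', '2012-01', dtype='datetime64[M]')
month_starts = months.astype('datetime64[D]')

# Last day of each month: first day of the following month, minus one day.
next_starts = (months + np.timedelta64(1, 'M')).astype('datetime64[D]')
month_ends = next_starts - np.timedelta64(1, 'D')

# Number of days in each month, which is why no fixed offset from the
# start of the month can reach the end.
month_lengths = (month_ends - month_starts).astype('int64') + 1
```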

Regards,
Dave