[Numpy-discussion] fixing up datetime
Dave Hirschfeld
dave.hirschfeld at gmail.com
Tue Jun 7 13:41:55 EDT 2011
Robert Kern <robert.kern <at> gmail.com> writes:
>
> On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld <dave.hirschfeld <at> gmail.com> wrote:
>
> > I'm not convinced about the events concept - it seems to add complexity
> > for something which could be accomplished better in other ways. A [Y]//4
> > dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There
> > may well be a good reason for it however I can't see the need for it in my
> > own applications.
>
> Well, [D/100] doesn't represent [864s]. It represents something that
> happens 100 times a day, but not necessarily at precise regular
> intervals. For example, suppose that I am representing payments that
> happen twice a month, say on the 1st and 15th of every month, or the
> 5th and 20th. I would use [M/2] to represent that. It's not [2W], and
> it's not [15D]. It's twice a month.
>
> The default conversions may seem to imply that [D/100] is equivalent
> to [864s], but they are not intended to. They are just a starting
> point for one to write one's own, more specific conversions.
> Similarly, we have default conversions from low frequencies to high
> frequencies defaulting to representing the higher precision event at
> the beginning of the low frequency interval. E.g. for days->seconds,
> we assume that the day is representing the initial second at midnight
> of that day. We then use offsets to allow the user to add more
> information to specify it more precisely.
>
That would be one way of dealing with irregularly spaced data. I would argue
that the example is somewhat back-to-front though. If something happens
twice a month it's not occurring at a monthly frequency, but at a higher
frequency. In this case the lowest frequency which can capture this data is
daily, so it should be stored at daily frequency, and if monthly
statistics are required the series can be aggregated up after the fact, e.g.
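# Assumed setup, not shown in the original session: ts is scikits.timeseries,
# np is numpy, ma is numpy.ma, and array/randn come from the pylab namespace.
In [2]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M')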
In [3]: dates = dates.asfreq('D','START')
In [4]: dates = (dates[:,None] + np.tile(array([4, 19]), [12, 1])).ravel()
In [5]: data = 100 + 10*randn(12, 2)
In [6]: payments = ts.time_series(data.ravel(), dates)
In [7]: payments
Out[7]:
timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909
89.86413807 94.89200485 93.69989375 103.37375202 104.7628273
97.45956699 93.39594431 94.79258639 102.90656477 87.42346985
91.43556069 95.21947628 93.0671271 107.07400065 92.0835356
94.11035154 86.66521318 109.36556861 101.69789341],
dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011
05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011
05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011
05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011],
freq = D)
In [8]: payments.convert('M', ma.count)
Out[8]:
timeseries([2 2 2 2 2 2 2 2 2 2 2 2],
dates = [Jan-2011 ... Dec-2011],
freq = M)
In [9]: payments.convert('M', ma.sum)
Out[9]:
timeseries([205.061556202 193.009420163 191.990027161 188.591898598 208.136579315
190.855511303 197.699151161 178.859030538 188.286603379 199.157536259
180.775564724 211.063462017],
dates = [Jan-2011 ... Dec-2011],
freq = M)
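For anyone without the timeseries package to hand, the same idea can be sketched with plain NumPy datetime64 arrays. This is only an illustration under stated assumptions: the amounts are made up, the seeded generator is my choice, and the grouping via np.unique/np.bincount stands in for convert(), it is not how convert() is implemented.

```python
import numpy as np

# Twelve months of 2011; casting [M] -> [D] lands on the first day of each month.
months = np.arange("2011-01", "2012-01", dtype="datetime64[M]")
month_starts = months.astype("datetime64[D]")

# Payments on the 5th and 20th, i.e. 4 and 19 days after the month start.
offsets = np.array([4, 19], dtype="timedelta64[D]")
dates = (month_starts[:, None] + offsets).ravel()

# Illustrative payment amounts (made up for this sketch).
rng = np.random.default_rng(0)
data = 100 + 10 * rng.standard_normal(dates.size)

# Aggregate back up to monthly: truncate each date to its month, then group.
month_of = dates.astype("datetime64[M]")
unique_months, inverse = np.unique(month_of, return_inverse=True)
counts = np.bincount(inverse)              # number of payments per month
sums = np.bincount(inverse, weights=data)  # total paid per month
```

The data is stored at the frequency it actually occurs at (daily), and the monthly view is derived afterwards rather than being baked into the dtype.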
Alternatively, for a fixed number of events per period, the values can just be
stored in a 2D array, e.g.
In [10]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M')
In [11]: payments = ts.time_series(data, dates)
In [12]: payments
Out[12]:
timeseries(
[[ 103.76588849 101.29566771]
[ 91.10363573 101.90578443]
[ 102.12588909 89.86413807]
[ 94.89200485 93.69989375]
[ 103.37375202 104.7628273 ]
[ 97.45956699 93.39594431]
[ 94.79258639 102.90656477]
[ 87.42346985 91.43556069]
[ 95.21947628 93.0671271 ]
[ 107.07400065 92.0835356 ]
[ 94.11035154 86.66521318]
[ 109.36556861 101.69789341]],
dates =
[Jan-2011 ... Dec-2011],
freq = M)
In [13]: payments.sum(1)
Out[13]:
timeseries([ 205.0615562 193.00942016 191.99002716 188.5918986 208.13657931
190.8555113 197.69915116 178.85903054 188.28660338 199.15753626
180.77556472 211.06346202],
dates = [Jan-2011 ... Dec-2011],
freq = M)
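The 2D layout needs nothing beyond ordinary array operations, which a plain NumPy sketch makes clear. The amounts and the seeded generator are again made up for illustration; only the (12, 2) shape and the row-wise sum come from the session above.

```python
import numpy as np

# One row per month, one column per event within that month.
months = np.arange("2011-01", "2012-01", dtype="datetime64[M]")
rng = np.random.default_rng(0)
data = 100 + 10 * rng.standard_normal((months.size, 2))

# Monthly statistics are plain reductions along the event axis.
monthly_total = data.sum(axis=1)
events_per_month = data.shape[1]
```

The cost is that the layout only works when every period has the same number of events; a month with a missed payment would need masking or padding.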
It seems to me that either of these would satisfy the use-case with the added
benefit of simplifying the datetime implementation. That said I'm not against
the proposal if it provides someone with some benefit...
Regarding the default conversions, the start of the interval is a perfectly
acceptable default (and my preferred choice). However, being able to specify
the end is also useful: for the month-to-day conversion, the end of each month
can't be specified by a fixed offset from the start because months vary in
length. Of course it can be found by subtracting 1 from the start of the next
month:
(M + 1).asfreq('D',offset=0) - 1
but just as it's easier to write List[-1] than List[List.length - 1], it's
easier to write
M.asfreq('D',offset=-1)
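To make the month-end arithmetic concrete, here is a rough sketch using NumPy datetime64. The helper name month_to_day and its offset argument are hypothetical, chosen only to mimic the negative-index convention suggested above; they are not an existing or proposed NumPy API.

```python
import numpy as np

months = np.array(["2011-01", "2011-02", "2012-02"], dtype="datetime64[M]")

# Start of the *next* month at daily resolution, minus one day.
next_month_start = (months + np.timedelta64(1, "M")).astype("datetime64[D]")
month_ends = next_month_start - np.timedelta64(1, "D")


def month_to_day(m, offset=0):
    """Hypothetical helper: offset >= 0 counts days from the start of the
    month, offset < 0 counts back from the end (-1 is the last day)."""
    m = np.asarray(m, dtype="datetime64[M]")
    if offset < 0:
        start_of_next = (m + np.timedelta64(1, "M")).astype("datetime64[D]")
        return start_of_next + np.timedelta64(offset, "D")
    return m.astype("datetime64[D]") + np.timedelta64(offset, "D")


last_days = month_to_day(months, offset=-1)
first_days = month_to_day(months, offset=0)
```

The two February entries show why no fixed offset from the start works: the last day is the 28th in 2011 and the 29th in 2012.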
Regards,
Dave