[SciPy-User] Matching date lists

Thu Feb 4 10:01:35 EST 2010

On Thu, Feb 4, 2010 at 9:58 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Thu, Feb 4, 2010 at 9:14 AM, Cumberland, Burly
> <Burly.Cumberland at coherent.com> wrote:
>> Hi,
>>
>>
>>
>> I have several datasets which are linked to date/timestamps. I import these
>> in and convert all the dates to python datetime objects. So for instance I
>> might have something like
>>
>>
>>
>> Array of datetime objects with an array of data values associated with it,
>> say A.
>>
>> Another array of datetime objects with several arrays of data values
>> associated, say B, C and D.
>>
>>
>>
>> The time stream is not continuous, i.e. there may be 5-6 days data then
>> nothing for a day then 10 days data. While some of the arrays are at regular
>> sample intervals (frequency) at least one of them has data generated at a
>> higher frequency but with no fixed period. Ideally I would like to associate
>> the closest measurement from this list with the datetime stamp from the
>> first list. Thus if I have
>>
>>
>>
>> array([2009-12-23 13:57:16, 2009-12-23 13:58:15, 2009-12-23 13:59:14,
>>
>>       2009-12-23 14:00:14, 2009-12-23 14:01:13, 2009-12-23 14:02:13,
>>
>>       2009-12-23 14:03:13, 2009-12-23 14:04:12, 2009-12-23 14:05:12,
>>
>>       2009-12-23 14:06:12], dtype=object)
>>
>>
>>
>> and
>>
>>
>>
>> array([2009-12-23 13:57:21, 2009-12-23 13:57:28, 2009-12-23 13:57:37,
>>
>>       2009-12-23 13:57:44, 2009-12-23 13:57:53, 2009-12-23 13:58:02,
>>
>>       2009-12-23 13:58:09, 2009-12-23 13:58:17, 2009-12-23 13:58:25,
>>
>>       2009-12-23 13:58:33], dtype=object)
>>
>>
>>
>> I'd like to tie my values for the 1st and 8th values from the second array
>> to the first two values from the first array (assuming I'm happy that the
>> data is taken within 5 seconds of each other). Thus I'd mask or disregard
>> all the data in the second array set (B, C and D) that isn't measured within
>> a reasonable time period of the first.
>>
>>
>>
>> Clearly I could write a loop and do comparisons and then copy the data or
>> pop it. The problem is there's a quarter of a million items in the data set
>> at the minute and it continues to grow. So I was wondering if anyone can
>> recommend a method or module, I'd a brief look at timeseries and pandas but
>> neither appears to have a tool which resolves this for me.
>>
>>
>>
>> Any suggestions welcome.
>>
>>
>>
>> Regards,
>>
>> Burly.
>>
>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>>
>
> Assuming that measurements in the second array have to happen after
> the timestamps in the first array, I would suggest a searchsorted
> approach:
>
> indexer = first_list.searchsorted(second_list)
>
> closest_dates = first_list.take(indexer - 1) # need date prior-- does
> not deal with date equality though
>
> deltas = (second_list - closest_dates)
>
> mask = deltas < timedelta(seconds=5)
>
> # now do as you wish with the mask and indexer
>
> I think you just need to check that edge cases (beginning and end of
> the arrays) are being handled correctly. If you don't care whether the
> dates in the second list come before or after the ones in the first
> list, you can do a couple searchsorteds and a few more closeness
> comparisons. I would be curious if this works and is sufficiently
> performant.
>
> - Wes
>

I should add that this seems like a case where the numpy datetime
dtype would make a huge performance difference.