[SciPy-User] Matching date lists

Wes McKinney wesmckinn at gmail.com
Thu Feb 4 09:58:00 EST 2010


On Thu, Feb 4, 2010 at 9:14 AM, Cumberland, Burly
<Burly.Cumberland at coherent.com> wrote:
> Hi,
>
>
>
> I have several datasets which are linked to date/timestamps. I import these
> in and convert all the dates to python datetime objects. So for instance I
> might have something like
>
>
>
> Array of datetime objects with an array of data values associated with it,
> say A.
>
> Another array of datetime objects with several arrays of data values
> associated, say B, C and D.
>
>
>
> The time stream is not continuous, i.e. there may be 5-6 days data then
> nothing for a day then 10 days data. While some of the arrays are at regular
> sample intervals (frequency) at least one of them has data generated at a
> higher frequency but with no fixed period. Ideally I would like to associate
> the closest measurement from this list with the datetime stamp from the
> first list. Thus if I have
>
>
>
> array([2009-12-23 13:57:16, 2009-12-23 13:58:15, 2009-12-23 13:59:14,
>
>       2009-12-23 14:00:14, 2009-12-23 14:01:13, 2009-12-23 14:02:13,
>
>       2009-12-23 14:03:13, 2009-12-23 14:04:12, 2009-12-23 14:05:12,
>
>       2009-12-23 14:06:12], dtype=object)
>
>
>
> and
>
>
>
> array([2009-12-23 13:57:21, 2009-12-23 13:57:28, 2009-12-23 13:57:37,
>
>       2009-12-23 13:57:44, 2009-12-23 13:57:53, 2009-12-23 13:58:02,
>
>       2009-12-23 13:58:09, 2009-12-23 13:58:17, 2009-12-23 13:58:25,
>
>       2009-12-23 13:58:33], dtype=object)
>
>
>
> I'd like to tie my values for the 1st and 8th values from the second array
> to the first two values from the first array (assuming I'm happy that the
> data is taken within 5 seconds of each other). Thus I'd mask or disregard
> all the data in the second array set (B, C and D) that isn't measured within
> a reasonable time period of the first.
>
>
>
> Clearly I could write a loop and do comparisons and then copy the data or
> pop it. The problem is there's a quarter of a million items in the data set
> at the minute and it continues to grow. So I was wondering if anyone can
> recommend a method or module, I'd a brief look at timeseries and pandas but
> neither appears to have a tool which resolves this for me.
>
>
>
> Any suggestions welcome.
>
>
>
> Regards,
>
> Burly.
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>

Assuming that measurements in the second array have to happen after
the timestamps in the first array, I would suggest a searchsorted
approach:

indexer = first_list.searchsorted(second_list)

closest_dates = first_list.take(indexer - 1) # need date prior-- does
not deal with date equality though

deltas = (second_list - closest_dates)

mask = deltas < timedelta(seconds=5)

# now do as you wish with the mask and indexer

I think you just need to check that edge cases (beginning and end of
the arrays) are being handled correctly. If you don't care whether the
dates in the second list come before or after the ones in the first
list, you can do a couple searchsorteds and a few more closeness
comparisons. I would be curious if this works and is sufficiently
performant.

- Wes



More information about the SciPy-User mailing list