[SciPy-user] [timeseries] Missing dates

Matt Knox mattknox.ca at gmail.com
Fri Apr 3 21:03:18 EDT 2009


> In the one plotting example (using yahoo finance) I saw that one can
> fill missing dates before plotting so that the missing ones get
> masked.  Though when applying some moving windows functions that
> caused all periods that were affected by the missing values to also
> become masked, which isn't the behaviour I was expecting.  It does
> make sense to do it that way though.
> 
> Obviously it's simple enough to use the original timeseries to
> calculate the moving window functions, or interpolate or something.

You hit the nail on the head here. The timeseries module has no way of knowing
how the user wants masked values handled inside a window, so the sensible thing
to do is mask every result whose window touches a masked value. You can calculate
the moving average on the original series (i.e. before you call
fill_missing_dates), or interpolate the data somehow first (e.g. using
forward_fill), etc.
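
To illustrate the masking behaviour described above, here is a minimal sketch
that uses plain numpy.ma rather than the timeseries module itself. The helper
names masked_moving_average and forward_fill_masked are hypothetical stand-ins
written for this example, not the library's own functions, and the prices are
made up.

```python
import numpy as np
import numpy.ma as ma


def masked_moving_average(series, span):
    """Trailing mean over `span` points (hypothetical helper).

    A result is left masked when its window is incomplete or contains
    at least one masked input.
    """
    data = ma.getdata(series).astype(float)
    mask = ma.getmaskarray(series)
    out = ma.masked_all(len(data), dtype=float)
    for i in range(span - 1, len(data)):
        window = slice(i - span + 1, i + 1)
        if not mask[window].any():
            out[i] = data[window].mean()
    return out


def forward_fill_masked(series):
    """Replace each masked value with the last unmasked value before it
    (hypothetical helper). Leading masked values stay masked."""
    data = ma.getdata(series).astype(float)
    mask = ma.getmaskarray(series).copy()
    for i in range(1, len(data)):
        if mask[i] and not mask[i - 1]:
            data[i] = data[i - 1]
            mask[i] = False
    return ma.array(data, mask=mask)


# Made-up prices with one masked (missing) observation at index 2.
prices = ma.array([10.0, 11.0, 0.0, 13.0, 14.0], mask=[0, 0, 1, 0, 0])

avg_with_gap = masked_moving_average(prices, span=2)
avg_filled = masked_moving_average(forward_fill_masked(prices), span=2)
```

With a span of 2, the single masked input masks two results in avg_with_gap
(indexes 2 and 3), while avg_filled is masked only at index 0, where the window
is incomplete.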

> The question I'm trying to get at though is if I'm going to store my
> timeseries in hdf5 will I fill in the missing dates before I do so, or
> only do that whenever I plot the timeseries?  I'm working with stock
> prices, so the "missing" dates over the weekends will increase file
> size by more than 30%.  Is there any other reason to fill in missing
> dates besides for plotting?

Note that in the example you are talking about, the series is a "BUSINESS"
frequency series:

dates = ts.date_array([q[0] for q in quotes], freq='DAILY').asfreq('BUSINESS')

so calling fill_missing_dates on it adds masked values only for the holidays
that fall on weekdays, not for Saturdays and Sundays.
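
The effect of a business-day calendar can be sketched with the standard library
alone. This is an illustration of the idea, not the timeseries module's
implementation: the business_days helper is hypothetical, and the observed
dates are an assumed example week in which the Friday (10 April 2009, Good
Friday) has no quote.

```python
from datetime import date, timedelta


def business_days(start, end):
    """Return every Monday-to-Friday date from start to end, inclusive
    (hypothetical helper)."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
        current += timedelta(days=1)
    return days


# Assumed example: dates that actually have a quote.
observed = [
    date(2009, 4, 6),
    date(2009, 4, 7),
    date(2009, 4, 8),
    date(2009, 4, 9),
    date(2009, 4, 13),
]

# Dates a business-frequency fill would have to add as masked entries.
observed_set = set(observed)
missing = [
    d for d in business_days(observed[0], observed[-1])
    if d not in observed_set
]
```

Here missing contains only Friday 10 April. The weekend of 11 and 12 April never
appears, because those dates are not part of a business-day calendar in the
first place.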

Whether to fill in the holidays before storing the data is a judgement call. I
generally find it simpler to forward fill all holidays (see the forward_fill
function in the interpolation section of the docs) in an overnight batch job.
That way reports and models don't need special logic for holidays, which can get
somewhat complicated, especially with global data that follows different
calendars. Yes, this can introduce some inaccuracy, but for most use cases I
have found that the gain in simplicity more than outweighs that cost.

- Matt



