[SciPy-User] Status of TimeSeries SciKit

Wed Jul 27 13:54:13 EDT 2011

Wes McKinney <wesmckinn <at> gmail.com> writes:

> > Frequency conversion flexibility:
> >    - allow you to specify where to place the value - the start or end of the
> >      period - when converting from lower frequency to higher frequency (eg.
> >      monthly to daily)
> 
> I'll make sure to make this available as an option. When going
> low-to-high you have two interpolation options: forward fill (aka
> "pad") and back fill, which I think is what you're saying?
>

I guess I had a bit of a misunderstanding when I wrote this comment because I
was framing things in the context of how I think about the scikits.timeseries
module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
day information at all. So when converting to daily you need to tell it
where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
decision from wanting to back fill or forward fill.
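
To make the distinction concrete, here is a minimal sketch in plain Python. It
is not the scikits.timeseries or pandas API: the function name
monthly_to_daily and its "position" and "fill" arguments are made up for
illustration, and None stands in for a missing value.

```python
import calendar
from datetime import date, timedelta


def monthly_to_daily(monthly, position="end", fill=None):
    """Convert {(year, month): value} to {date: value}, None meaning missing.

    position: "start" or "end" -- where in the month the value is placed.
    fill:     None, "pad" (forward fill) or "backfill" -- a separate choice.
    """
    # Decision 1: place each monthly value on one day of its period.
    placed = {}
    for (year, month), value in monthly.items():
        last_day = calendar.monthrange(year, month)[1]
        day = 1 if position == "start" else last_day
        placed[date(year, month, day)] = value

    # Build the daily range spanning the first through the last month.
    first, last = min(monthly), max(monthly)
    start = date(first[0], first[1], 1)
    end = date(last[0], last[1], calendar.monthrange(last[0], last[1])[1])
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    daily = {d: placed.get(d) for d in days}

    # Decision 2: optionally fill the gaps between the placed values.
    if fill in ("pad", "backfill"):
        order = days if fill == "pad" else list(reversed(days))
        carried = None
        for d in order:
            if daily[d] is None:
                daily[d] = carried
            else:
                carried = daily[d]
    return daily


monthly = {(2011, 1): 1.0, (2011, 2): 2.0}
end_padded = monthly_to_daily(monthly, position="end", fill="pad")
start_padded = monthly_to_daily(monthly, position="start", fill="pad")
```

With position="end" and fill="pad", the days before Jan 31 stay missing,
whereas position="start" with the same fill covers every day. That is why the
two options are not interchangeable.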

However, since pandas uses regular datetime objects, the day of the month is
already embedded in it. A potential drawback of this approach is that to
support "start of period" stuff you need to add a separate frequency,
effectively doubling the number of frequencies. And if you account for
"business day end of month" and "regular day end of month", then you have to
quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
Similarly for all the quarterly frequencies, annual frequencies, and so on.
Whether this is a major problem in practice or not, I don't know.
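
As a rough illustration of the four monthly variants, here is a sketch in
plain Python. The labels are just the ones I used above, month_anchors is a
made-up helper (not a pandas function), and "business day" is assumed to mean
Monday to Friday with holidays ignored.

```python
import calendar
from datetime import date, timedelta


def month_anchors(year, month):
    """Return the four anchor dates that one monthly period could map to."""
    som = date(year, month, 1)
    eom = date(year, month, calendar.monthrange(year, month)[1])

    # Roll forward past a weekend to get the first business day.
    bsom = som
    while bsom.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        bsom += timedelta(days=1)

    # Roll backward past a weekend to get the last business day.
    beom = eom
    while beom.weekday() >= 5:
        beom -= timedelta(days=1)

    return {"SOM": som, "EOM": eom, "BSOM": bsom, "BEOM": beom}


july = month_anchors(2011, 7)
```

Each of these would need its own frequency if the day is embedded in the
datetime, and the same goes for every quarterly and annual variant.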

> >    - support of a larger number of frequencies
> 
> Which ones are you thinking of? Currently I have:
> 
> - hourly, minutely, secondly (and things like 5-minutely can be done,
> e.g. Minute(5))
> - daily / business daily
> - weekly (anchored on a particular weekday)
> - monthly / business month-end
> - (business) quarterly, anchored on jan/feb/march
> - annual / business annual (start and end)

I think it is missing quarterly frequencies anchored at the other 9 months of
the year. If, for example, you work at a weird Canadian Bank like me, then your
fiscal year end is October.
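
To show what the anchor month changes, here is a small sketch in plain Python.
fiscal_quarter is a hypothetical helper, and it assumes the fiscal year is
labelled by the calendar year in which it ends.

```python
from datetime import date


def fiscal_quarter(d, fy_end_month=10):
    """Return (fiscal_year, quarter) for a fiscal year ending in fy_end_month."""
    # Months elapsed since the start of the fiscal year (0..11).
    offset = (d.month - fy_end_month - 1) % 12
    quarter = offset // 3 + 1
    # Months after the year-end month already belong to the next fiscal year.
    fiscal_year = d.year + (1 if d.month > fy_end_month else 0)
    return fiscal_year, quarter


q_july = fiscal_quarter(date(2011, 7, 27))   # October year end
q_nov = fiscal_quarter(date(2011, 11, 15))   # first month of the next year
```

With an October year end the quarters end in January, April, July and October,
so the quarter boundaries match a January anchor. What differs is which
quarter counts as Q1 and which fiscal year a date falls in.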

Other than that, it has all the frequencies I care about. Semi-annual would be
a nice touch, but it is not that important to me, and the timeseries module
doesn't have it either. People have also asked for higher frequencies in the
timeseries module before (eg. millisecond), but that is not something I
personally care about.

> > Indexing:
> >    - slicing with dates (looks like "truncate" method does this, but would
> >      be nice to be able to just use slicing directly)
> 
> you can use fancy indexing to do this now, e.g:
> 
> ts.ix[d1:d2]
> 
> I could push this down into __getitem__ and __setitem__ too without much work

I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
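
For what it's worth, the kind of behaviour I have in mind looks roughly like
this. It is a toy sketch in plain Python, not how pandas implements it:
TinySeries is a made-up class, dates are assumed unique, and both endpoints of
a date slice are treated as inclusive.

```python
from bisect import bisect_left, bisect_right
from datetime import date


class TinySeries:
    """Toy date-indexed series that accepts date slices in __getitem__."""

    def __init__(self, dates, values):
        pairs = sorted(zip(dates, values), key=lambda p: p[0])
        self.dates = [d for d, _ in pairs]
        self.values = [v for _, v in pairs]

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Translate date endpoints into positions; any step is ignored.
            lo = 0 if key.start is None else bisect_left(self.dates, key.start)
            hi = (len(self.dates) if key.stop is None
                  else bisect_right(self.dates, key.stop))
            return TinySeries(self.dates[lo:hi], self.values[lo:hi])
        if isinstance(key, date):
            i = bisect_left(self.dates, key)
            if i == len(self.dates) or self.dates[i] != key:
                raise KeyError(key)
            return self.values[i]
        raise TypeError("index with a date or a slice of dates")


ts = TinySeries([date(2011, 7, d) for d in (25, 26, 27, 28, 29)],
                [1, 2, 3, 4, 5])
sub = ts[date(2011, 7, 26):date(2011, 7, 28)]
```

A matching __setitem__ could reuse the same bisect lookup to find the
positions to assign to.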

> > - full missing value support (TimeSeries class is a subclass of MaskedArray)
> 
> I challenge you to find a (realistic) use case where the missing value
> support in pandas is inadequate. I'm being completely serious =) But
> I've been very vocal about my dislike of MaskedArrays in the missing
> data discussions. They're hard for (normal) people to use, degrade
> performance, use extra memory, etc. They add a layer of complication
> for working with time series that strikes me as completely
> unnecessary.

From my understanding, pandas just uses NaNs for missing values. That means
missing values in strings, ints, or anything besides floats are not supported,
and that is my major issue with it. I agree that masked arrays are overly
complicated and not ideal. Hopefully the improved missing value support in
numpy will provide the best of both worlds.
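
Here is the limitation I mean, sketched in plain Python rather than with
pandas or numpy.ma. The variable names are only illustrative, and the mask
convention (True means missing) is an assumption borrowed from masked arrays.

```python
import math

# A NaN sentinel works for floats: it lives in the data and is skipped on use.
floats = [1.0, float("nan"), 3.0]
present = [v for v in floats if not math.isnan(v)]
nan_mean = sum(present) / len(present)

# An integer has no NaN value, so the same sentinel cannot be stored as an int.
try:
    int(float("nan"))
    int_can_hold_nan = True
except ValueError:
    int_can_hold_nan = False

# A separate boolean mask marks missing entries for any type, at the cost of
# carrying a second sequence around.
labels = ["a", "", "c"]
mask = [False, True, False]  # True means "missing"
valid_labels = [v for v, m in zip(labels, mask) if not m]
```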

- Matt