[scikit-learn] Contribution to sklearn: Cross validation of time series

Andreas Mueller t3kcit at gmail.com
Thu May 4 20:28:00 EDT 2017


Not sure if my internet is bad or the pictures you attached are broken.
So you want min_train_size < test_size, or what's the main use case?
Giving min_train_size and test_size doesn't entirely define the splits, 
though.
How much do you increase the training set in each iteration?
In TimeSeriesSplit min_train_size = test_size = increase.
Indeed you could choose all three of them separately.
Either way, you end up either not using all of your data or with a last 
increase that is smaller than all your other increases. Alternatively, 
you parametrize by the number of iterations, so that the splits line up 
with the dataset size.
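
To make the three knobs concrete, here is a minimal sketch in plain Python. 
The names forward_splits, unused_tail and step are made up for illustration 
and are not scikit-learn API; the sketch only mimics the growing training 
set described above, with "increase" called step.

```python
def forward_splits(n_samples, min_train_size, test_size, step):
    """Yield (train_indices, test_indices) with a growing training set.

    Hypothetical helper: min_train_size is the first training length,
    test_size the length of every test fold, and step the number of
    samples added to the training set per iteration.
    """
    train_end = min_train_size
    # Stop as soon as a full test fold no longer fits in the data.
    while train_end + test_size <= n_samples:
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += step


def unused_tail(n_samples, min_train_size, test_size, step):
    """Count trailing samples that never appear in any test fold."""
    last_test_end = 0
    for _, test in forward_splits(n_samples, min_train_size, test_size, step):
        last_test_end = test[-1] + 1
    return n_samples - last_test_end


# All three equal on 12 samples: test folds 3..5, 6..8, 9..11, nothing left over.
equal = list(forward_splits(12, 3, 3, 3))

# Chosen separately on 14 samples: the last sample is never used.
leftover = unused_tail(14, min_train_size=4, test_size=3, step=3)
```

With 12 samples and all three values set to 3, the folds line up with the 
end of the data. With 14 samples, min_train_size=4 and test_size=step=3, 
one sample at the end is left out, which is the trade-off I mean.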


On 04/28/2017 01:31 PM, andres lago wrote:
>
> Hi Andy,
>
>   sorry, I hit 'send' by accident in the previous message. Thanks 
> for your quick reply. I'll try to be more precise about the CV I'm 
> proposing. Compared to the current implementation (TimeSeriesSplit), 
> these would be the new parameters:
>
>    1- CV mode: rolling window or variable-length window:
>
>       > Rolling window: keeps the same size of CV-training set for all 
> folds, shifting forward at each iteration of CV.
>
>       > Variable-length window: increases the size of the CV-training 
> set at each fold iteration (the current implementation in TimeSeriesSplit).
>
>
>   2-minimum size of CV-training set: Initial size of CV-training set. 
> It's the minimum number of observations required to do the first 
> predictions.
>
>
>   3- size of CV-test set: Size of the CV-test set. It's constant for 
> all folds. Should have the size of the prediction horizon.
>
>
>   The number of folds is no longer required; it's calculated 
> automatically from fields 2 & 3.
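
A rough sketch of how the two modes and the automatic fold count could fit 
together. It is not the proposed implementation and not scikit-learn API: 
the name window_splits and the rolling flag are hypothetical, and it assumes 
the window moves forward by test_size at each fold, which the proposal does 
not state explicitly.

```python
def window_splits(n_samples, min_train_size, test_size, rolling=False):
    """Yield (train_indices, test_indices) for time-series CV.

    Hypothetical helper. Assumes each fold advances by test_size, so the
    fold count follows from the data length, min_train_size and test_size.
    """
    n_folds = (n_samples - min_train_size) // test_size
    for k in range(n_folds):
        test_start = min_train_size + k * test_size
        # Rolling window: fixed-length training set that slides forward.
        # Variable-length window: training set always starts at sample 0.
        train_start = k * test_size if rolling else 0
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# 60 daily observations, 30 days of initial training, 7-day horizon.
expanding = list(window_splits(60, min_train_size=30, test_size=7))
rolling = list(window_splits(60, min_train_size=30, test_size=7, rolling=True))

print([len(train) for train, _ in expanding])  # [30, 37, 44, 51]
print([len(train) for train, _ in rolling])    # [30, 30, 30, 30]
```

Under these assumptions both modes produce the same four test folds 
(starting at samples 30, 37, 44 and 51); only the training sets differ, and 
the last two samples are not used.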
>
>
>   The idea behind this contribution is to cover some common CV use 
> cases that are impossible today with TimeSeriesSplit:
>
>     - Your data shows no seasonality and your dataset is huge, so 
> you'd like to use a rolling window to speed up the CV
>
>     - The client asked for a prediction horizon of 7 days, and you'd 
> like the CV tests to use this horizon
>
>     - The data has strong seasonality, and you want to fit at least 1 
> month of observations before the first prediction in CV
>
>
>   Please find attached some graphics to make the proposal easier to 
> understand.
>
>
>   Regards,
>
>     Andrés
>
>
>
>
>
>
>
> ------------------------------------------------------------------------
> *From:* scikit-learn 
> <scikit-learn-bounces+a_lago=hotmail.com at python.org> on behalf of 
> Andreas Mueller <t3kcit at gmail.com>
> *Sent:* Friday, April 28, 2017 05:48 PM
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Contribution to sklearn: Cross validation 
> of time series
> Hey Andres.
> I think there might be a PR for that.
> Can you explain the minimum size of the training set? How is that used?
> I thought the other main option would be "rolling window" cross validation
> to use a fixed length cv training set.
>
> So the two options to me were rolling window and what we're doing 
> right now.
> Can you elaborate on the other use cases, like minimum size of the 
> training set
> and why you would want the other options with a variable length 
> training set?
>
> Thanks,
> Andy
>
> On 04/27/2017 09:44 AM, andres lago wrote:
>>
>> Hello,
>>
>>   I'd like to contribute new functionality to sklearn: cross 
>> validation of time series. It's an evolution of the current 
>> functionality, implemented by TimeSeriesSplit.
>>
>>
>> TimeSeriesSplit only allows the user to set the number of folds. In 
>> real life, when performing the cross validation of time series, other 
>> parameters are required, for instance:
>>
>>     -minimum size of CV-training set
>>
>>     -size of CV-test set
>>
>>     -fixed or variable length of CV-training set.
>>
>>
>>   The functionality is inspired by the R library 'caret'.
>>
>>
>>   If you agree, I can share my code. I developed it for a project 
>> with the French rail company SNCF. It's in production now.
>>
>>
>>   Regards,
>>
>>     Andres
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 104405 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170504/b1aa01bc/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 48958 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170504/b1aa01bc/attachment-0003.png>