Cross-validation is an important part of training and evaluating an ML model. It allows you to get an estimate of how a trained model will perform on new data.
Most people who learn cross validation start with the K-fold approach. I know I did. In K-fold cross validation, the dataset is randomly split into K folds (commonly 5). Over the course of 5 iterations, the model is trained on 4 of the 5 folds while the remaining fold acts as a test set for evaluating performance. This repeats until each of the 5 folds has served as the test set exactly once. By the end, you’ll have 5 error scores which, averaged together, give you your cross validation score.
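As a quick sketch (the dataset and classifier here are placeholders, not anything prescribed above), the whole procedure is a few lines with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data and model, just to demonstrate the mechanics
X, y = load_iris(return_X_y=True)

# 5 splits: each iteration trains on 4 folds and tests on the 5th
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # 5 accuracy scores, one per held-out fold
print(scores.mean())  # averaged: the cross validation score
```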
Here’s the catch though — this method really only works for non-time-series, non-sequential data. If the order of the data matters in any way, or if any data points depend on preceding values, you cannot use K-fold cross validation.
The reason why is fairly straightforward. When you split the data into 4 training folds and 1 testing fold with a shuffled KFold, the order of the data is randomized. Data points that originally preceded others can land in the test set, so when it comes down to it, you’ll be using future data to predict the past.
This is a big no-no.
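You can see the problem concretely on a toy series (the 10-point array and seed are made up for illustration): with a shuffled KFold, the test fold’s indices are scattered through the series instead of sitting after the training indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Row indices double as timestamps: row 0 is the oldest observation
X = np.arange(10).reshape(10, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(kf.split(X))

# The test indices typically fall in the middle of the series,
# so "future" rows end up in the training set
print("train:", sorted(train_idx), "test:", sorted(test_idx))
```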
The way you test your model in development should mimic the way it will run in production.
If you’ll be using past data to predict future data when the model goes to production (as you would be doing with time series), you should be testing your model in development the same way.
This is where TimeSeriesSplit comes in. TimeSeriesSplit, a scikit-learn class, is a self-described “variation of KFold.”
In the kth split, it returns the first k folds as the training set and the (k+1)th fold as the test set.
The main differences between TimeSeriesSplit and KFold are:
- In TimeSeriesSplit, the training dataset gradually increases in size, whereas in KFold the training set is the same size on every split.
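That growing-window behavior is easy to see on a small example (the 12-point series and n_splits=4 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 observations in time order

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)

# The training window grows by one fold each split, and the test
# fold always comes strictly after it:
#   train: [0 1 2 3]             test: [4 5]
#   train: [0 1 2 3 4 5]         test: [6 7]
#   train: [0 1 2 3 4 5 6 7]     test: [8 9]
#   train: [0 1 2 3 4 5 6 7 8 9] test: [10 11]
```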