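Time series (or other intrinsically ordered data) can be problematic for standard cross-validation. With randomly assigned folds, the training set can contain observations from after the test period. If some pattern emerges in year 3 and stays for years 4-6, then your model can pick up on it from the later years and look good when tested on year 3, even though the pattern wasn’t part of years 1 & 2.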

An approach that’s sometimes more principled for time series is forward chaining, where your procedure would be something like this:

  • fold 1 : training [1], test [2]
  • fold 2 : training [1 2], test [3]
  • fold 3 : training [1 2 3], test [4]
  • fold 4 : training [1 2 3 4], test [5]
  • fold 5 : training [1 2 3 4 5], test [6]
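The fold layout above can be generated mechanically. Here is a minimal sketch in plain Python; the function name `forward_chaining_folds` and the `years` list are made up for illustration and don't come from any library.

```python
def forward_chaining_folds(blocks):
    """Yield (train, test) pairs; each fold trains on every block before the test block."""
    for i in range(1, len(blocks)):
        # blocks[:i] is everything strictly earlier than the test block blocks[i]
        yield blocks[:i], blocks[i]


# Hypothetical example: six yearly blocks, labeled 1..6 as in the list above
years = [1, 2, 3, 4, 5, 6]
folds = list(forward_chaining_folds(years))

for k, (train, test) in enumerate(folds, start=1):
    print(f"fold {k} : training {train}, test [{test}]")
```

If you use scikit-learn, its `TimeSeriesSplit` splitter produces expanding-window splits of this kind over row indices, so you may not need to roll your own.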

This more accurately mirrors the situation you’ll see at prediction time, where you fit on past data and predict future data. Because each fold trains on more data than the last, it will also give you a sense of how your model’s performance depends on training-set size.
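To see how error relates to training size, you can record both for each fold. The sketch below assumes a toy dataset and a deliberately naive "predict the training mean" baseline; the helper name `evaluate_forward_chaining`, the `mae` function, and the numbers in `data` are all invented for illustration.

```python
from statistics import mean


def evaluate_forward_chaining(series_by_block, fit, predict, error):
    """Return a list of (training_size, test_error), one entry per forward-chaining fold."""
    results = []
    for i in range(1, len(series_by_block)):
        # Flatten all blocks before the test block into one training series
        train = [x for block in series_by_block[:i] for x in block]
        test = series_by_block[i]
        model = fit(train)
        preds = [predict(model) for _ in test]
        results.append((len(train), error(test, preds)))
    return results


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up data: four time-ordered blocks of two observations each
data = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]

# Naive baseline (an assumption, not a recommendation): always predict the training mean
results = evaluate_forward_chaining(data, fit=mean, predict=lambda m: m, error=mae)

for n_train, err in results:
    print(f"train size {n_train}: MAE {err:.2f}")
```

Swapping in your own `fit`, `predict`, and `error` callables gives a rough learning curve. Keep in mind that later folds differ in time period as well as in size, so the two effects are mixed together.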

[Figures: Splitting Time Series Data into Train/Test/Validation Sets]












