When tackling a supervised machine learning task, the developers of the machine learning solution often divide the labelled examples available to them into three partitions: a training set, a validation set, and a test set. To understand their differences, it is useful to examine how the need for such division can arise during the development process of machine learning solutions.
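As a concrete sketch, such a three-way partition might be produced as follows (a hypothetical helper using NumPy; the function name and the 60/20/20 split fractions are illustrative choices, not prescribed by the text):

```python
import numpy as np

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the labelled examples and partition them into
    training, validation, and test sets (illustrative helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # random order, so each set is representative
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]  # remainder goes to training
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])
```

The key property is that the three sets are disjoint: every labelled example ends up in exactly one partition.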
When developing solutions to a supervised machine learning task, our ultimate goal is usually to find a classification or regression model that produces accurate predictions for unseen inputs.
To achieve this, we start by supplying a learning algorithm with the training set; the learning algorithm then tunes the parameters of our model so that the model makes the fewest mistakes overall when asked to reproduce the correct outputs from the inputs contained in the training set.
At this point, our two main assumptions are that:

1. the trained model makes few mistakes on the examples it was trained on; and
2. the model's good performance on the training data carries over to unseen inputs.
The validity of the first assumption can be verified by observing the in-sample error of the model during the training phase, that is, the number of errors the model makes when applied to the same data it was trained on. To verify the second assumption, we need to evaluate the performance of the trained model on a separate validation set: a collection of labelled examples that were not shown to the model during its training. The validation set effectively serves as the "unseen inputs" for the trained model.
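The two error measurements can be illustrated with a small synthetic regression problem (the data, the least-squares model, and the error thresholds here are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic linear data with a little noise (illustrative only)
X_train = rng.uniform(-1, 1, (80, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 0.1, 80)
X_val = rng.uniform(-1, 1, (20, 1))
y_val = 3.0 * X_val[:, 0] + rng.normal(0, 0.1, 20)

# Fit a least-squares line: parameters chosen to minimise training error
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

def mse(X, y):
    """Mean squared error of the fitted line on (X, y)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((A @ w - y) ** 2))

in_sample_error = mse(X_train, y_train)   # error on the data the model was trained on
out_of_sample_error = mse(X_val, y_val)   # error on held-out validation data
```

A small in-sample error supports the first assumption; a comparably small validation error supports the second.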
In practice, the validation set is used to select the model that exhibits the best ability to generalize to unseen data; more specifically, it is used to identify, among many candidate models, the one with the lowest generalization error (also called the out-of-sample error). Some typical use-cases include:

- choosing hyperparameters (e.g. the regularization strength or the degree of a polynomial model);
- comparing different model families or architectures trained on the same training set;
- deciding when to stop an iterative training procedure (early stopping).
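The selection procedure can be sketched as follows, here choosing the degree of a polynomial model by validation error (the data, the candidate degrees, and the use of NumPy's polynomial fitting are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic nonlinear data (illustrative only)
x_train = rng.uniform(-1, 1, 60)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.1, 60)
x_val = rng.uniform(-1, 1, 40)
y_val = np.sin(2 * x_val) + rng.normal(0, 0.1, 40)

# Each candidate model is fitted on the training set only;
# the validation set decides which candidate generalizes best.
best_degree, best_val_error = None, np.inf
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_error = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    if val_error < best_val_error:
        best_degree, best_val_error = degree, val_error
```

Note that the validation set is only used for measuring errors, never for fitting parameters.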
Together, the training set and the validation set are sometimes referred to as the development set, as they are both used to develop the machine learning solution.
Once the most performant model has been finalized based on the development set, any claims about the performance of the machine learning solution must be reported based on an evaluation of the solution over the test set: another separate set of labelled examples withheld from the entire model development process. In the most rigorous experiments, the test set is kept away from the developers altogether, to prevent them from compromising the results by adapting their models to feedback derived from test-set evaluations.
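The final step can be sketched as follows: the chosen model is refitted on the development data and the test set is touched exactly once to produce the reported figure (the data, the degree-3 choice, and the refitting strategy are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Development data (training + validation) used to fit the final model
x_dev = rng.uniform(-1, 1, 100)
y_dev = np.sin(2 * x_dev) + rng.normal(0, 0.1, 100)
# Held-out test data: never seen during development
x_test = rng.uniform(-1, 1, 50)
y_test = np.sin(2 * x_test) + rng.normal(0, 0.1, 50)

# Suppose validation already selected a degree-3 polynomial (illustrative choice)
final_model = np.polyfit(x_dev, y_dev, 3)

# Evaluated exactly once; this is the number that gets reported
test_error = float(np.mean((np.polyval(final_model, x_test) - y_test) ** 2))
```

If the test error were fed back into further model tweaks, the test set would effectively become part of the development set and the reported figure would no longer be an unbiased estimate of performance on unseen data.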