Datasets: Dividing the original dataset
All good software engineering projects devote considerable energy to testing their apps. Similarly, we strongly recommend testing your ML model to determine the correctness of its predictions.
Training, validation, and test sets
You should test a model against a different set of examples than those used to train the model. As you'll learn a little later, testing on different examples is stronger proof of your model's fitness than testing on the same set of examples.

Where do you get those different examples? Traditionally in machine learning, you get those different examples by splitting the original dataset. You might assume, therefore, that you should split the original dataset into two subsets:
- A training set that the model trains on.
- A test set for evaluation of the trained model.
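As a minimal sketch of this two-way split, the helper below shuffles the examples and carves off a test set. The function name `split_dataset` and the 80/20 fractions are illustrative choices, not part of the course:

```python
import random

def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples, then split them into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    split_index = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split_index], shuffled[split_index:]

examples = list(range(100))  # stand-in for real labeled examples
train_set, test_set = split_dataset(examples)
```

Shuffling before splitting matters: if the original dataset is ordered (say, by date or by label), taking the last 20% as the test set would give the two subsets different characteristics.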

Dividing the dataset into two sets is a decent idea, but a better approach is to divide the dataset into three subsets. In addition to the training set and the test set, the third subset is:

- A validation set that performs the initial testing on the model as it is being trained.

Use the validation set to evaluate results from the training set. After repeated use of the validation set suggests that your model is making good predictions, use the test set to double-check your model.
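The three-way split can be sketched the same way as the two-way split, holding out two evaluation subsets instead of one. The function name `split_three_ways` and the 80/10/10 fractions here are illustrative assumptions:

```python
import random

def split_three_ways(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the examples, then split them into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test_set = shuffled[:n_test]                 # touched only for the final check
    val_set = shuffled[n_test:n_test + n_val]    # used repeatedly while tuning
    train_set = shuffled[n_test + n_val:]        # used to fit the model
    return train_set, val_set, test_set

train_set, val_set, test_set = split_three_ways(list(range(1000)))
```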
The following figure suggests this workflow. In the figure, "Tweak model" means adjusting anything about the model: changing the learning rate, adding or removing features, or designing a completely new model from scratch.
The workflow shown in Figure 10 is optimal, but even with that workflow, test sets and validation sets still "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confident you can be that the model will make good predictions on new data. For this reason, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.
Additional problems with test sets
Duplicate examples can distort model evaluation. After splitting a dataset into training, validation, and test sets, delete any examples in the validation set or test set that are duplicates of examples in the training set. The only fair test of a model is against new examples, not duplicates.
For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. Suppose you divide the data into training and test sets, with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. You'd probably expect a lower precision on the test set, so you take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set. The problem is that you neglected to scrub duplicate entries for the same spam email from your input database before splitting the data. You've inadvertently trained on some of your test data.
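The deduplication scrub described above can be sketched as a simple filter. This assumes each example can be reduced to a hashable key (here the raw strings themselves); the function name `remove_training_duplicates` and the sample emails are hypothetical:

```python
def remove_training_duplicates(train_set, eval_set):
    """Drop any evaluation example that also appears in the training set."""
    seen = set(train_set)  # hashable representation of each training example
    return [example for example in eval_set if example not in seen]

train = ["spam offer!", "meeting at 3", "lunch?"]
test = ["spam offer!", "quarterly report"]  # first entry leaked from training
clean_test = remove_training_duplicates(train, test)
```

For real email data you would key on a normalized representation (for example, lowercased subject plus body) rather than exact strings, since near-duplicates leak information just as exact ones do.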
In summary, a good test set or validation set meets all of thefollowing criteria:
- Large enough to yield statistically significant testing results.
- Representative of the dataset as a whole. In other words, don't pick a test set with different characteristics than the training set.
- Representative of the real-world data that the model will encounteras part of its business purpose.
- Zero examples duplicated in the training set.
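One lightweight way to check the "representative of the dataset as a whole" criterion is to compare class fractions between splits. The helper below, a sketch with hypothetical names (`label_fractions`, `max_fraction_gap`) and made-up label counts, reports the largest gap in class proportion between two splits:

```python
from collections import Counter

def label_fractions(labels):
    """Map each label to the fraction of examples carrying it."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def max_fraction_gap(train_labels, test_labels):
    """Largest absolute difference in class fraction between the two splits."""
    train_frac = label_fractions(train_labels)
    test_frac = label_fractions(test_labels)
    all_labels = set(train_frac) | set(test_frac)
    return max(abs(train_frac.get(label, 0.0) - test_frac.get(label, 0.0))
               for label in all_labels)

# Both splits are 20% spam here, so the gap is zero.
train_labels = ["spam"] * 20 + ["not_spam"] * 80
test_labels = ["spam"] * 5 + ["not_spam"] * 20
gap = max_fraction_gap(train_labels, test_labels)
```

A large gap suggests the splits were drawn from different slices of the data (for example, split by time rather than at random) and that test-set metrics may not transfer.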
Last updated 2025-12-03 UTC.